How to make AI better at reading comprehension

Machine Learning
Natural Language Processing
{"value":"Question answering through reading comprehension is a popular task in natural-language processing. It’s a task many people know from standardized tests: a student is given a passage and questions based on the passage — say, an article on William the Conqueror and the question “When did William invade England?” The student reads the passage and learns that the answer is 1066. In natural-language processing, we aim to teach machine learning models to do the same thing.\n\n![image.png](https://dev-media.amazoncloud.cn/9ac279ba8921452293f8e1955de9e063_image.png)\n\nIn natural-language understanding, reading comprehension involves finding an excerpt from a text that can serve as an answer to a question about that text.\n\nCREDIT: GLYNIS CONDON\n\nIn recent years, question-answering models have made a lot of progress. In fact, models have started outperforming human baselines on public leaderboards such as ++[SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/)++.\n\nAre the models really learning question answering, or are they learning heuristics that work only in some circumstances? We investigate this question in our paper “++[What do models learn from question answering datasets?](https://www.amazon.science/publications/what-do-models-learn-from-question-answering-datasets)++”, which we’re presenting at the Conference on Empirical Methods in Natural Language Processing (EMNLP).\n\nIn this paper, we subject question-answering models built atop the popular BERT linguistic model to a variety of simple yet informative attacks. We identify shortcomings that cast doubt on the idea that models are really outperforming humans. In particular, we find that\n\n\n#### **(1) Models don’t generalize well**\n\n\nA student who is a good critical reader should be able to answer questions about a variety of articles. A student who can answer questions about William the Conqueror but not Julius Caesar may not have learned reading comprehension —just information about William the Conqueror.\n\nQuestion-answering models do not generalize well across data sets. A model that does well on the SQuAD data set doesn’t do well on the Natural Questions data set, even though both contain questions about Wikipedia articles. This suggests that models can solve individual data sets without necessarily learning reading comprehension more generally.\n\n![image.png](https://dev-media.amazoncloud.cn/98ca90b4f86a4b8db3697801a80a2a3a_image.png)\n\nThis graph shows the performance of a question-answering model trained on SQuAD and evaluated across five other datasets. While the model does well on its own test set (75.6), its performance is lower on other data sets. \n\n\n#### **(2) Models take short cuts**\n\n\nWhen testing question-answering models, we assume that high performance means good understanding of the subject. But tests can be flawed. If a student takes a multiple-choice test where every answer is “C”, it’s hard to judge whether the student really understood the material or exploited the flaw. Similarly, models may be picking up on biases in test questions that let them arrive at the correct answer without doing reading comprehension. \n\nTo probe this, we conducted three experiments. The first was a modification at training time: we corrupted training sets by replacing correct answers with incorrect answers — for instance, “Q: ‘When did William invade England?’ A: ‘William is buried in Caen’”. \n\nThe other two were modifications at test time. 
In recent years, question-answering models have made a lot of progress. In fact, models have started outperforming human baselines on public leaderboards such as ++[SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/)++.

Are the models really learning question answering, or are they learning heuristics that work only in some circumstances? We investigate this question in our paper “++[What do models learn from question answering datasets?](https://www.amazon.science/publications/what-do-models-learn-from-question-answering-datasets)++”, which we’re presenting at the Conference on Empirical Methods in Natural Language Processing (EMNLP).

In this paper, we subject question-answering models built atop the popular BERT language model to a variety of simple yet informative attacks. We identify shortcomings that cast doubt on the idea that models are really outperforming humans. In particular, we find that

#### **(1) Models don’t generalize well**

A student who is a good critical reader should be able to answer questions about a variety of articles. A student who can answer questions about William the Conqueror but not Julius Caesar may not have learned reading comprehension — just information about William the Conqueror.

Question-answering models do not generalize well across data sets. A model that does well on the SQuAD data set doesn’t do well on the Natural Questions data set, even though both contain questions about Wikipedia articles. This suggests that models can solve individual data sets without necessarily learning reading comprehension more generally.

![image.png](https://dev-media.amazoncloud.cn/98ca90b4f86a4b8db3697801a80a2a3a_image.png)

This graph shows the performance of a question-answering model trained on SQuAD and evaluated across five other data sets. While the model does well on its own test set (75.6), its performance is lower on the other data sets.

#### **(2) Models take shortcuts**

When testing question-answering models, we assume that high performance means good understanding of the subject. But tests can be flawed. If a student takes a multiple-choice test where every answer is “C”, it’s hard to judge whether the student really understood the material or exploited the flaw. Similarly, models may be picking up on biases in test questions that let them arrive at the correct answer without doing reading comprehension.

To probe this, we conducted three experiments. The first was a modification at training time: we corrupted training sets by replacing correct answers with incorrect answers — for instance, “Q: ‘When did William invade England?’ A: ‘William is buried in Caen’”.

The other two were modifications at test time. In one, we shuffled the sentences in the input articles so that they no longer formed coherent paragraphs. In the other, we gave models incomplete questions (“When did William?”, “When?”, or no words at all).
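As a rough illustration of the two test-time perturbations (the exact preprocessing in the paper may differ), shuffling context sentences and truncating questions takes only a few lines of Python:

```python
import random

def shuffle_context(context: str, seed: int = 0) -> str:
    """Shuffle the sentences of a passage so it no longer reads coherently.
    Naive splitting on '. ' is used here purely for illustration."""
    sentences = [s for s in context.split(". ") if s]
    random.Random(seed).shuffle(sentences)
    return ". ".join(sentences)

def truncate_question(question: str, keep_words: int) -> str:
    """Keep only the first `keep_words` words of a question,
    e.g., 'When did William invade England?' -> 'When did William?'."""
    words = question.split()
    return " ".join(words[:keep_words]) + ("?" if keep_words > 0 else "")

passage = (
    "William the Conqueror was the first Norman king of England. "
    "He invaded England in 1066. William is buried in Caen."
)
print(shuffle_context(passage))
print(truncate_question("When did William invade England?", 3))  # "When did William?"
print(truncate_question("When did William invade England?", 1))  # "When?"
```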
In all these experiments, the models were suspiciously robust, continuing to return correct answers. In other words, the models didn’t need correct answers at training time, and at test time they didn’t need coherent articles or even complete questions, in order to perform well.

How can this be? It turns out that some questions in some data sets can be answered trivially. In our experiments, for example, one model was just answering all “who” questions with the first proper name in the passage. Simple rules like this can get us to almost 40% of current model baselines.
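A heuristic of this kind is easy to approximate. In the sketch below, “first proper name” is stood in for by a crude capitalization rule; the analysis in the paper may have identified names differently, for example with a named-entity tagger.

```python
import re
from typing import Optional

def first_proper_name(passage: str) -> Optional[str]:
    """Crude stand-in for 'the first proper name in the passage': the first
    capitalized word that does not begin a sentence. A real implementation
    would use a named-entity recognizer instead."""
    tokens = re.findall(r"\S+", passage)
    for i, tok in enumerate(tokens):
        begins_sentence = i == 0 or tokens[i - 1].endswith((".", "!", "?"))
        if tok[0].isupper() and not begins_sentence:
            return tok.strip(".,;:!?")
    return None

def heuristic_answer(question: str, passage: str) -> Optional[str]:
    """Trivial baseline: answer 'who' questions with the first proper name."""
    if question.lower().startswith("who"):
        return first_proper_name(passage)
    return None

passage = "In 1066, William the Conqueror invaded England and defeated Harold at Hastings."
print(heuristic_answer("Who invaded England?", passage))  # 'William'
```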
#### **(3) Models aren’t prepared to handle variations**

![image.png](https://dev-media.amazoncloud.cn/a035db468ff84a26bff053782b089c08_image.png)

This graph shows the performance of a Natural Questions model against various attacks: 50% corrupt, in which half the labeled answers in the training data are wrong; shuffled context, in which the sentences of the test excerpts are out of order; no question, in which the questions in the test data are incomplete; filler words, in which fillers such as “really” and “actually” are added in a syntactically correct way; and negation, in which the negative of the test question is substituted for the positive (“When didn’t William invade England?”). Where we would expect much lower performance in the first three cases, we instead see surprising robustness. Where we would expect to see little change with filler words, we see a drop of almost 7 F1 points. On the negation task, the model answers 94% of questions the same way it did when they were positively framed.

A student should understand that “When did William invade England?”, “When did William march his army into England?”, and “When was England invaded by William?” are all asking the same question. But models can still struggle with this.

We conducted two experiments where we ran variations of questions through reading comprehension models. First, we tried the very simple change of adding filler words to questions (“When did William really invade England?”). In principle, this should have no effect on performance, but we found that it reduces the model’s F1 score — a metric that factors in both false positives and false negatives — by up to 8%.
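For reference, the F1 score in this setting is typically the SQuAD-style token-overlap F1 between the predicted and gold answer spans. A minimal version, leaving out the official evaluation script’s lowercasing and stripping of punctuation and articles, looks like this:

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style token-overlap F1 between a predicted and a gold answer span.
    (The official evaluation script also lowercases and strips punctuation
    and articles; that normalization is omitted here for brevity.)"""
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("1066", "1066"))                       # 1.0
print(token_f1("in 1066", "1066"))                    # ~0.67: partial credit
print(token_f1("William is buried in Caen", "1066"))  # 0.0
```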
Next, we added negation (“When didn’t William invade England?”) to see if models understood the difference between positive and negative questions. We found that models ignore negation up to 94% of the time and return the same answers they would to positive questions.

#### **Conclusions**

Our experiments suggest that models are learning shortcuts rather than performing reading comprehension. While this is disappointing, it can be fixed. We believe that following these five suggestions can lead to better question-answering data sets and evaluation methods in the future:

- **Test for generalizability:** Report performance across multiple relevant data sets to make sure a model is not just solving a single data set;
- **Challenge the models:** Discard questions that can be solved trivially — for example, by always returning the first proper noun;
- **Good performance does not guarantee understanding:** Probe data sets to ensure models are not taking shortcuts;
- **Include variations:** Add variations to existing questions to check model flexibility;
- **Standardize data set formats:** Consider following a standard format when releasing new data sets, as this makes cross-data-set experimentation easier. We offer some help in this regard by ++[releasing code](https://github.com/amazon-research/qa-dataset-converter)++ that converts the five data sets in our experiments into a shared format.

ABOUT THE AUTHOR

#### **[Priyanka Sen](https://www.amazon.science/author/priyanka-sen)**

Priyanka Sen is a computational linguist in the Alexa AI organization.