How to produce factually accurate automatic text summaries


Abstractive summarization is the automatic extraction and recombination of phrases from a text in order to summarize that text. Deep-learning-based abstractive-summarization systems are usually trained to maximize the overlap between the summaries they generate and sample summaries in their training data.

The trouble with this approach is that a summary that overlaps significantly with a target summary may recombine phrases in a factually inaccurate manner. In the example below, which concerns an upcoming boxing match, the summarization model correctly concludes that “has a chink in his armor” summarizes an important aspect of the input text, but it applies the phrase to the wrong boxer:

![image.png](https://dev-media.amazoncloud.cn/f4c23662d67642beae876f06106c5be8_image.png)

Conventional metrics for training abstractive-summarization models don’t account for factual accuracy.

Although abstractive-summarization models have become very good at generating fluent, syntactically correct text, their frequent factual inaccuracy has severely hampered their adoption.

In a [paper](https://www.amazon.science/publications/improving-factual-consistency-of-abstractive-summarization-via-question-answering) we presented at this year’s meeting of the Association for Computational Linguistics ([ACL](https://www.amazon.science/conferences-and-events/acl-ijcnlp-2021)), we describe a new metric for measuring the performance of abstractive-summarization models, one that accounts for factual accuracy. We also describe a methodology for using our metric to train abstractive-summarization models.

Our metric adopts the same general strategy as the earlier QAGS metric, but it’s 55 times as fast to apply, which makes it more practical for model training.

![image.png](https://dev-media.amazoncloud.cn/785845d7e83e4b20be3c441d858b0c3d_image.png)

Our new summary-scoring metric, QUALS (bottom), uses the same strategy as the earlier QAGS (top) but has a simpler architecture, enabling it to generate a score 55 times as quickly.
CREDIT: GLYNIS CONDON

Using QAGS as an evaluation metric, we compared models trained using our approach to models trained using traditional metrics and methodologies, and we found that our approach improved on the best-performing previous models by 15% on one dataset and by 2% on another.

#### **Scoring through question answering**

QAGS (which stands for question answering and generation for summarization) uses a four-step procedure to score a text summary. First, it extracts names and noun phrases from the summary; these are potential answers to potential questions about the summary.

Second, it feeds each extracted name or noun phrase, together with the text of the summary, to a trained question generation model, which produces a question whose answer is that phrase. Third, it feeds each of the generated questions to a trained question-answering model, once accompanied by the summary and once accompanied by the source text.

![image.png](https://dev-media.amazoncloud.cn/24a32f8c90184cf1bec5ab826736cddc_image.png)

QAGS requires the sequential application of three neural models: an answer extraction model, a question generation model, and a question-answering model.
CREDIT: GLYNIS CONDON

In the fourth and final step, the score assesses the similarity between the answers based on the source text and the answers based on the summary. The intuition is that if both the summary and the source text cause the question-answering model to answer the questions in the same way, the summary is factually accurate. If they lead to different answers, then the summary has probably garbled some facts.

By accounting for factual accuracy, QAGS offers a better assessment of summary quality than metrics based on phrasal overlap. But it requires the sequential application of three different deep-learning networks, which is inefficient.

#### **QUALS**

Our approach, which we call QUALS (for question answering with language model score for summarization), reduces the number of models to one, which makes it 55 times as fast as QAGS.

That one model is the joint question-and-answer generation (QAGen) model that members of our group [presented at last year’s ACL](https://www.amazon.science/publications/end-to-end-synthetic-data-generation-for-domain-adaptation-of-question-answering-systems). It takes a text as input and generates question-and-answer pairs pertaining to it.

![image.png](https://dev-media.amazoncloud.cn/b1f7548b8a9945c38ec8dcb4c17dc8c9_image.png)

QUALS requires a single neural model, a question-and-answer generation model.
CREDIT: GLYNIS CONDON

The output of the QAGen model for a given input can be thought of as a huge tree, in which the nodes are words and each edge encodes the likelihood that a particular word will be followed by another word.

For a given summary, we search the resulting tree to produce 60 high-probability question-and-answer pairs. Our search algorithm ensures that we explore diverse paths through the tree, in order to generate a variety of candidate questions and answers. Then we throw out all the question-answer pairs whose answers are not sequences of words found in the summary.

Next, we feed the source text on which the summary is based to the QAGen model. We use the resulting tree to calculate the probabilities of the same question-answer pairs we extracted for the summary. When, for the source text, the probability of generating a particular question-answer pair is small compared to the probability for the summary, the QUALS score will be low. Intuitively, the discrepancy suggests that the question-answer pair was plausible given the summary but not given the source text, indicating factual inconsistency.

![image.png](https://dev-media.amazoncloud.cn/8d14440de82e47e5b679f1b6340703c7_image.png)

Probabilities per token (words and other standalone symbols) of two different question-answer pairs, based on a summary (blue) and an input document (orange). The large probability differences for the answer in the right-hand example give it a much lower QUALS score (-2.615) than the left-hand example (-0.054).

#### **Training methodology**

The QUALS score gives us an efficiently computable measure of a summary’s factual accuracy, but using it to train a machine learning model is not straightforward. Differences in QUALS score can’t simply be back-propagated through the QAGen model to update the summarization model.

So in our paper, we propose contrastive learning as a method for using QUALS to train a summarization model. First, we train a summarization model using the standard approach, which uses maximum-likelihood estimation (MLE) to approximate a phrasal-overlap score.

Next, we use the trained model to generate new summaries for all the source texts in the training data and create two different groups of summaries. One group, S+, contains ground truth summaries that have high QUALS scores (indicating factually accurate summaries); the other, S-, contains generated summaries that have low QUALS scores (indicating factually inaccurate summaries).

Finally, we retrain the summarization model, using a loss function that encourages it to generate summaries like those in S+ and discourages it from generating summaries like those in S-.

#### **Evaluation**

As baselines for the evaluation of our approach, we used two models. One was trained using MLE in the standard way, to fine-tune a BART language model. For the other, we used our contrastive-learning methodology, but instead of using QUALS to evaluate summaries, we used an ensemble of three ROUGE metrics (ROUGE 1, ROUGE 2, and ROUGE L), all of which are based on phrasal overlap.

![image.png](https://dev-media.amazoncloud.cn/113b1399926942b981192a44e2a80154_image.png)

Examples from the human-evaluation study, featuring input texts and summaries produced using both MLE and the ConSeq model, which is trained using QUALS.

In addition to evaluating the models’ performance using QAGS, we evaluated them according to the three ROUGE metrics and FactCC, another model-based metric that simply predicts the factual consistency of two texts. On all five metrics, models trained using QUALS outperformed the two baselines.

For validation, we also conducted a human-evaluation study, which involved 100 summaries generated using QUALS and 100 summaries generated using MLE for each of two datasets (XSUM and CNNDM). Human subjects were asked to compare the summaries on three attributes: factual consistency, informativeness, and grammatical correctness.

On average, annotators found the QUALS-based summaries more factually accurate and more informative than the MLE-based summaries, for both datasets. On grammatical correctness, the two models’ performance was virtually indistinguishable.

![image.png](https://dev-media.amazoncloud.cn/004d45d601e647198e151bb20ae1960f_image.png)

The results of the human-evaluation study. Subjects were asked whether summaries produced using QUALS were better than, worse than, or equal to those produced using MLE, on three axes.

ABOUT THE AUTHORS

#### **[Feng Nan](https://www.amazon.science/author/feng-nan)**

Feng Nan is an applied scientist with Amazon Web Services.

#### **[Ramesh Nallapati](https://www.amazon.science/author/ramesh-nallapati)**

Ramesh Nallapati is a principal applied scientist with Amazon Web Services.
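
As a rough illustration of the QUALS scoring idea described in the article, the sketch below filters question-answer pairs whose answers do not occur in the summary and then compares per-token log-probabilities computed from the summary with those computed from the source text. It assumes the score is the average per-token log-probability gap across pairs, which matches the article's description that a pair far less likely under the source text lowers the score. The function names and the plain-list inputs are hypothetical; in practice the log-probabilities would come from a QAGen model, which is not shown here.

```python
from typing import List, Sequence, Tuple


def answer_in_summary(answer: str, summary: str) -> bool:
    """True if the answer is a contiguous word sequence of the summary.

    Simplification: whitespace tokenization, case-insensitive, no
    punctuation handling.
    """
    a = answer.lower().split()
    s = summary.lower().split()
    if not a:
        return False
    return any(s[i:i + len(a)] == a for i in range(len(s) - len(a) + 1))


def keep_answerable(pairs: Sequence[Tuple[str, str]],
                    summary: str) -> List[Tuple[str, str]]:
    """Drop (question, answer) pairs whose answer is not in the summary."""
    return [(q, a) for q, a in pairs if answer_in_summary(a, summary)]


def quals_score(summary_logprobs: Sequence[Sequence[float]],
                source_logprobs: Sequence[Sequence[float]]) -> float:
    """Average per-token log-probability gap over question-answer pairs.

    summary_logprobs[i][t] is the log-probability of token t of pair i
    when the model reads the summary; source_logprobs[i][t] is the same
    token's log-probability when the model reads the source text.
    """
    if not summary_logprobs or len(summary_logprobs) != len(source_logprobs):
        raise ValueError("need the same non-zero number of pairs on both sides")
    total = 0.0
    for on_summary, on_source in zip(summary_logprobs, source_logprobs):
        if not on_summary or len(on_summary) != len(on_source):
            raise ValueError("each pair needs matching, non-empty token lists")
        # Negative when the pair is less likely under the source text.
        total += (sum(on_source) - sum(on_summary)) / len(on_summary)
    return total / len(summary_logprobs)
```

Under this assumption, a summary whose question-answer pairs are about as likely given the source text as given the summary scores near zero, while one containing a pair that the source text makes very unlikely scores strongly negative.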

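The contrastive training setup described in the article can be sketched in a similar spirit. The code below builds the S+ and S- groups from scored summaries and evaluates one simple loss that rewards high likelihood for S+ and penalizes high likelihood for S-. The thresholds, the `ScoredSummary` type, and this particular loss form are assumptions made for illustration; the loss used in the paper may differ, and a real implementation would compute sequence log-probabilities with the summarization model and back-propagate through it.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class ScoredSummary:
    text: str
    quals: float          # QUALS score of this summary
    is_reference: bool    # True for ground truth, False for model output


def build_contrastive_sets(
    candidates: Sequence[ScoredSummary],
    high: float,  # hypothetical threshold for "high QUALS"
    low: float,   # hypothetical threshold for "low QUALS"
) -> Tuple[List[ScoredSummary], List[ScoredSummary]]:
    """S+ holds high-scoring references; S- holds low-scoring generations."""
    positives = [c for c in candidates if c.is_reference and c.quals >= high]
    negatives = [c for c in candidates if not c.is_reference and c.quals <= low]
    return positives, negatives


def contrastive_loss(pos_logprobs: Sequence[float],
                     neg_logprobs: Sequence[float]) -> float:
    """-mean log p(s+) - mean log(1 - p(s-)), on sequence log-probabilities.

    Illustrative only: decreases as the model assigns more probability to
    S+ summaries and less probability to S- summaries.
    """
    loss = 0.0
    if pos_logprobs:
        loss -= sum(pos_logprobs) / len(pos_logprobs)
    if neg_logprobs:
        # Clamp so that log(1 - p) stays finite when p is close to 1.
        terms = [math.log1p(-min(math.exp(lp), 1.0 - 1e-12))
                 for lp in neg_logprobs]
        loss -= sum(terms) / len(terms)
    return loss
```

Keeping the two groups explicit makes it easy to swap in a different scoring metric, as the ROUGE-based baseline in the article does, without touching the loss.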