Improving question-answering models that use data from tables

{"value":"Question-answering (QA) models sometimes need to retrieve information from tables, which use an entirely different set of semantic cues than free-form text. Historically, most work on table-based QA has concentrated on extracting the contents of a single table cell as the answer to a question.\n\nBut sometimes, the questioner needs more context to make sense of an answer, so recent work on table-based QA has investigated the possibility of embedding tabular data in sentences or sequences of sentences. So far, the most successful models have been end-to-end neural models, which take a question and a table as input and output a free-form answer to the question.\n\nAt this year’s meeting of the Association for the Advancement of Artificial Intelligence (++[AAAI](https://www.amazon.science/conferences-and-events/aaai-2022)++), my colleagues and I are presenting a ++[new approach to training table-based, free-form QA models](https://www.amazon.science/publications/generation-focused-table-based-intermediate-pre-training-for-free-form-question-answering)++ in which the model is pretrained on synthetic data before being fine-tuned on a real QA dataset. We call our model GenTaP, for generation-focused table-based intermediate pretraining.\n\n![image.png](https://dev-media.amazoncloud.cn/84c9d7bca4bd4608a5a50a2d912148fe_image.png)\n\n\n#### **Related content**\n[Automatically evaluating question-answering models](https://www.amazon.science/blog/automatically-evaluating-question-answering-models)\n\nWe pretrain the model on two objectives simultaneously: one is a sentence-style answer to the question, and the other is an answer extracted from a single table cell, often a name or number. In experiments, we compared our model to four previous end-to-end models, on five different metrics, and ours was the top performer across the board, improving on the best prior model by 14%, according to the BLEU metric.\n\n#### **Data augmentation**\n\nThe key to our approach is generating synthetic training data with no human involvement, to make the pretraining pipeline efficient.\n\nTo produce the long-form training examples, we identify online documents that include tables. From those documents, we extract sentences that contain at least two cell values that share a row in the table. Then, using a separate machine learning model, we convert the sentences into questions.\n\n![image.png](https://dev-media.amazoncloud.cn/56709ea007c44187a65b916b54ceaf45_image.png)\n\nExamples of questions generated from texts extracted from online documents containing tables.\n\nAs input, the question generation model takes a sentence and the corresponding entries from the table. To train the model, we used an existing dataset for training reading comprehension models, which consists of questions and document excerpts that provide the information needed to answer them. Except that we invert the relationships between inputs and outputs.\n\n![image.png](https://dev-media.amazoncloud.cn/0a8d248a79d146d6ab3cd4f93b953e46_image.png)\n\n\n#### **Related content**\n\n[Question answering as a \\"lingua franca\\" for transfer learning](https://www.amazon.science/blog/question-answering-as-a-lingua-franca-for-transfer-learning)\n\nThe question generator’s outputs give us sets of data triplets — table, question, and answer — that we can use to pretrain the QA system. The tables are converted into strings with special characters separating rows and appended to the questions as inputs. 
![image.png](https://dev-media.amazoncloud.cn/4c21fe82b5d148bdbd279704f81a7026_image.png)

The grammar we use to generate questions with single-cell-value targets. The variables [column] and [value] are randomly sampled from the table. Other variables (in square brackets) are defined by the grammar itself.

In addition to long-form answers, we also train the model on automatically generated question-answer pairs in which each answer consists of a single cell value from the table. We generate these pairs using a simple grammar — a set of phrase and sentence templates that randomly sample data from the tables to produce new sentences.
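As a rough illustration of this kind of template-based generation, the sketch below fills a pair of invented question templates with a column name and cell values sampled from a table; the actual grammar in the paper is considerably richer.

```python
import random

# Hypothetical templates in the spirit of the grammar described above. The
# {column}, {anchor_column}, and {anchor_value} slots are filled from the table;
# the single-cell answer is the value in the target column of the sampled row.
TEMPLATES = [
    "What is the {column} of the row where {anchor_column} is {anchor_value}?",
    "Which {column} corresponds to {anchor_value}?",
]

def generate_short_form_pair(header, rows, rng=random):
    """Sample one (question, single-cell-value answer) pair from a table."""
    row = rng.choice(rows)
    target_idx, anchor_idx = rng.sample(range(len(header)), 2)
    question = rng.choice(TEMPLATES).format(
        column=header[target_idx],
        anchor_column=header[anchor_idx],
        anchor_value=row[anchor_idx],
    )
    return question, str(row[target_idx])

header = ["Player", "Team", "Points"]
rows = [["A. Smith", "Hawks", 31], ["B. Jones", "Lions", 24]]
print(generate_short_form_pair(header, rows))
# e.g. ('Which Team corresponds to 31?', 'Hawks')
```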
During pretraining, we use equal numbers of long-form and short-form examples. The idea is that the long-form targets improve the coherence of the QA model’s outputs, while the short-form targets improve its factual accuracy. Our experiments showed that omitting the short-form targets during pretraining slightly diminishes the model’s performance on the test set.

The model itself is an encoder-decoder model with two decoders, one for each of the two output objectives.

![image.png](https://dev-media.amazoncloud.cn/ad2da42787314cd78c88466e54d6ca34_image.png)

An example of our training methodology, with a question and a table as inputs and two different output objectives.

#### **Results**

After pretraining our model on the synthetic data, we ran two experiments on it using a hand-annotated QA dataset. In the first, we simply tested the pretrained model on the dataset’s test examples, without further fine-tuning — a zero-shot experiment. In the second experiment, we first fine-tuned our model on the dataset’s training set and then retested it.

As benchmarks, we used four models based on the T5 language model and a fifth model based on the BART language model. We used five evaluation metrics: BLEU, which measures the overlap between the model’s output and the target output in the hand-annotated dataset; three ROUGE metrics (ROUGE-1, ROUGE-2, and ROUGE-L), all of which measure phrasal overlap between output and target; and METEOR, which factors in synonyms and shared word roots when assessing sentence matches.

![image.png](https://dev-media.amazoncloud.cn/61e4959173754dce94294f9717ce80dc_image.png)
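For readers who want to compute the same families of surface-overlap metrics on their own model outputs, the sketch below uses the common open-source packages sacrebleu, rouge-score, and NLTK. It is illustrative scaffolding under those assumptions, not the evaluation code used in the paper.

```python
# pip install sacrebleu rouge-score nltk   (and run nltk.download("wordnet") once)
import sacrebleu
from rouge_score import rouge_scorer
from nltk.translate.meteor_score import meteor_score

predictions = ["a. smith scored 31 points for the hawks"]
references = ["A. Smith scored 31 points for the Hawks."]

# Corpus-level BLEU; sacrebleu takes a list of hypotheses and a list of
# reference streams (here a single stream).
bleu = sacrebleu.corpus_bleu(predictions, [references])
print("BLEU:", bleu.score)

# ROUGE-1, ROUGE-2, and ROUGE-L F-measures, per example.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
for pred, ref in zip(predictions, references):
    scores = scorer.score(ref, pred)
    print({name: round(s.fmeasure, 3) for name, s in scores.items()})

# METEOR; recent NLTK versions expect pre-tokenized inputs.
for pred, ref in zip(predictions, references):
    print("METEOR:", meteor_score([ref.split()], pred.split()))
```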
#### **Related content**

[Teaching computers to answer complex questions](https://www.amazon.science/blog/teaching-computers-to-answer-complex-questions)

Our model was the best performer across the board, with a BLEU score improvement of 14% over the second-best model (the one based on BART) and improvements of 5% to 10% on the other four metrics.

Our zero-shot model outperformed the benchmark built atop the small version of the T5 language model — even though the T5 benchmark was trained on the dataset’s full training set. And the zero-shot model fell just short of the benchmark built atop the base T5 model (also trained on the full training set).

We also tested our pretrained model on a different task: generating domain-specific sentences (not question answers) based on tabular data, with limited numbers of training examples (50 to 500). On that task, our model outperformed two benchmarks based on the GPT language model, indicating that our approach may adapt well to other applications.

ABOUT THE AUTHOR

#### **[Patrick Ng](https://www.amazon.science/author/patrick-ng)**

Patrick Ng is a senior applied scientist with Amazon Web Services.