Automatically generating text from structured data

海外精选

海外精选的内容汇集了全球优质的亚马逊云科技相关技术内容。同时，内容中提到的“AWS” 是 “Amazon Web Services” 的缩写，在此网站不作为商标展示。

{"value":"Data-to-text generation converts information from a structured format such as a table into natural language. This allows structured information to be read or listened to, as when a device displays a weather forecast or a voice assistant answers a question.\n\nLanguage models trained on billions of sentences learn common linguistic patterns and can generate natural-sounding sentences by predicting likely sequences of words. However, in data-to-text generation we want to generate language that not only is fluent but also conveys content accurately. \n\nSome approaches to data-to-text generation use a pipeline of machine learning models to turn the data into text, but this can be labor intensive to create, and pipelining poses the risk that errors in one step will compound in later steps.\n\nIn the Alexa AI organization, we’ve developed a neural, end-to-end, data-to-text generation system called DataTuner, which can be used for a variety of data types and topics to generate fluent and accurate texts. We've released the DataTuner code on [GitHub](https://github.com/amazon-research/datatuner) under a noncommercial license.\n\n![image.png](https://dev-media.amazoncloud.cn/bd14d67f37b54867bada1d0e8dde860e_image.png)\n\nAlexa AI's new DataTuner software can convert structured information, such as the relationships encoded by knowledge graphs, into texts that are both semantically faithful and fluent.\nCREDIT: GLYNIS CONDON\n\nAt last year’s International Conference on Computational Linguistics (COLING), [we presented a paper](https://www.amazon.science/publications/have-your-text-and-use-it-too-end-to-end-neural-data-to-text-generation-with-semantic-fidelity) in which we compared our approach to its best-performing predecessors, using four data-to-text data sets. On automated metrics, DataTuner pushes the state of the art by significant margins, from 1.2 to 5.9 points according to the BLEU algorithm for evaluating text quality.\n\nHuman annotators also graded our responses as both more natural-sounding and more accurate. In fact, on two of the four data sets, our texts were judged to be more natural-sounding, on average, than human-written texts.\n\nAnnotator evaluations showed that DataTuner improved the semantic accuracy of generated texts, with margins ranging from 5.3% to 40%. Our paper also introduces a model-based approach for measuring the accuracy of generated texts, an approach that is 4.2% to 14.2% more accurate at detecting errors than previous hand-crafted approaches. \n\n#### **Semantic fidelity vs. fluency**\n\nTo get a sense of the problem we address, consider an example in which we have some structured information about Michelle Obama that we want to convey to our readers or listeners. That information is organized in the entity-relation-entity format typical of [knowledge graphs](https://www.amazon.science/tag/knowledge-graphs).\n\nMichelle Obama | author of | Becoming \nMichelle Obama | birthplace | Chicago, Illinois, USA\nPrinceton University | alma mater of | Michelle Obama\nHarvard University | alma mater of | Michelle Obama\n\nWe could imagine a text that conveys the meaning accurately but doesn’t sound very natural:\n\nMichelle Obama is the author of Becoming. Michelle Obama was born in Chicago, Illinois, USA. Michelle Obama was educated at Princeton University. Michelle Obama was educated at Harvard University.\n\nThis text has high semantic fidelity but low fluency.\n\nAlternatively, we could imagine a text that sounds very fluent but doesn’t accurately convey the information: \n\nBorn in Chicago, Illinois, and educated at Harvard, Michelle Obama is the author of A Promised Land. \n\nThis text has added some information and missed some out, so it has low semantic fidelity even though it has high fluency.\n\nPipeline-based approaches to data-to-text generation typically consist of steps such as (1) ordering the content; (2) dividing the content into sentences; (3) finding the right words and phrases to express the data (lexicalization and referring-expression generation), and (4) joining it all together to produce the final text (realization). These approaches usually generalize well to new concepts because of the separate lexicalization step, but they can be difficult to maintain and require training data for each step that can be labor intensive to acquire. \n\nEnd-to-end approaches are trained on [data, text] pairs that can be gathered more easily, but it’s difficult to guarantee the semantic fidelity of the results. This is the problem we address with DataTuner.\n\n#### **The DataTuner model**\n\nDataTuner’s approach has two steps, generation and reranking. \n\nFirst, our language model generates texts from data. In our experiments, we started with a pretrained language model that could generate text, the GPT-2 model. To adapt it to the data-to-text task, we trained it on concatenated data and text, using the special tokens <data> and <text> to indicate which was which. When we use the trained model to generate text, the only input is the data.\n\n![image.png](https://dev-media.amazoncloud.cn/88a077ac5d054947ab33e510e7b5e2a8_image.png)\n\nDuring training, the inputs to DataTuner's data-to-text model are data and text, separated by the special tokens <data> and <text>. At runtime, the only input is the data.\nCREDIT: HAMZA HARKOUS\n\nInside the model, we concatenate several types of embeddings, or vector representations whose spatial relationships indicate relationships between data (see figure above). The first type is token embeddings, which encode semantic information about individual input words. The other is an embedding that represents words’ positions in the text. \n\nWe also introduce what we call fine-grained state embeddings. To produce these, we use special tokens that indicate structural relationships between data items.\n\nFor example, we would convert the data triple Michelle Obama | author of | Becoming into the string <subject> Michelle Obama <predicate> author of <object> Becoming, with <subject>, <object>, and <predicate> as special tokens. The state embedding for any token is that of the special token that most recently precedes it; for example, the token Becoming will get the state embedding of <object>. \n\nSecondly, we train a semantic-fidelity classifier. This takes the input data and a generated text and identifies whether the text accurately conveys the data or whether it adds, repeats, omits, or changes any of the content. We use this to rerank the generated texts according to accuracy. \n\nThe classifier is trained using the same data we used to train our language model. Our original [data, text] pairs give us the examples that are to be classified as accurate. To get inaccurate examples, we use rule-based corruptions of the accurate [data, text] pairs. For example, we could take the training pair (Michelle Obama | author of | Becoming) and “Michelle Obama wrote Becoming” and swap the entities to create the inaccurate [data, text] pair (Michelle Obama | author of | the Gruffalo) and “Michelle Obama wrote Becoming”.\n\nFor this classifier we use the RoBERTA language model with an additional classification layer, an approach that has been successful in other tasks, such as natural-language inference. For each input token (either data or text), we take the token embeddings, positional embeddings, and segment embeddings (embeddings of the tokens that distinguish text and data) and sum these element-wise to provide the input to RoBERTa’s first layer. A final single-layer neural network produces a classification label. \n\n#### **Evaluation**\n\nWe experimented with four different data sets in different formats, including news texts, restaurant reviews, and chats about video games. We evaluated the texts we generated both with automated metrics and by asking human annotators to rate fluency and accuracy via Amazon Mechanical Turk. \n\nIn our experiments, we saw that a model trained without the fine-grained state embeddings is less accurate than a model with them and that adding the semantic-fidelity classifier boosts accuracy further.\n\nWe also examined the cases in which our generated texts were assessed as better than human-written texts, and we suspect that the reason is that our model learned to produce standard formulations, whereas humans sometimes write in non-standard or informal ways that other people might find less fluent.\n\nWe also investigated the use of our semantic-fidelity classifier as a method for automatically evaluating the accuracy of texts generated by different models and found that, for two datasets, it was a significantly better predictor of annotators’ evaluations than existing heuristic approaches.\n\nABOUT THE AUTHOR\n\n#### **Isabel Groves**\n\nIsabel Groves is a computational linguist in the Alexa AI organization.\n","render":"Data-to-text generation converts information from a structured format such as a table into natural language. This allows structured information to be read or listened to, as when a device displays a weather forecast or a voice assistant answers a question.\nLanguage models trained on billions of sentences learn common linguistic patterns and can generate natural-sounding sentences by predicting likely sequences of words. However, in data-to-text generation we want to generate language that not only is fluent but also conveys content accurately.\nSome approaches to data-to-text generation use a pipeline of machine learning models to turn the data into text, but this can be labor intensive to create, and pipelining poses the risk that errors in one step will compound in later steps.\nIn the Alexa AI organization, we’ve developed a neural, end-to-end, data-to-text generation system called DataTuner, which can be used for a variety of data types and topics to generate fluent and accurate texts. We’ve released the DataTuner code on <a href=\"https://github.com/amazon-research/datatuner\" target=\"_blank\">GitHub</a> under a noncommercial license.\n<img src=\"https://dev-media.amazoncloud.cn/bd14d67f37b54867bada1d0e8dde860e_image.png\" alt=\"image.png\" />\nAlexa AI’s new DataTuner software can convert structured information, such as the relationships encoded by knowledge graphs, into texts that are both semantically faithful and fluent. \nCREDIT: GLYNIS CONDON\nAt last year’s International Conference on Computational Linguistics (COLING), <a href=\"https://www.amazon.science/publications/have-your-text-and-use-it-too-end-to-end-neural-data-to-text-generation-with-semantic-fidelity\" target=\"_blank\">we presented a paper</a> in which we compared our approach to its best-performing predecessors, using four data-to-text data sets. On automated metrics, DataTuner pushes the state of the art by significant margins, from 1.2 to 5.9 points according to the BLEU algorithm for evaluating text quality.\nHuman annotators also graded our responses as both more natural-sounding and more accurate. In fact, on two of the four data sets, our texts were judged to be more natural-sounding, on average, than human-written texts.\nAnnotator evaluations showed that DataTuner improved the semantic accuracy of generated texts, with margins ranging from 5.3% to 40%. Our paper also introduces a model-based approach for measuring the accuracy of generated texts, an approach that is 4.2% to 14.2% more accurate at detecting errors than previous hand-crafted approaches.\n<h4><a id=\"Semantic_fidelity_vs_fluency_19\"></a>Semantic fidelity vs. fluency</h4>\nTo get a sense of the problem we address, consider an example in which we have some structured information about Michelle Obama that we want to convey to our readers or listeners. That information is organized in the entity-relation-entity format typical of <a href=\"https://www.amazon.science/tag/knowledge-graphs\" target=\"_blank\">knowledge graphs</a>.\nMichelle Obama | author of | Becoming \nMichelle Obama | birthplace | Chicago, Illinois, USA \nPrinceton University | alma mater of | Michelle Obama \nHarvard University | alma mater of | Michelle Obama\nWe could imagine a text that conveys the meaning accurately but doesn’t sound very natural:\nMichelle Obama is the author of Becoming. Michelle Obama was born in Chicago, Illinois, USA. Michelle Obama was educated at Princeton University. Michelle Obama was educated at Harvard University.\nThis text has high semantic fidelity but low fluency.\nAlternatively, we could imagine a text that sounds very fluent but doesn’t accurately convey the information:\nBorn in Chicago, Illinois, and educated at Harvard, Michelle Obama is the author of A Promised Land.\nThis text has added some information and missed some out, so it has low semantic fidelity even though it has high fluency.\nPipeline-based approaches to data-to-text generation typically consist of steps such as (1) ordering the content; (2) dividing the content into sentences; (3) finding the right words and phrases to express the data (lexicalization and referring-expression generation), and (4) joining it all together to produce the final text (realization). These approaches usually generalize well to new concepts because of the separate lexicalization step, but they can be difficult to maintain and require training data for each step that can be labor intensive to acquire.\nEnd-to-end approaches are trained on [data, text] pairs that can be gathered more easily, but it’s difficult to guarantee the semantic fidelity of the results. This is the problem we address with DataTuner.\n<h4><a id=\"The_DataTuner_model_44\"></a>The DataTuner model</h4>\nDataTuner’s approach has two steps, generation and reranking.\nFirst, our language model generates texts from data. In our experiments, we started with a pretrained language model that could generate text, the GPT-2 model. To adapt it to the data-to-text task, we trained it on concatenated data and text, using the special tokens <data> and <text> to indicate which was which. When we use the trained model to generate text, the only input is the data.\n<img src=\"https://dev-media.amazoncloud.cn/88a077ac5d054947ab33e510e7b5e2a8_image.png\" alt=\"image.png\" />\nDuring training, the inputs to DataTuner’s data-to-text model are data and text, separated by the special tokens <data> and <text>. At runtime, the only input is the data. \nCREDIT: HAMZA HARKOUS\nInside the model, we concatenate several types of embeddings, or vector representations whose spatial relationships indicate relationships between data (see figure above). The first type is token embeddings, which encode semantic information about individual input words. The other is an embedding that represents words’ positions in the text.\nWe also introduce what we call fine-grained state embeddings. To produce these, we use special tokens that indicate structural relationships between data items.\nFor example, we would convert the data triple Michelle Obama | author of | Becoming into the string <subject> Michelle Obama <predicate> author of <object> Becoming, with <subject>, <object>, and <predicate> as special tokens. The state embedding for any token is that of the special token that most recently precedes it; for example, the token Becoming will get the state embedding of <object>.\nSecondly, we train a semantic-fidelity classifier. This takes the input data and a generated text and identifies whether the text accurately conveys the data or whether it adds, repeats, omits, or changes any of the content. We use this to rerank the generated texts according to accuracy.\nThe classifier is trained using the same data we used to train our language model. Our original [data, text] pairs give us the examples that are to be classified as accurate. To get inaccurate examples, we use rule-based corruptions of the accurate [data, text] pairs. For example, we could take the training pair (Michelle Obama | author of | Becoming) and “Michelle Obama wrote Becoming” and swap the entities to create the inaccurate [data, text] pair (Michelle Obama | author of | the Gruffalo) and “Michelle Obama wrote Becoming”.\nFor this classifier we use the RoBERTA language model with an additional classification layer, an approach that has been successful in other tasks, such as natural-language inference. For each input token (either data or text), we take the token embeddings, positional embeddings, and segment embeddings (embeddings of the tokens that distinguish text and data) and sum these element-wise to provide the input to RoBERTa’s first layer. A final single-layer neural network produces a classification label.\n<h4><a id=\"Evaluation_67\"></a>Evaluation</h4>\nWe experimented with four different data sets in different formats, including news texts, restaurant reviews, and chats about video games. We evaluated the texts we generated both with automated metrics and by asking human annotators to rate fluency and accuracy via Amazon Mechanical Turk.\nIn our experiments, we saw that a model trained without the fine-grained state embeddings is less accurate than a model with them and that adding the semantic-fidelity classifier boosts accuracy further.\nWe also examined the cases in which our generated texts were assessed as better than human-written texts, and we suspect that the reason is that our model learned to produce standard formulations, whereas humans sometimes write in non-standard or informal ways that other people might find less fluent.\nWe also investigated the use of our semantic-fidelity classifier as a method for automatically evaluating the accuracy of texts generated by different models and found that, for two datasets, it was a significantly better predictor of annotators’ evaluations than existing heuristic approaches.\nABOUT THE AUTHOR\n<h4><a id=\"Isabel_Groves_79\"></a>Isabel Groves</h4>\nIsabel Groves is a computational linguist in the Alexa AI organization.\n"}

亚马逊云科技解决方案基于行业客户应用场景及技术领域的解决方案

联系亚马逊云科技专家