# Amazon releases new dataset for commonsense dialogue

{"value":"Recently, AI models that can engage in open-domain dialogue have made a lot of progress — the ++[top finishers](https://www.amazon.science/academic-engagements/czech-technical-university-team-wins-alexa-prize-socialbot-grand-challenge-4)++ in the Alexa Prize challenge, for instance.\n\nBut dialogue models still struggle with conversations that require commonsense inferences. For example, if someone says, “I'm going to perform in front of a thousand people tomorrow”, the listener might infer that the speaker is feeling nervous and respond, \"Relax, you'll do great!”\n\nTo aid the research community in the development of commonsense dialogue models, ++[we are publicly releasing](https://github.com/alexa/commonsense-dialogues)++ a large, multiturn, open-domain dialogue dataset that is focused on commonsense knowledge.\n\nOur dataset contains more than 11,000 dialogues we collected with the aid of workers recruited through Amazon Mechanical Turk.\n\nTo create dialogue examples, we provided workers with prompts culled from SocialIQA, a large-scale benchmark for commonsense reasoning about social situations, which is based on the ATOMIC knowledge graph. The prompts are sentences like “Addison wanted to go on a trip to Mexico and messaged all of his friends to set up a schedule” or “Tracy performed her function.”\n\nWe showed each prompt to five people, and we asked them to create multiturn dialogues based on those prompts. On average, each dialogue had 5.7 turns. Below are some sample prompts and dialogues:\n\n![image.png](https://dev-media.amazoncloud.cn/0d38a60055e946b69875eb5ba7a7f392_image.png)\n\nFrom the dialogues, we extracted examples of commonsense inference by using the public commonsense knowledge graph Conceptnet. Conceptnet encodes semantic triples with the structure *<entity1, relationship, entity2>*, such as *<doctor, LocateAt, hospital>* or *<specialist, TypeOf, doctor>*.\\n\\nFrom our candidate dialogues, we kept those in which concepts mentioned in successive dialogue turns were related through Conceptnet triples, as illustrated in the following figure. This reduced the number of dialogues from 25,000 to about 11,000. \\n\\n![image.png](https://dev-media.amazoncloud.cn/aac29519181a4d62953fc06ffaf09841_image.png)\\n\\nOnly dialogues in which words of successive dialogue turns are related by commonsense triples were included in the dataset.\\n\\n#### **Effectiveness study**\\n\\nTo study the impact of commonsense-oriented datasets on dialogue models, we trained a state-of-the-art pre-trained language model, GPT2, using different datasets. One is a combination of existing datasets. The other includes our new dataset and dialogues from existing datasets that we’ve identified as being commonsense-oriented using ConceptNet.\\n\\nTo evaluate the models’ performance, we used two automatic metrics: ROUGE, which measures the overlap between a generated response and a reference response for a given dialogue history, and perplexity, which measures a model’s likelihood of generating the reference response. \\n\\nWe also conducted a human study to evaluate the different models’ outputs on a subset of test dialogues. As we report in a ++[paper](https://www.amazon.science/publications/commonsense-focused-dialogues-for-response-generation-an-empirical-study)++ we presented at SIGDIAL 2021, the model trained using our dataset and commonsense filter data outperformed the baselines on all three measures. 
To create dialogue examples, we provided workers with prompts culled from SocialIQA, a large-scale benchmark for commonsense reasoning about social situations, which is based on the ATOMIC knowledge graph. The prompts are sentences like “Addison wanted to go on a trip to Mexico and messaged all of his friends to set up a schedule” or “Tracy performed her function.”

We showed each prompt to five people and asked them to create multiturn dialogues based on those prompts. On average, each dialogue had 5.7 turns. Below are some sample prompts and dialogues:

![image.png](https://dev-media.amazoncloud.cn/0d38a60055e946b69875eb5ba7a7f392_image.png)

From the dialogues, we extracted examples of commonsense inference by using the public commonsense knowledge graph ConceptNet. ConceptNet encodes semantic triples with the structure *<entity1, relationship, entity2>*, such as *<doctor, LocateAt, hospital>* or *<specialist, TypeOf, doctor>*.

From our candidate dialogues, we kept those in which concepts mentioned in successive dialogue turns were related through ConceptNet triples, as illustrated in the following figure. This reduced the number of dialogues from 25,000 to about 11,000.

![image.png](https://dev-media.amazoncloud.cn/aac29519181a4d62953fc06ffaf09841_image.png)

Only dialogues in which words in successive dialogue turns are related by commonsense triples were included in the dataset.
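As a rough illustration of this filtering step, the sketch below checks whether any pair of content words across two successive turns is connected by a one-hop edge in ConceptNet, using ConceptNet's public HTTP API. The tokenization and matching rules here are simplified stand-ins; the paper's actual pipeline may differ.

```python
# Rough sketch of the ConceptNet-based filter: keep a dialogue only if
# some content word in one turn is linked to a content word in the next
# turn by a ConceptNet triple. Uses ConceptNet's public HTTP API for
# clarity; a production pipeline would query a local copy of the graph.
import requests

def content_words(turn: str) -> set[str]:
    # Toy tokenizer: lowercase, strip punctuation, drop short words.
    return {w.strip(".,!?").lower() for w in turn.split() if len(w) > 3}

def related_in_conceptnet(word_a: str, word_b: str) -> bool:
    """True if ConceptNet has a one-hop edge between the two words."""
    resp = requests.get(
        "https://api.conceptnet.io/query",
        params={"node": f"/c/en/{word_a}", "other": f"/c/en/{word_b}"},
        timeout=10,
    )
    return bool(resp.json().get("edges"))

def turns_connected(turn_a: str, turn_b: str) -> bool:
    """True if any cross-turn word pair shares a commonsense triple."""
    return any(
        related_in_conceptnet(a, b)
        for a in content_words(turn_a)
        for b in content_words(turn_b)
    )

# Example: "doctor" and "hospital" are linked in ConceptNet.
print(turns_connected("I saw the doctor today", "Was it at the hospital?"))
```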
#### **Effectiveness study**

To study the impact of commonsense-oriented datasets on dialogue models, we trained a state-of-the-art pretrained language model, GPT-2, on different datasets. One is a combination of existing datasets. The other includes our new dataset plus dialogues from existing datasets that we identified as commonsense-oriented using ConceptNet.

To evaluate the models' performance, we used two automatic metrics: ROUGE, which measures the overlap between a generated response and a reference response for a given dialogue history, and perplexity, which measures a model's likelihood of generating the reference response.

We also conducted a human study to evaluate the different models' outputs on a subset of test dialogues. As we report in a [paper](https://www.amazon.science/publications/commonsense-focused-dialogues-for-response-generation-an-empirical-study) we presented at SIGDIAL 2021, the model trained on our dataset and the commonsense-filtered data outperformed the baselines on all three measures.
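The snippet below sketches both automatic metrics using common open-source tools (the `rouge-score` package and Hugging Face `transformers`); the paper's exact evaluation setup may differ.

```python
# Sketch of the two automatic metrics. Assumes
# `pip install rouge-score transformers torch`.
import torch
from rouge_score import rouge_scorer
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

history = "I'm going to perform in front of a thousand people tomorrow."
reference = "Relax, you'll do great!"
generated = "Don't worry, you will be great!"

# ROUGE: n-gram overlap between generated and reference responses.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print(scorer.score(reference, generated))

# Perplexity: how likely the model finds the reference response given
# the dialogue history (lower is better). The loss is computed only
# over the response tokens by masking the history with -100 labels.
tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

hist_len = tok(history, return_tensors="pt").input_ids.shape[1]
ids = tok(history + " " + reference, return_tensors="pt").input_ids
labels = ids.clone()
labels[:, :hist_len] = -100
with torch.no_grad():
    loss = model(ids, labels=labels).loss
print("perplexity:", torch.exp(loss).item())
```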
In the paper, we propose an automatic metric that focuses on the commonsensical aspect of response quality. The metric uses a regression model that factors in features such as response length, likelihood scores from neural models such as DialoGPT, and the number of one-hop and two-hop ConceptNet triples that can be found between the dialogue history and the current response turn.

We trained the model on human-evaluation scores for responses generated by the dialogue models used in our experiments. In tests, our metric showed higher correlation with human-annotation scores than either a neural network trained to predict human assessments or a regression model that didn't use ConceptNet features.
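To make the shape of this metric concrete, here is a toy sketch: a ridge regression over a per-response feature vector, fit to human scores. The features and targets below are synthetic stand-ins; real rows would combine the DialoGPT scoring and ConceptNet lookups sketched earlier.

```python
# Toy sketch of the feature-based metric: a regression over response
# features, fit to human quality judgments. The data is synthetic; a
# real row would hold [response length, DialoGPT log-likelihood,
# one-hop ConceptNet triple count, two-hop triple count].
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))  # stand-in feature matrix
y = X @ [0.1, 0.6, 0.2, 0.1] + rng.normal(scale=0.3, size=200)  # stand-in scores

metric = Ridge(alpha=1.0).fit(X, y)
predicted = metric.predict(X)
print("correlation with human scores:", pearsonr(predicted, y)[0])
```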
#### **Looking ahead**

We are happy to release our dataset to help with research in dialogue response generation. We have conducted only preliminary research with the data and hope that the community will use it for research on commonsense dialogue models. There are many interesting research questions to pursue, such as: Do we need to perform commonsense reasoning explicitly for response generation, or can end-to-end models do this implicitly?

Our work on the automatic metric is also just a beginning. We don't yet have a good understanding of how to determine whether a response is appropriate or commonsensical, either from a psycholinguistic or a model-development point of view. We look forward to seeing more advances from the community in these and other related directions.

ABOUT THE AUTHOR

#### **[Yang Liu](https://www.amazon.science/author/yang-liu)**

Yang Liu is a principal applied scientist with Alexa AI.