Amazon releases 51-language dataset for language understanding


Imagine that all people around the world could use voice AI systems such as Alexa in their native tongues.

One promising approach to realizing this vision is massively multilingual natural-language understanding (MMNLU), a paradigm in which a single machine learning model can parse and understand inputs from many typologically diverse languages. By learning a shared data representation that spans languages, the model can transfer knowledge from languages with abundant training data to those in which training data is scarce.

Today we are pleased to make three announcements related to MMNLU.

![下载.jpg](https://dev-media.amazoncloud.cn/9a1cec02fdc14914aeaf0de3c7de1c48_%E4%B8%8B%E8%BD%BD.jpg)

The MASSIVE dataset is a step toward the creation of multilingual natural-language-understanding models that can generalize easily to new languages.

First, we are releasing a [new dataset called MASSIVE](https://github.com/alexa/massive), which is composed of one million labeled utterances spanning 51 languages, along with open-source code, which provides examples of how to perform massively multilingual NLU modeling and allows practitioners to re-create the baseline results for intent classification and slot filling that are presented in our paper.

Second, we are launching a [new competition using the MASSIVE dataset](https://eval.ai/web/challenges/challenge-page/1697/overview) called Massively Multilingual NLU 2022 (MMNLU-22).

And third, we will cohost a [workshop at EMNLP 2022](https://mmnlu-22.github.io/) in Abu Dhabi and online, also called Massively Multilingual NLU 2022, which will highlight the results from the competition and include presentations from invited speakers and oral and poster sessions from submitted papers on multilingual natural-language processing (NLP).

“We are very excited to share this large multilingual dataset with the worldwide language research community,” says Prem Natarajan, vice president of Alexa AI Natural Understanding. “We hope that this dataset will enable researchers across the world to drive new advances in multilingual language understanding that expand the availability and reach of conversational-AI technologies.”

#### **The MASSIVE dataset**

MASSIVE is a parallel dataset, meaning that every utterance is given in all 51 languages. This enables models to learn shared representations of utterances with the same intents, regardless of language, facilitating cross-linguistic training on natural-language-understanding (NLU) tasks. It also allows for adaptation to other NLP tasks such as machine translation, multilingual paraphrasing, new linguistic analyses of imperative morphologies, and more.

NLU, a subdiscipline of NLP, is a machine's ability to understand the meaning of a text and identify the relevant entities. For instance, given the utterance “What is the temperature in New York?”, an NLU model might classify the intent as “weather_query” and recognize relevant entities as “weather_descriptor: temperature” and “place_name: new york.”

Our particular focus is on NLU as a component of spoken-language understanding (SLU), in which audio is converted to text before NLU is performed. Although SLU-based virtual assistants like Alexa have made major capability advances in the past decade, academic and industrial NLU efforts worldwide are still limited to a small subset of the world's 7,000+ languages. One difficulty in creating massively multilingual NLU models is the lack of labeled data for training and evaluation, particularly data that is realistic for a given task and natural for a given language. High naturalness typically requires human vetting, which is often costly.

MASSIVE, short for Multilingual Amazon SLURP (SLU resource package) for Slot Filling, Intent Classification, and Virtual-Assistant Evaluation, contains one million realistic, parallel, labeled virtual-assistant text utterances spanning 51 languages, 18 domains, 60 intents, and 55 slots. MASSIVE was created by tasking professional translators to localize or translate the English-only [SLURP dataset](https://arxiv.org/abs/2011.13205) into 50 typologically diverse languages from 29 genera, including low-resource languages.

![微信截图_20220824113002.png](https://dev-media.amazoncloud.cn/ff1fbdac83b24d10813bed665ec019ad_%E5%BE%AE%E4%BF%A1%E6%88%AA%E5%9B%BE_20220824113002.png)

We have released a paper [describing the dataset](https://arxiv.org/abs/2204.08582) and presenting baseline modeling results on [XLM-R](https://arxiv.org/abs/1911.02116) and [mT5](https://arxiv.org/pdf/2010.11934.pdf) models. Tools for the dataset, as well as the modeling code used for our baseline results, are available in our [GitHub repository](https://github.com/alexa/massive). MASSIVE is licensed under the CC BY 4.0 license, encouraging its broadest possible use across academia and industry.

#### **MMNLU competition and workshop**

The MASSIVE leaderboard and the Massively Multilingual NLU 2022 competition, hosted on eval.ai, are composed of two tasks. In the first, called MMNLU-22-Full, each competitor trains and tests a single model on all 51 languages of the full MASSIVE dataset. In the second task, called MMNLU-22-ZeroShot, each competitor fine-tunes a pretrained model only with English-labeled data and tests it on all 50 non-English languages.

This assesses the model’s ability to generalize to new languages, an important consideration given the number of languages around the world for which there is little-to-no labeled data. Zero-shot learning is a key technology for scaling NLU technology to many more low-resource languages worldwide.

The permanent MASSIVE leaderboard has been launched, and on July 25 the Massively Multilingual NLU 2022 evaluation split will be released. Participants will then have until August 8 to perform inference on the evaluation set and submit their predictions, which will be used to determine the winners. Winners will be invited to give an oral presentation at the Massively Multilingual NLU 2022 workshop.

The Massively Multilingual NLU 2022 workshop is collocated with EMNLP 2022 and will take place on either December 7 or 8, both in person in Abu Dhabi and online. Paper submissions spanning the breadth of multilingualism in NLU are sought, and the first call for papers will be released soon. The workshop will feature speakers on various topics related to multilingualism and NLU, as well as talks from the top performers from the MMNLU-22 competition.

Let’s scale natural-language-understanding technology to every language on Earth. Come build with us!

#### **Acknowledgments**

Jack FitzGerald, Christopher Hench, Charith Peris, Scott Mackie, Kay Rottmann, Ana Sanchez, Aaron Nash, Liam Urbach, Vishesh Kakarala, Richa Singh, Swetha Ranganath, Laurie Crist, Misha Britan, Wouter Leeuwis, Gokhan Tur, and Prem Natarajan for core dataset contributions; Andrew Turner for product and program management; Anna-Karin Johansson for vendor management; Saleh Soltan for text-to-text modeling discussions; Anne Yoder, Zheng Xie, Adeetee Bhide, Misa Sunaga, Trang Doan, and Satyam Dwivedi for program management and language expertise; Wayne Blossom, Brendan Egan, Columbine Marshall, Todd Tieuli, and Augusta Niles for creating the hidden evaluation split of the dataset; Jack FitzGerald, Kay Rottmann, Julia Hirschberg, Anna Rumshisky, and Mohit Bansal for workshop organization; and Charith Peris and Jack FitzGerald for leaderboard and competition setup.

ABOUT THE AUTHOR

#### [Jack FitzGerald](https://www.amazon.science/author/jack-g-m-fitzgerald)

Jack G. M. FitzGerald is a senior applied scientist in Alexa AI's Natural Understanding group.
