Using NLU labels to improve an ASR rescoring model

Typically, when someone speaks to a voice agent like Alexa, an automatic speech recognition ([ASR](https://www.amazon.science/tag/asr)) model converts the speech to text. A natural-language-understanding ([NLU](https://www.amazon.science/tag/nlu)) model then interprets the text, giving the agent structured data that it can act on.

Traditionally, ASR systems were pipelined, with separate acoustic models, dictionaries, and language models. The language models encoded word sequence probabilities, which could be used to decide between competing interpretations of the acoustic signal. Because their training data included public texts, the language models encoded probabilities for a large variety of words.

End-to-end ASR models, which take an acoustic signal as input and output word sequences, are far more compact, and overall, they perform as well as the older, pipelined systems did. But they are typically trained on limited data consisting of audio-and-text pairs, so they sometimes struggle with rare words.

The standard way to address this problem is to use a separate language model to rescore the output of the end-to-end model. If the end-to-end model is running on-device, for instance, the language model might rescore its output in the cloud.
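To make the rescoring step concrete, here is a minimal sketch of how a separate language model can re-rank an end-to-end model's n-best hypotheses. The data structure, the log-domain scores, and the interpolation weight `lm_weight` are assumptions made for illustration, not the production system described in this post.

```python
# Minimal sketch of n-best rescoring with a separate language model.
# Everything here (names, scores, weight) is illustrative.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    text: str         # candidate transcription from the end-to-end ASR model
    asr_score: float  # log-probability assigned by the ASR model


def rescore(
    hypotheses: List[Hypothesis],
    lm_log_prob: Callable[[str], float],  # log P(text) under the rescoring LM
    lm_weight: float = 0.5,               # hypothetical interpolation weight
) -> List[Hypothesis]:
    """Re-rank ASR hypotheses by a weighted sum of ASR and LM log-scores."""
    return sorted(
        hypotheses,
        key=lambda h: h.asr_score + lm_weight * lm_log_prob(h.text),
        reverse=True,  # highest combined score first
    )
```

The larger `lm_weight` is, the more the final ranking leans on the language model, which is where knowledge of rare words is expected to come from.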
At this year’s Automatic Speech Recognition and Understanding Workshop ([ASRU](https://www.amazon.science/conferences-and-events/asru-2021)), we presented a paper in which we [propose training the rescoring model](https://www.amazon.science/publications/multi-task-language-modeling-for-improving-speech-recognition-of-rare-words) not only on the standard language model objective — computing word sequence probabilities — but also on tasks performed by the NLU model.

The idea is that adding NLU tasks, for which labeled training data are generally available, can help the language model ingest more knowledge, which will aid in the recognition of rare words. In experiments, we found that this approach could reduce the language model’s error rate on rare words by about 3% relative to a rescoring language model trained in the conventional way and by about 5% relative to a model with no rescoring at all.

Furthermore, we got our best results by pretraining the rescoring model on just the language model objective and then fine-tuning it on the combined objective using a smaller NLU dataset. This allows us to leverage large amounts of unannotated data while still getting the benefit of the multitask learning.

#### **Multitask training**

Our end-to-end ASR model is a recurrent neural network–transducer, a type of network that processes sequential inputs in order. Its output is a set of text hypotheses, ranked according to probability.

Ordinarily, an NLU model performs two principal functions: intent classification and slot tagging. If the customer says, for instance, “Play ‘Christmas’ by Darlene Love”, the intent might be PlayMusic, and the slots SongName and ArtistName would take the values “Christmas” and “Darlene Love”, respectively.

Language models are usually trained on the task of predicting the next word in a sequence, given the words that precede it. The model learns to represent the input words as fixed-length vectors — embeddings — that capture the information necessary to do accurate prediction.

![image.png](https://dev-media.amazoncloud.cn/b81b94ddd46e431f81db930b0b106d11_image.png)

In our multitask training scheme, the same embedding is used for the tasks of intent detection, slot filling, and predicting the next word in a sequence of words.

We feed the language model embeddings to two additional subnetworks, an intent detection network and a slot-filling network. During training, the model learns to produce embeddings optimized for all three tasks — word prediction, intent detection, and slot filling.
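As a rough illustration of that architecture, the sketch below attaches three heads to a shared embedding layer and recurrent encoder: next-word prediction, utterance-level intent detection, and per-token slot filling. The LSTM encoder, the layer sizes, and all of the names are assumptions made for the example, not the exact model from the paper.

```python
import torch
import torch.nn as nn


class MultitaskRescoringLM(nn.Module):
    """Shared embeddings and encoder with word-prediction, intent, and slot heads.
    The LSTM encoder and layer sizes are illustrative assumptions."""

    def __init__(self, vocab_size, num_intents, num_slot_tags,
                 embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.word_head = nn.Linear(hidden_dim, vocab_size)     # next-word prediction (LM)
        self.intent_head = nn.Linear(hidden_dim, num_intents)  # utterance-level intent
        self.slot_head = nn.Linear(hidden_dim, num_slot_tags)  # per-token slot tags

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        hidden, _ = self.encoder(self.embed(token_ids))    # (batch, seq_len, hidden_dim)
        word_logits = self.word_head(hidden)                # next word at every position
        intent_logits = self.intent_head(hidden[:, -1, :])  # intent from the final state
        slot_logits = self.slot_head(hidden)                # slot tag for every token
        return word_logits, intent_logits, slot_logits
```

In a sketch like this, only the word-prediction head would be needed to score a hypothesis at inference time; the intent and slot heads are there to shape the shared embeddings during training, as described next.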
At run time, the additional subnetworks for intent detection and slot filling are not used. The rescoring of the ASR model’s text hypotheses is based on the sentence probability scores computed from the word prediction task (“LM scores” in the figure below).

During training, we had to optimize three objectives simultaneously, and that meant assigning each objective a weight, indicating how much to emphasize it relative to the others.

![image.png](https://dev-media.amazoncloud.cn/2b39c71ee5f44f76a66123cf78096ed0_image.png)

The outputs of our multitask language model are combined with the original outputs of the ASR model and fed to a decoder, which rescores the ASR hypotheses.

We experimented with two techniques for assigning weights. One was a linear method, in which we started the weights of the NLU objectives at zero and incrementally dialed them up. The other was the randomized-weight-majority algorithm, in which each objective’s weight is randomly assigned according to a particular probability distribution. The distributions are adjusted during training, depending on performance. In our experiments, this worked better than the linear method.
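The linear method is simple enough to sketch. Below, the three losses are combined into one training objective, with the NLU weight ramped up from zero over a fixed number of steps; the ramp length, the cross-entropy losses, and the tensor shapes are assumptions made for illustration. The randomized-weight-majority variant would instead sample the weights from probability distributions that are adjusted as training proceeds, which is not shown here.

```python
# Sketch of combining the three objectives with a linear ramp on the NLU
# weights. Shapes, losses, and the ramp length are illustrative assumptions,
# not the paper's exact recipe.

import torch.nn.functional as F


def multitask_loss(word_logits, intent_logits, slot_logits,
                   next_words, intent_labels, slot_labels,
                   step, ramp_steps=10_000):
    """Weighted sum of the word-prediction, intent, and slot losses."""
    # word_logits: (batch, seq_len, vocab), next_words: (batch, seq_len)
    lm_loss = F.cross_entropy(word_logits.flatten(0, 1), next_words.flatten())
    # intent_logits: (batch, num_intents), intent_labels: (batch,)
    intent_loss = F.cross_entropy(intent_logits, intent_labels)
    # slot_logits: (batch, seq_len, num_slot_tags), slot_labels: (batch, seq_len)
    slot_loss = F.cross_entropy(slot_logits.flatten(0, 1), slot_labels.flatten())

    # Linear method: the NLU objectives start at zero weight and are dialed up.
    nlu_weight = min(1.0, step / ramp_steps)
    return lm_loss + nlu_weight * (intent_loss + slot_loss)
```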
The gains our method shows — a 2.6% reduction in word error rate for rare words, relative to a rescoring model built atop an ordinary language model — are not huge, but they do demonstrate the merit of our approach. In ongoing work, we are exploring additional methods to drive the error rate down further.

For instance, we could use the NLU classifications as explicit inputs to the decoder, rather than just as objectives for training the encoder. Or we could use the intent classification to dynamically bias the rescoring results. We are also exploring semi-supervised training techniques, in which we augment the labeled data used to train the NLU subnetworks with larger corpora of automatically labeled data.

ABOUT THE AUTHOR

#### **[Yile Gu](https://www.amazon.science/author/yile-gu)**

Yile Gu is a senior applied scientist in the Alexa AI organization.