How Alexa's new Live Translation for conversations works

{"value":"Today, Amazon launched Alexa’s new Live Translation feature, which allows individuals speaking in two different languages to converse with each other, with Alexa acting as an interpreter and translating both sides of the conversation. \n\nWith this new feature, a customer can ask Alexa to initiate a translation session for a pair of languages. Once the session has commenced, customers can speak phrases or sentences in either language. Alexa will automatically identify which language is being spoken and translate each side of the conversation. \n\n![image.png](https://dev-media.amazoncloud.cn/8b0a42bc42804460bb3d2a9d24045984_image.png)\n\nA sample interaction with Alexa Live Translation.\n\nCREDIT: SHIRIN SALEEM\n\nAt launch, the feature will work with six language pairs — English and Spanish, French, German, Italian, Brazilian Portuguese, or Hindi — on Echo devices with locale set to English US.\n\nThe Live Translation feature leverages several existing Amazon systems, including Alexa’s automatic-speech-recognition (++[ASR](https://www.amazon.science/tag/asr)++) system, [Amazon Translate](https://aws.amazon.com/cn/translate/?trk=cndc-detail), and Alexa’s ++[text-to-speech](https://www.amazon.science/tag/text-to-speech)++ system, with the overall architecture and machine learning models designed and optimized for conversational-speech translation.\n\n\n#### **Language ID**\n\n\nDuring a translation session, Alexa runs two ++[ASR](https://www.amazon.science/tag/asr)++ models in parallel, along with a separate model for language identification. Input speech passes to both ASR models at once. Based on the language ID model’s classification result, however, only one ASR model’s output is sent to the translation engine.\n\nThis parallel implementation is necessary to keep the latency of the translation request acceptable, as waiting to begin speech recognition until the language ID model has returned a result would delay the playback of the translated audio. \n\nMoreover, we found that ++[the language ID model](https://arxiv.org/pdf/2006.00703.pdf)++ works best when it bases its decision on both acoustic information about the speech signal and the outputs of both ASR models. The ASR data often helps, for instance, in the cases of non-native speakers of a language, whose speech often has consistent acoustic properties regardless of the language being spoken.\n\nOnce the language ID system has selected a language, the associated ASR output is post-processed and sent to [Amazon Translate](https://aws.amazon.com/cn/translate/?trk=cndc-detail). The resulting translation is passed to Alexa’s text-to-speech system for playback.\n\n\n#### **Speech recognition**\n\n\nLike most ASR systems, the ones we use for live translation include both an acoustic model and a language model. The acoustic model converts audio into phonemes, the smallest units of speech; the language model encodes the probabilities of particular strings of words, which helps the ASR system decide between alternative interpretations of the same sequence of phonemes.\n\nEach of the ASR systems used for Live Translation, like Alexa’s existing ASR models, includes two types of language models: a traditional language model, which encodes probabilities for relatively short strings of words (typically around four), and a ++[neural language model](https://www.amazon.science/blog/how-to-make-neural-language-models-practical-for-speech-recognition)++, which can account for longer-range dependencies. 
Moreover, we found that [the language ID model](https://arxiv.org/pdf/2006.00703.pdf) works best when it bases its decision on both acoustic information about the speech signal and the outputs of both ASR models. The ASR data often helps, for instance, in the cases of non-native speakers of a language, whose speech often has consistent acoustic properties regardless of the language being spoken.

Once the language ID system has selected a language, the associated ASR output is post-processed and sent to [Amazon Translate](https://aws.amazon.com/cn/translate/?trk=cndc-detail). The resulting translation is passed to Alexa’s text-to-speech system for playback.
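Outside of Alexa, this translate-then-speak step can be approximated with the public [Amazon Translate](https://aws.amazon.com/cn/translate/?trk=cndc-detail) and Amazon Polly APIs through boto3. The sketch below is illustrative only: Alexa uses its own text-to-speech system rather than Polly, the `Lupe` voice is just an example, and configured AWS credentials are assumed.

```python
import boto3

translate = boto3.client("translate")  # assumes AWS credentials are configured
polly = boto3.client("polly")          # stand-in for Alexa's own text-to-speech

def speak_translation(transcript: str, source_lang: str, target_lang: str,
                      voice_id: str = "Lupe") -> bytes:
    """Translate a post-processed transcript with Amazon Translate and
    synthesize the result as speech audio."""
    result = translate.translate_text(
        Text=transcript,
        SourceLanguageCode=source_lang,
        TargetLanguageCode=target_lang,
    )
    speech = polly.synthesize_speech(
        Text=result["TranslatedText"],
        OutputFormat="mp3",
        VoiceId=voice_id,
    )
    return speech["AudioStream"].read()

# e.g. audio = speak_translation("How do I get to the train station?", "en", "es")
```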
#### **Speech recognition**

Like most ASR systems, the ones we use for live translation include both an acoustic model and a language model. The acoustic model converts audio into phonemes, the smallest units of speech; the language model encodes the probabilities of particular strings of words, which helps the ASR system decide between alternative interpretations of the same sequence of phonemes.

Each of the ASR systems used for Live Translation, like Alexa’s existing ASR models, includes two types of language models: a traditional language model, which encodes probabilities for relatively short strings of words (typically around four), and a [neural language model](https://www.amazon.science/blog/how-to-make-neural-language-models-practical-for-speech-recognition), which can account for longer-range dependencies. The Live Translation language models were trained to handle more-conversational speech covering a wider range of topics than Alexa’s existing ASR models.

To train our acoustic models, we used connectionist temporal classification (CTC), followed by multiple passes of state-level minimum-Bayes-risk (sMBR) training. To make the acoustic model more robust, we also mixed noise into the training set, enabling the model to focus on characteristics of the input signal that vary less under different acoustic conditions.

#### **Detail work**

Adapting to conversational speech also required modification of Alexa’s [end-pointer](https://www.amazon.science/blog/alexa-scientists-address-challenges-of-end-pointing), which determines when a customer has finished speaking. The end-pointer already distinguishes between pauses at the ends of sentences, indicating that the customer has stopped speaking and that Alexa needs to follow up, and mid-sentence pauses, which may be permitted to go on a little longer. For Live Translation, we modified the end-pointer to tolerate longer pauses at the ends of sentences, as speakers engaged in long conversations will often take time between sentences to formulate their thoughts.

Finally, because [Amazon Translate](https://aws.amazon.com/cn/translate/?trk=cndc-detail)’s neural-machine-translation system was designed to work with textual input, the Live Translation system adjusts for common disfluencies and punctuates and formats the ASR output. This ensures that the inputs to [Amazon Translate](https://aws.amazon.com/cn/translate/?trk=cndc-detail) look more like the written text that it’s used to seeing.
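As a rough illustration of that post-processing, the sketch below strips a few common English filler words and restores basic capitalization and sentence-final punctuation. The actual disfluency and formatting models are more sophisticated; the filler list and rules here are assumptions for the example.

```python
import re

# Example filler words; the production models handle a much richer set
# of disfluencies (restarts, repetitions, and so on).
FILLERS = r"\b(?:uh|um|er|you know|i mean)\b"

def format_for_translation(asr_output: str) -> str:
    """Make a raw, lowercase ASR transcript look more like written text
    before it is sent to the machine translation engine."""
    text = re.sub(FILLERS, "", asr_output, flags=re.IGNORECASE)
    text = re.sub(r"\s+", " ", text).strip()  # collapse leftover spaces
    text = text[:1].upper() + text[1:]        # capitalize the first word
    if text and text[-1] not in ".?!":
        text += "."                           # add terminal punctuation
    return text

print(format_for_translation("um how do i get to the uh train station"))
# -> "How do i get to the train station."
```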
In ongoing work, we’re exploring several approaches to improving the Live Translation feature further. One of these is semi-supervised learning, in which Alexa’s existing models annotate unlabeled data, and we use the highest-confidence outputs as additional training examples for our translation-specific ASR and language ID models.

To improve the fluency of the translation and its robustness to spoken-language input, we are also working on adapting the neural-machine-translation engine to conversational-speech data and generating translations that incorporate relevant context, such as tone of voice or formal versus informal translations. Finally, we are continuously working on improving the quality of the overall translations and of colloquial and idiomatic expressions in particular.

ABOUT THE AUTHORS

#### **[Shirin Saleem](https://www.amazon.science/author/shirin-saleem)**

Shirin Saleem is the senior manager for translation with Alexa AI.

#### **[Roland Maas](https://www.amazon.science/author/roland-maas)**

Roland Maas is an applied-science manager with Alexa Speech.