English-language Alexa voice learns to speak Spanish

{"value":"In 2019, Alexa launched multilingual mode for US English and US Spanish, which lets customers address Alexa in either language and receive responses in that language. To ensure that both the English and Spanish voices had natural-sounding accents, they were based on different voice performers’ recorded speech. Consequently, multilingual mode felt like speaking to two different people.\n\nAlexa demonstrates her new ability speak US Spanish.\n\nNow, the Amazon Text-to-Speech (++[TTS](https://www.amazon.science/tag/text-to-speech)++) team has used deep-learning methods to transfer the ability to speak US Spanish — with native-speaker accent and fluency — to a voice based only on English recordings. Although we are using the technology initially in a bilingual mode, our experiments indicate that it should generalize to multiple languages.\n\nNeural text-to-speech (NTTS) uses neural networks to generate speech directly from phonetic renderings of input texts. In the past few years, the Amazon TTS team has used NTTS to ++[transfer vocal inflections](https://www.amazon.science/blog/neural-text-to-speech-makes-speech-synthesizers-much-more-versatile)++ (prosody) from a recorded voice to a synthesized voice or change the ++[speaking style](https://www.amazon.science/blog/varying-speaking-styles-with-neural-text-to-speech)++ of a synthesized voice, to make it sound more like a newscaster or DJ. In the same way, neural TTS lets us teach an existing voice to speak a new language.\n\nWith traditional TTS systems, the way to do this was to map the phonemes of the target language into equivalent phonemes — the shortest units of speech — in the speaker’s native language. But this resulted in synthesized speech with a heavy foreign accent. Another approach was to find bilingual voice performers and record them speaking both languages, which is not always feasible and limits the number of languages we can combine. Our new multilingual model solves both problems.\n\n\n#### **Shared spaces**\n\n\nWith our new technology, we begin by training a machine learning model on data from multiple speakers in multiple languages. We start with our standard neural TTS platform, which takes a sequence of phonemes as input. We add two additional inputs (shown in blue in the figure below), a language ID code and a speaker embedding, a vector representation that encodes distinctive characteristics of a given speaker’s speech.\n\n![image.png](https://dev-media.amazoncloud.cn/f8fdd18207454addbc5495765b9e9f84_image.png)\n\nThe researchers' multilingual model adds two inputs, speaker embedding and language ID (blue), to Amazon's standard text-to-speech model.\n\nThe phoneme sequence passes to an encoder, whose output is a vector representation that encodes acoustic information about the phonemes. We want this encoder to project acoustically similar phonemes from different languages into the same region of the representation space, regardless of speaker identity or language.\n\nThe phonetic encoding, the language ID, and the speaker embeddings pass through an attention mechanism, which determines which of the input phonemes require particular attention, given the current state of the decoder. The decoder uses the speaker and language embeddings to produce the correct acoustic content for a particular speaker and language. 
The speaker embeddings we use are pretrained on a speaker classification task over a large external corpus. The embeddings of similar speakers cluster together, independent of the language they speak. The system can thus use speaker embeddings to extrapolate how speakers would sound in different languages.

#### **Evaluation**

We evaluated the performance of our model on four axes. First, we measured the naturalness of the output in English, to make sure that we did not degrade the existing experience. Then we measured the system’s naturalness, speaker similarity, and accent quality in Spanish. These three measures ensure that we provide our customers with a high-quality synthetic voice that resembles the original speaker and speaks Spanish with a native accent.

The figure below shows boxplots of our measurements along the four axes, following the MUSHRA (multiple stimuli with hidden reference and anchor) methodology. We evaluated the current English production model (EN Alexa) against the bilingual (Polyglot) model. The plots present (from left to right) the results for naturalness in English, naturalness in Spanish, speaker similarity in Spanish, and accent in Spanish.

![image.png](https://dev-media.amazoncloud.cn/05b565e23d3f4babb082d0646acb00f8_image.png)

The results of the researchers' perception study, comparing the new bilingual model (Polyglot) to voice recordings and to the existing US English and US Spanish models.

In both naturalness evaluations, we used English recordings of the original speaker as a reference. In the English evaluations, the Polyglot system performs slightly worse than the EN Alexa model. We decided that this small regression was acceptable, given the benefits of having a voice that can speak both languages. The Polyglot system achieves similar naturalness scores in English and Spanish.

In the speaker similarity evaluations, we asked listeners to rate how similar the Spanish samples were to a random English recording of the original speaker, and to rate the similarity of the English and Spanish voices in the original Alexa multilingual mode. We also compared the Polyglot system with a version of the EN Alexa model that mapped Spanish phonemes to English phonemes.

Unsurprisingly, the Polyglot Spanish voice sounds much more similar to the English target speaker than the native Spanish speaker from the original multilingual mode does to the native English speaker. The Polyglot voices don’t reach the similarity of the voices produced by phoneme mapping, but this may be because of listener bias toward the English accent.

In the accent evaluations, there is no statistically significant difference between the scores given to the Polyglot system and those given to the Spanish recordings. In other words, Polyglot sounds as native as the Spanish Alexa recordings.
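As an illustration of the kind of check behind such a statement, the sketch below runs a paired Wilcoxon signed-rank test on two sets of MUSHRA-style scores from the same listeners. The listener count, score values, and 5% threshold are assumptions for the example, not the study's data or analysis pipeline.

```python
# Illustrative analysis of MUSHRA-style ratings (synthetic numbers, not the
# study's data): a paired Wilcoxon signed-rank test between two systems
# rated by the same listeners on the accent axis.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
n_listeners = 40
# Hypothetical 0-100 ratings per listener for each system.
spanish_recordings = np.clip(rng.normal(75, 10, n_listeners), 0, 100)
polyglot = np.clip(rng.normal(74, 10, n_listeners), 0, 100)

stat, p_value = wilcoxon(polyglot, spanish_recordings)
print(f"median Polyglot = {np.median(polyglot):.1f}, "
      f"median recordings = {np.median(spanish_recordings):.1f}, "
      f"p = {p_value:.3f}")
if p_value >= 0.05:
    print("No statistically significant difference at the 5% level.")
```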
Overall, we were able to produce a high-quality synthetic voice with a native Spanish accent that was perceived as the same person as the English-speaking voice, without needing an English voice actor to read in Spanish.

This technology may enable Alexa to speak even more languages in the future, as we can make an existing speaker speak a new language without making additional recordings.

ABOUT THE AUTHORS

#### **[Kayoko Yanagisawa](https://www.amazon.science/author/kayoko-yanagisawa)**

Kayoko Yanagisawa is a senior speech scientist in the Amazon Text-to-Speech group.

#### **[Marius Cotescu](https://www.amazon.science/author/marius-cotescu)**

Marius Cotescu is an applied scientist in the Amazon Text-to-Speech group.