New Alexa features: Speaking style adaptation


*Today in Seattle, Dave Limp, Amazon’s senior vice president for devices, unveiled the latest lineup of products and services from his organization. During the presentation, Rohit Prasad, Amazon vice president and Alexa head scientist, described three new advances from the Alexa science team. One of those is speaking style adaptation.*

Alexa’s speech is generated by text-to-speech (TTS) models, which convert the textual outputs of Alexa’s [natural-language-understanding](https://www.amazon.science/tag/nlu) models and [dialogue managers](https://www.amazon.science/tag/dialogue-management) into synthetic speech.

Read Alexa head scientist Rohit Prasad's overview of today's Alexa-related announcements [on Amazon's Day One blog](https://blog.aboutamazon.com/devices/ai-advances-make-alexa-more-natural-conversational-and-useful).

In recent years, Alexa has been using neural TTS, or TTS based on neural networks, which has enabled not only more natural-sounding speech but also much greater versatility. Neural TTS enables Alexa to vary her [speaking style](https://www.amazon.science/blog/varying-speaking-styles-with-neural-text-to-speech) (newscaster or music style, for instance), and it enables us to transfer [prosody](https://www.amazon.science/blog/neural-text-to-speech-makes-speech-synthesizers-much-more-versatile), or inflection patterns, from one voice to another.

In human speech, speaking style and prosody are often a matter of context, and for Alexa’s interactions with customers to be as natural as possible, the same should be true for her. Imagine the following exchange, for instance:

**Customer**: Alexa, play the Village People.

**Alexa**: Do you mean the band, the album, or the song?

A human speaker would naturally emphasize “band”, “album”, and “song”, the words most strongly correlated with missing information.

With speaking style adaptation, Alexa will begin to vary prosodic patterns in the same way, to fit the conversational context. Similarly, she will vary her tone: a cheerful, upbeat tone might fit some contexts, but it could be annoying if Alexa has just failed to complete a request.

![image.png](https://dev-media.amazoncloud.cn/adbec13280f045a7b752551f5a03608b_image.png)

This image depicts our model’s representations (embeddings) of speech samples drawn from data sets with different prosodic characteristics. Points of the same color represent samples from the same data set. The clustering of like-colored points indicates that our model accurately captures information about prosody. Based on context, the speech generator selects a point in this space to define the prosody of the generated speech.

One of the models that enable speaking style adaptation generates alternative phrasings in a context-aware way, so that Alexa does not keep repeating the same question word for word. In one round of conversation, she might say, “Do you mean the song?”; in another, “Should I play the song, then?”; and so on.

Speaking style adaptation thus represents a step in the direction of concept-to-speech, the envisioned successor of text-to-speech, which takes as input a high-level representation of a concept and has considerable latitude in how to convey it, based on context and other signals. For instance, sometimes the same conceptual content can be conveyed by tone of voice, by explicit linguistic formulation, or by both.

Speaking style adaptation depends on state information from the dialogue manager. That information includes the customer’s intent (the action the customer wants performed, such as playing a song) and slot values (the specific entities involved in the action, such as the song name).

It also includes the current conversational state (opening, development, or closing) and the dialogue manager’s current confidence in its understanding of the dialogue state.

First, the state information passes to the speech generator’s rephrasing module, a Transformer-based neural network trained on a large, domain-specific linguistic corpus. Based on the state information, the model produces a list of alternative phrasings.

The rephrasings then pass to another neural network that has been trained to identify “focus words” in each sentence, words that are good candidates for particular emphasis in speech.

![image.png](https://dev-media.amazoncloud.cn/02142e3edc7d40f789b8e5515a10f79a_image.png)

A sample output of the focus word model, assigning different weights (y-axis) to different input words.

The dialogue state information, the rephrasing proposed by the rephrasing module, and the output of the focus word model all pass to another neural network, the articulator, which generates the output speech.

The focus word information, together with the slot information, tells the articulator which words of the input sentence to stress. The confidence scores from the dialogue manager determine the speech style, on a spectrum from low to high excitement.

It’s still day one, however, and we are experimenting with leveraging other contextual information to further customize Alexa’s responses.

**More coverage of Alexa announcements**
- [Interactive teaching by customers](https://www.amazon.science/blog/new-alexa-features-interactive-teaching-by-customers)
- [Natural turn-taking](https://www.amazon.science/blog/change-to-alexa-wake-word-process-adds-natural-turn-taking)
- [The science behind Echo Show 10](https://www.amazon.science/blog/the-science-behind-echo-show-10)

ABOUT THE AUTHOR

#### **[Antonio Bonafonte](https://www.amazon.science/author/antonio-bonafonte)**

Antonio Bonafonte is an applied scientist in the Amazon text-to-speech group.
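
The data flow described in the article (dialogue state, then rephrasing, then focus words, then the articulator) can be illustrated with a toy sketch. Nothing below is Alexa's actual implementation: the neural networks are replaced by hand-written stand-ins, and every name (`DialogueState`, `TEMPLATES`, `rephrase`, `focus_weights`, `articulate`, `respond`, the `media_type` slot, the 0.5 stress threshold) is a hypothetical chosen for illustration. The "articulator" here returns a plan describing what to stress and how excited to sound, not audio.

```python
from dataclasses import dataclass
import string


@dataclass
class DialogueState:
    intent: str        # action the customer wants, e.g. "PlayMusic"
    slots: dict        # entities involved, e.g. {"media_type": "song"}
    phase: str         # "opening", "development", or "closing"
    confidence: float  # dialogue manager's confidence, assumed to lie in [0, 1]


# Hypothetical templates standing in for the learned rephrasing model.
TEMPLATES = {
    "PlayMusic": [
        "Do you mean the {media_type}?",
        "Should I play the {media_type}, then?",
    ],
}


def rephrase(state):
    """Return a list of alternative phrasings for the current state."""
    return [t.format(**state.slots) for t in TEMPLATES[state.intent]]


def focus_weights(sentence, state):
    """Toy focus-word scorer: slot values get a high weight, other words a low one."""
    slot_values = {str(v).lower() for v in state.slots.values()}
    weights = []
    for token in sentence.split():
        word = token.strip(string.punctuation).lower()
        weights.append((token, 1.0 if word in slot_values else 0.1))
    return weights


def articulate(state, sentence, weights, threshold=0.5):
    """Build a synthesis plan (not audio): text, stressed words, excitement level."""
    stressed = [tok.strip(string.punctuation) for tok, w in weights if w >= threshold]
    excitement = min(max(state.confidence, 0.0), 1.0)  # clamp to [0, 1]
    return {"text": sentence, "stress": stressed, "excitement": excitement}


def respond(state, turn):
    """Run the three stages; rotate phrasings so the wording varies per turn."""
    candidates = rephrase(state)
    sentence = candidates[turn % len(candidates)]
    return articulate(state, sentence, focus_weights(sentence, state))


state = DialogueState("PlayMusic", {"media_type": "song"}, "development", 0.4)
for turn in range(2):
    print(respond(state, turn))
```

In this sketch, the slot value "song" is the only stressed word in both phrasings, and the low confidence of 0.4 maps to a subdued excitement level. The `phase` field is carried along but unused, since the article does not say how the conversational state influences the output.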

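
The article says that the speech generator selects a point in the prosody embedding space based on context, but it does not say how. One simple way to picture it, offered purely as an assumption, is to summarize each data set by the centroid of its embeddings and interpolate between two centroids according to the desired excitement level. The two-dimensional vectors, the style labels, and the functions `centroid`, `blend`, and `select_prosody` are all invented for this illustration.

```python
def centroid(points):
    """Mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def blend(a, b, t):
    """Linear interpolation between vectors a (t=0) and b (t=1)."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]


def select_prosody(centroids, excitement):
    """Pick a point between the 'neutral' and 'excited' clusters.

    Assumption: excitement lies in [0, 1]; it is clamped to be safe.
    """
    t = min(max(excitement, 0.0), 1.0)
    return blend(centroids["neutral"], centroids["excited"], t)


# Made-up 2-D embeddings, grouped by the data set they came from.
samples = {
    "neutral": [[0.0, 0.0], [0.2, 0.0]],
    "excited": [[1.0, 1.0], [1.0, 0.8]],
}
centroids = {style: centroid(points) for style, points in samples.items()}

point = select_prosody(centroids, 0.5)
print(point)
```

With an excitement of 0.5, the selected point lies halfway between the two cluster centers. A real system would work in a much higher-dimensional learned space and would likely use a learned selection mechanism instead of linear interpolation.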