How to build highly expressive speech models

{"value":"In June, Alexa announced a new feature called ++[Reading Sidekick](https://www.aboutamazon.com/news/devices/alexas-new-reading-sidekick-makes-learning-to-read-fun)++, which helps kids grow into confident readers by taking turns reading with Alexa, while Alexa provides encouragement and support. To make this an engaging and entertaining experience, the Amazon Text-to-Speech team developed a version of the Alexa voice that speaks more slowly and with more expressivity than the standard, neutral voice.\n\n![image.png](https://dev-media.amazoncloud.cn/7340227d68b64ce989bccb90259ca0c4_image.png)\n\nA child enjoying Reading Sidekick with her panda Echo Dot Kids.\n\nBecause expressive speech is more variable than neutral speech, expressive-speech models are prone to stability issues, such as sudden stoppages or harsh inflections. To tackle this problem, model developers might collect data that represents a dedicated style; but that is costly and time consuming. They might deliver a model that is not based on attention — that is, it doesn’t focus on particular words of prior inputs when determining how to handle the current word. However, attentionless models are more complex, requiring more effort to deploy and often causing additional latency. \n\nOur goal was to develop a highly expressive voice without increasing the burden of either data collection or model deployment. We did this in two ways: by developing new approaches for data preprocessing and by delivering models adapted to expressive speech. We also collaborated closely with user experience (UX) researchers, both before and after building our models.\n\n#### **Comparison of storytelling voices**\n\nAlexa's standard voice\n\nAlexa's new storytelling voice\n\nTo determine what training data to collect, we ran a UX study before the start of the project, in which children and their parents listened to a baseline voice synthesizing narrative passages. The results indicated that a slower speech rate and enhanced expressivity would improve customer experience. When recording training data, we actively controlled both the speaking rate and the expressivity level.\n\nAfter we’d built our models, we did a second UX study and found that, for story reading, subjects preferred our new voice over the standard Alexa voice by a two-to-one margin.\n\n#### **Data curation**\n\nThe instability of highly expressive voice models is due to “extreme prosody”, which is common in the reading of children’s books. Prosody is the rhythm, emphasis, melody, duration, and loudness of speech; adults reading to young children will often exaggerate inflections, change volume dramatically, and extend or shorten the duration of words to convey meaning and hold their listeners’ attention.\n\n![image.png](https://dev-media.amazoncloud.cn/f1ed6df8a9e741928b9d347ce821d3d4_image.png)\n\nThe Reading Sidekick book list screen for the Echo Show.\n\nAlthough we want our dataset to capture a wide range of expressivity, some utterances may be too extreme. We developed a new approach to preprocessing training data that removes such outliers. For each utterance, we calculate the speaker embedding — a vector representation that captures prosodic features of the speaker’s voice. If the distance between a given speaker embedding and the average one is too large, we discard the utterance from the training set.\n\nNext, from each speech sample, we remove segments that cannot be automatically transcribed from audio to text. 
#### **Modeling**

On the modeling side, we use regularization and data augmentation to increase stability. A neural-network-based text-to-speech (NTTS) system consists of two components: (1) a mel-spectrogram generator and (2) a vocoder. The mel-spectrogram generator takes as input a sequence of phones — the shortest phonetic units — and outputs the amplitude of the signal at audible frequencies. It is responsible for the prosody of the voice.

The vocoder adds phase information to the mel-spectrogram to create the synthetic speech signal. Without the phase information, the speech would sound robotic. Our team previously developed a [universal vocoder](https://www.amazon.science/publications/universal-neural-vocoding-with-parallel-wavenet) that works well for this application.

During training, we apply an L2 penalty to the weights of the mel-spectrogram generator; that is, weights that deviate from the average are assessed a penalty, and the penalty varies with the square of the deviation. This is a form of regularization, which reduces overfitting on the recording data.
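The post doesn't give the penalty coefficient or the exact formulation; the snippet below is a minimal PyTorch-style sketch of the most common form of an L2 weight penalty added directly to the generator's training loss (`l2_penalized_loss` and `weight_decay` are illustrative names and values, not those of the production model).

```python
import torch

def l2_penalized_loss(model, base_loss, weight_decay=1e-6):
    """Add an L2 penalty on the generator's weights to the training loss.

    `model` stands in for the mel-spectrogram generator; `weight_decay`
    is an illustrative coefficient, not the value used in the real system.
    """
    l2_term = sum(p.pow(2).sum() for p in model.parameters() if p.requires_grad)
    return base_loss + weight_decay * l2_term
```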
We also use data augmentation to improve the output voice. We add neutral recordings to the training recordings, providing less extreme prosodic trajectories for the model to learn from.

As an additional input, for both types of training data, we provide the model with a style ID, which helps it learn to distinguish the storytelling style from other styles available through Alexa. The combination of recording, processing, and regularization makes the model stable.

![image.png](https://dev-media.amazoncloud.cn/08d7391adfcc45dd92ebae26a4370497_image.png)

The text-to-speech processing pipeline, with style ID as an input.
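The post doesn't describe the generator architecture itself; the toy PyTorch module below only illustrates the conditioning idea: a learned embedding of the style ID is fed to the model alongside the phone sequence, so a single generator can be trained on both the neutral and the storytelling recordings. All class, layer, and parameter names are hypothetical; a vocoder would then convert the predicted mel-spectrogram into a waveform.

```python
import torch
import torch.nn as nn

class StyleConditionedMelGenerator(nn.Module):
    """Toy mel-spectrogram generator conditioned on a style ID.

    Real NTTS generators are far more elaborate sequence-to-sequence models;
    this sketch only shows how a learned style embedding can be concatenated
    with the phone embeddings so one model serves several speaking styles.
    """

    def __init__(self, num_phones, num_styles, dim=256, n_mels=80):
        super().__init__()
        self.phone_embedding = nn.Embedding(num_phones, dim)
        self.style_embedding = nn.Embedding(num_styles, dim)
        self.encoder = nn.GRU(2 * dim, dim, batch_first=True)
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, phone_ids, style_id):
        phones = self.phone_embedding(phone_ids)               # (B, T, dim)
        style = self.style_embedding(style_id)                 # (B, dim)
        style = style.unsqueeze(1).expand(-1, phones.size(1), -1)
        hidden, _ = self.encoder(torch.cat([phones, style], dim=-1))
        return self.to_mel(hidden)                             # (B, T, n_mels)
```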
#### **Evaluation**

To evaluate the Reading Sidekick voice, we asked adult crowdsourced testers which voice they preferred for reading stories to children. The standard Alexa voice was our baseline. We tested 100 short passages with a mean duration of around 15 seconds, each of which was evaluated 30 times by different crowdsourced testers. The testers were native speakers of English; no other constraint was imposed on the tester selection.

The results favor the Reading Sidekick voice by a large margin (61.16% Reading Sidekick vs. 30.46% baseline, with P < .001), particularly considering the very noisy nature of crowdsourced evaluations and the fact that we did not discard any of the data received.
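The post reports only the aggregate percentages and the significance level, not which statistical test was used. As a rough sanity check, a simple two-sided sign test on the decided votes (reconstructed from the reported percentages, so the counts are approximate) already lands far below the reported threshold.

```python
from math import erfc, sqrt

# Rough reconstruction of the preference test: 100 passages x 30 votes each.
# Exact vote counts are not given in the post; these are back-of-the-envelope
# numbers derived from the reported percentages.
total_votes = 100 * 30
prefer_new = round(0.6116 * total_votes)       # ~1835 votes for the new voice
prefer_baseline = round(0.3046 * total_votes)  # ~914 votes for the baseline
decided = prefer_new + prefer_baseline         # "no preference" votes excluded

# Two-sided sign test via a normal approximation to the binomial:
# under the null hypothesis, each decided vote is 50/50 between the voices.
z = (prefer_new - 0.5 * decided) / (0.5 * sqrt(decided))
p_value = erfc(abs(z) / sqrt(2))
print(f"z = {z:.1f}, p = {p_value:.3g}")  # far below .001
```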
Thanks to [Marco Nicolis](https://www.amazon.science/author/marco-nicolis) and [Arnaud Joly](https://www.amazon.science/author/arnaud-joly) for their contributions to this research.

![image.png](https://dev-media.amazoncloud.cn/51a4b077fddb487b867c573a2c7910a7_image.png)

Participants in a user study preferred the new storytelling voice to Alexa's standard voice by a two-to-one margin.

ABOUT THE AUTHOR

#### **[Elena Sokolova](https://www.amazon.science/author/elena-sokolova)**

Elena Sokolova is an applied-science manager in the Amazon Text-to-Speech group.