Two new approaches to synthesizing speech with appropriate prosody

{"value":"At [ICASSP 2021](https://www.amazon.science/conferences-and-events/icassp-2021), the Amazon text-to-speech team presented two new papers about synthesizing speech from text with contextually appropriate prosody — the rhythm, emphasis, melody, duration, and loudness of speech.\n\nText-to-speech (TTS) is a one-to-many problem where a single piece of text may have more than one appropriate prosodic rendition. Determining the prosody of a piece of text is a non-trivial problem, but it can increase the naturalness of synthesized speech considerably.\n\nThe approaches we describe in the two papers share a general philosophy, but the ways in which they tackle the problem are fundamentally different. \n\n![image.png](https://dev-media.amazoncloud.cn/d1f1c2ba72e64fd8bde8767a7844bc57_image.png)\n\nThe Kathaka architecture, with separate encoders for spectrogram and phoneme sequence inputs.\n\nOne of our papers, “[Prosodic representation learning and contextual sampling for neural text-to-speech](https://www.amazon.science/publications/prosodic-representation-learning-and-contextual-sampling-for-neural-text-to-speech)”, introduces Kathaka, a model trained using a novel two-stage approach. In the first stage, the model learns a distribution of the prosody of all the speech samples in the training data by exploiting a variational learning approach. In stage two, the model learns to sample from this distribution based on semantic and syntactic characteristics of the texts associated with the speech samples.\n\nAccording to listener studies using the industry-standard MUSHRA (multiple stimuli with hidden reference and anchor) methodology, the speech produced by Kathaka improved over the baseline TTS model by 13.2% in terms of naturalness.\n\nThe other paper, “[CAMP: A two-stage approach to modelling prosody in context](https://www.amazon.science/publications/camp-a-two-stage-approach-to-modelling-prosody-in-context)”, introduces CAMP, the context-aware model of prosody. Like Kathaka, CAMP is trained using a two-stage approach. In the first stage, CAMP learns a representation of prosody for every word of each speech sample in the training data in a non-variational way. In stage two, the model learns to predict these learned representations based on the semantic and syntactic characteristics of the associated texts. \n\nAccording listener studies with MUSHRA evaluations, the speech produced by CAMP improved over the baseline TTS model by 26% in terms of naturalness. \n\n#### **Kathaka**\n\nSince TTS is a one-to-many problem, where the same text can be said in different ways, TTS models often synthesize speech with neutral prosody. This decreases the naturalness of synthesized speech, as there is no relation between the prosody and what is being said. \n\nKathaka’s two-stage learning approach tackles this problem by exploiting the semantics and syntax of the text. The Kathaka architecture has two encoders: one, the reference encoder, takes a mel-spectrogram (a snapshot of the frequency spectrum) of the speech signal as input; the other takes the associated text, represented as a sequence of phonemes, the smallest units of speech. \n\nBased on the mel-spectrogram, the reference encoder outputs the parameters of a prosody distribution (the mean and variance, µ and σ in the diagram above), and a sample is selected from that distribution. This sample, along with the phoneme encoding, is used to synthesize a new mel-spectrogram. 
At inference time, of course, mel-spectrograms aren’t available as input; they are what we want to synthesize. Thus, in stage two, we train “Samplers”, which predict the parameters of the prosody distribution directly from the text.

![image.png](https://dev-media.amazoncloud.cn/3b2997e1126546a495df1314346a1d53_image.png)

The architecture of the Kathaka sampler, which factors in both semantic information from BERT embeddings and syntactic information from parse trees.

To encode the text, we use a BERT model, which is pretrained to provide contextual word embeddings (representations of words as vectors in a multidimensional space) that capture semantic and some syntactic information about the text. We also apply graph neural networks to syntax parse trees of the text, to produce representations of the texts’ syntactic information alone.

From these representations, the Sampler learns to predict the parameters of the prosody distribution. At inference time, a sample from this distribution is used in place of the sample from the reference encoder to synthesize the mel-spectrogram.
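A minimal sketch of such a Sampler head follows, assuming the semantic and syntactic features have already been computed upstream; the pooled-BERT and parse-tree-summary inputs here are stand-ins, and all dimensions are illustrative.

```python
# A minimal sketch of a Sampler head that predicts the prosody
# distribution's parameters from precomputed text features.
import torch
import torch.nn as nn

class Sampler(nn.Module):
    def __init__(self, sem_dim: int = 768, syn_dim: int = 64, z_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(sem_dim + syn_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, z_dim)
        self.to_logvar = nn.Linear(256, z_dim)

    def forward(self, sem, syn):
        # sem: semantic features (e.g., pooled BERT embeddings);
        # syn: syntactic features (e.g., a graph-network parse-tree summary)
        h = self.net(torch.cat([sem, syn], dim=-1))
        return self.to_mu(h), self.to_logvar(h)

# At inference, sample a prosody vector from the predicted distribution
# and hand it to the decoder in place of the reference encoder's sample.
sampler = Sampler()
sem = torch.randn(1, 768)   # stand-in for a pooled BERT embedding
syn = torch.randn(1, 64)    # stand-in for a parse-tree summary
mu, logvar = sampler(sem, syn)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
```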
To evaluate the efficacy of Kathaka, we compared it to our neural-text-to-speech (NTTS) baseline and showed a statistically significant 13.2% increase in naturalness.

#### **CAMP**

CAMP uses a similar two-stage approach to training, but instead of learning a distribution over prosodies, it learns specific mappings between individual words and prosodic representations, conditioned on semantic and syntactic features of the text.

In stage one, CAMP learns word-level representations of prosody using a word-level reference encoder. This encoder takes a mel-spectrogram as input and produces a word-level representation of the speech sample’s prosody. This word-level representation is then aligned with the phonemes that constitute the word, which, again, are encoded by a separate encoder. Both sets of features are then used to synthesize a mel-spectrogram as output, and the training target is the same mel-spectrogram that the reference encoder took as input. Through this process, CAMP learns word-level prosodic representations.
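One way to picture the word-level alignment is the sketch below: frame-level reference-encoder features are pooled into one prosody vector per word, and each word’s vector is then repeated across that word’s phonemes. The mean-pooling choice, the given word/frame spans (e.g., from forced alignment), and all shapes are assumptions for illustration.

```python
# A minimal sketch of word-level prosody pooling and phoneme alignment.
import torch

def word_level_prosody(frame_feats, word_frame_spans, word_phone_counts):
    """frame_feats: (frames, dim); word_frame_spans: [(start, end)] per word;
    word_phone_counts: number of phonemes in each word."""
    per_word = torch.stack(
        [frame_feats[s:e].mean(dim=0) for s, e in word_frame_spans]
    )  # (words, dim): one prosody vector per word
    # Repeat each word's vector once per phoneme so it can be combined
    # with the phoneme encoder's outputs.
    per_phone = per_word.repeat_interleave(
        torch.tensor(word_phone_counts), dim=0
    )  # (total_phonemes, dim)
    return per_word, per_phone

feats = torch.randn(120, 64)                # 120 frames of encoder features
per_word, per_phone = word_level_prosody(
    feats, [(0, 40), (40, 75), (75, 120)],  # three words' frame spans
    [3, 5, 4],                              # their phoneme counts
)
print(per_word.shape, per_phone.shape)      # (3, 64) (12, 64)
```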
![image.png](https://dev-media.amazoncloud.cn/658384ed3cb54a10b6785b2a229353da_image.png)

During training (left), CAMP, like Kathaka, learns a prosody representation through the reference encoder and learns to predict that representation from the syntactic and semantic content of the input text. At inference, it replaces the prosody representations from the reference encoder (center) with representations predicted from the syntactic and semantic content of the input text (right).

In stage two, CAMP uses semantic and syntactic information from the input texts to predict the word-level prosody representations learned in stage one. To encode the text, we again use BERT embeddings, and we also use word-level syntax tags such as (1) part of speech (POS); (2) word class (“open” classes such as nouns and verbs, which admit new members indefinitely, versus “closed” classes such as pronouns and articles, whose membership is fixed and limited); (3) noun structure; and (4) punctuation structure. This information is then used to predict the word-level prosody representations learned in stage one.
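The sketch below shows one plausible shape for this stage-two predictor: each word’s BERT embedding is concatenated with embeddings of its tags, and an MLP regresses the word-level prosody representation learned in stage one. The tag inventories, layer sizes, and the use of only POS and word-class tags are illustrative assumptions, not CAMP’s exact feature set.

```python
# A minimal sketch of a word-level prosody predictor (regression,
# non-variational, unlike Kathaka's Sampler).
import torch
import torch.nn as nn

class ProsodyPredictor(nn.Module):
    def __init__(self, bert_dim=768, n_pos=17, n_class=2, tag_dim=16, z_dim=64):
        super().__init__()
        self.pos_emb = nn.Embedding(n_pos, tag_dim)      # part-of-speech tag
        self.class_emb = nn.Embedding(n_class, tag_dim)  # open vs. closed class
        self.mlp = nn.Sequential(
            nn.Linear(bert_dim + 2 * tag_dim, 256), nn.ReLU(),
            nn.Linear(256, z_dim),  # predicted word-level prosody vector
        )

    def forward(self, bert_vecs, pos_ids, class_ids):
        # bert_vecs: (words, bert_dim); pos_ids, class_ids: (words,)
        feats = torch.cat(
            [bert_vecs, self.pos_emb(pos_ids), self.class_emb(class_ids)], dim=-1
        )
        return self.mlp(feats)  # (words, z_dim)

# Trained with, e.g., an L2 loss against the stage-one representations.
pred = ProsodyPredictor()
out = pred(torch.randn(5, 768), torch.randint(0, 17, (5,)), torch.randint(0, 2, (5,)))
print(out.shape)  # torch.Size([5, 64])
```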
As with Kathaka, during inference we replace the prosody representations from the reference encoder with the representations predicted from the syntactic and semantic content of the input text.

Compared to our NTTS baseline, CAMP showed a statistically significant 26% increase in naturalness.