More-natural prosody for synthesized speech

{"value":"At this year’s Interspeech, the Amazon text-to-speech team presented two new papers about controlling prosody — the rhythm, emphasis, melody, duration, and loudness of speech — in speech synthesis.\n\nOne paper, “++[CopyCat: many-to-many fine-grained prosody transfer for neural text-to-speech](https://www.amazon.science/publications/copycat-many-to-many-fine-grained-prosody-transfer-for-neural-text-to-speech)++”, is about transferring prosody from recorded speech to speech synthesized in a different voice. In particular, it addresses the problem of “source speaker leakage”, in which the speech synthesis model sometimes produces speech in the source speaker’s voice, rather than the target speaker’s voice.\n\nAccording to listener studies using the industry-standard MUSHRA (multiple stimuli with hidden reference and anchor) methodology, the speech produced by our model improved over the state-of-the-art system's by 47% in terms of naturalness and 14% in retention of speaker identity.\n\n\n\nThe other paper, “++[Dynamic prosody generation for speech synthesis using linguistics-driven acoustic embedding selection](htt![image.png](https://dev-media.amazoncloud.cn/714b02b268e34e578301da8e25d83f6d_image.png)ps://www.amazon.science/publications/dynamic-prosody-generation-for-speech-synthesis-using-linguistics-driven-acoustic-embedding-selection)++”, is about achieving more dynamic and natural intonation in synthesized speech from TTS systems. It describes a model that uses syntactic and semantic properties of the utterance to determine the prosodic features.\n\nAgain according to tests using the MUSHRA methodology, our model reduced the discrepancy between the naturalness of synthesized speech and that of recorded speech by about 6% for complex utterances and 20% on the task of long-form reading.\n\n\n\n#### **CopyCat**\n\n\nWhen prosody transfer (PT) involves very fine-grained characteristics — the inflections of individual words, as opposed to general speaking styles — it’s more likely to suffer from source speaker leakage. This issue is exacerbated when the PT model is trained on non-parallel data — i.e., without having the same utterances spoken by the source and target speaker.\n\nThe core of CopyCat is a novel reference encoder, whose inputs are a mel-spectrogram of the source speech (a snapshot of the frequency spectrum); an embedding, or vector representation, of the source speech phonemes (the smallest units of speech); and a vector indicating the speaker’s identity. \n\nThe reference encoder outputs speaker-independent representations of the prosody of the input speech. These prosodic representations are robust to source speaker leakage despite being trained on non-parallel data. In the absence of parallel data, we train the model to transfer prosody from speakers onto themselves. \n\n![image.png](https://dev-media.amazoncloud.cn/745481604a9f49b0b0067a16e967bfc4_image.png)\n\nThe CopyCat architecture.\n\nDuring inference, the phonemes of the speech to be synthesized pass first through a phoneme encoder and then to the reference encoder. The output of the reference encoder, together with the encoded phonemes and the speaker identity vector, then passes to the decoder, which generates speech with the target speaker’s voice and the source speaker's prosody.\n\nIn order to evaluate the efficacy of our method, we compared CopyCat to a state-of-the-art model over five target voices, onto which the source prosody from 12 different unseen speakers had been transferred. 
In order to evaluate the efficacy of our method, we compared CopyCat to a state-of-the-art model over five target voices, onto which the source prosody from 12 different unseen speakers had been transferred. CopyCat showed a statistically significant 47% increase in prosody transfer quality over the baseline. In another evaluation, involving native speakers of American English, CopyCat showed a statistically significant 14% improvement over the baseline in its ability to retain the target speaker’s identity. CopyCat achieves both results with a significantly simpler decoder than the baseline requires, with no drop in naturalness.

#### **Prosody Selection**

Text-to-speech (TTS) has improved dramatically in recent years, but it still lacks the dynamic variation and adaptability of human speech.

One popular way to encode prosody in TTS systems is to use a variational autoencoder (VAE), which learns a distribution of prosodic characteristics from sample speech. Selecting a prosodic style for a synthetic utterance is a matter of picking a point — an acoustic embedding — in that distribution.

In practice, most VAE-based TTS systems simply choose a point at the center of the distribution — a centroid — for all utterances. But rendering every sample with the exact same prosody gets monotonous.

In our ++[Interspeech](https://www.amazon.science/conferences-and-events/interspeech-2020)++ paper, we present a novel way of exploiting linguistic information to select acoustic embeddings in VAE systems, achieving more dynamic and natural intonation, particularly for stylistic speech such as the ++[newscaster](https://www.amazon.science/blog/varying-speaking-styles-with-neural-text-to-speech)++ speaking style.
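To make the contrast concrete, here is a small, hypothetical sketch of the two selection strategies: the conventional centroid choice versus a learned mapping from per-utterance linguistic features to a point in the VAE’s latent space. The `selector` network, the feature dimensions, and the standard-normal prior are assumptions for illustration, not the model described in the paper.

```python
# A minimal sketch contrasting centroid selection of a VAE acoustic embedding
# with a linguistics-driven selection. All dimensions are illustrative.
import torch
import torch.nn as nn

latent_dim = 16   # dimensionality of the VAE's acoustic-embedding space (assumed)
ling_dim = 32     # dimensionality of linguistic features, syntactic and/or BERT (assumed)


def centroid_embedding():
    """Baseline: every utterance gets the same point, the center of the prior."""
    return torch.zeros(latent_dim)   # mean of an assumed standard-normal VAE prior


# Linguistics-driven: a small network maps per-utterance linguistic features
# (e.g., syntactic distances and/or BERT embeddings) to a point in latent space.
selector = nn.Sequential(
    nn.Linear(ling_dim, 64),
    nn.Tanh(),
    nn.Linear(64, latent_dim),
)


def linguistic_embedding(ling_features):
    return selector(ling_features)


ling_features = torch.randn(ling_dim)            # stand-in for real features
z_centroid = centroid_embedding()                # identical for all utterances
z_dynamic = linguistic_embedding(ling_features)  # varies with the text

# Either z would condition the TTS decoder; only z_dynamic changes from
# utterance to utterance, which is what restores prosodic variety.
print(z_centroid.shape, z_dynamic.shape)
```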
#### **Syntax, semantics, or both?**

We experiment with three different systems for generating vector representations of the inputs to a TTS system, which allows us to explore the impact of both syntax and semantics on the overall quality of speech synthesis.

The first system uses syntactic information only; the second relies solely on BERT embeddings, which capture semantic information about strings of text on the basis of word co-occurrence in large text corpora; and the third uses a combination of BERT and syntactic information. Based on these representations, our model selects acoustic embeddings to characterize the prosody of synthesized utterances.

To explore whether syntactic information can aid prosody selection, we use the notion of syntactic distance, a measure based on constituency trees, which map the syntactic relationships between the words of a sentence. Large syntactic distances correlate with acoustically relevant events such as phrasing breaks or prosodic resets.

![image.png](https://dev-media.amazoncloud.cn/b8575a0939f6427682491df40f316dd9_image.png)

A constituency tree featuring syntactic-distance measures (orange circles).

CREDIT: GLYNIS CONDON

Above is the constituency tree of the sentence “The brown fox is quick, and it is jumping over the lazy dog”. Parts of speech are labeled according to the Penn part-of-speech tags: “DT”, for instance, indicates a determiner; “VBZ” indicates a third-person singular present verb, while “VBG” indicates a gerund or present participle; and so on.

The structure of the tree indicates syntactic relationships: for instance, “the”, “brown”, and “fox” together compose a noun phrase (NP), while “is” and “quick” compose a verb phrase (VP).

Syntactic distance is a rank ordering that indicates the difference in the heights, within the tree, of the common ancestors of consecutive words; any values that preserve that ordering are valid.

One valid distance vector for this sentence is d = [0 2 1 3 1 8 7 6 5 4 3 2 1]. The completion of the subject noun phrase (after “fox”) triggers a prosodic reset, reflected in the distance of 3 between “fox” and “is”. There should also be a more emphasized reset at the end of the first clause, represented by the distance of 8 between “quick” and “and”.
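As a rough illustration, the sketch below derives one simple variant of such distances from a constituency tree: for each pair of consecutive words, it takes the height of the subtree rooted at their lowest common ancestor. It uses NLTK and an assumed parse of the example sentence (punctuation omitted); the paper’s exact procedure may differ, the numbers depend on the parse, the binarization, and the height convention, and only the rank ordering of the values matters.

```python
# Sketch: read syntactic distances off a constituency tree as the height of
# the lowest common ancestor of each pair of consecutive words.
from nltk import Tree

# An assumed bracketing of the example sentence (punctuation omitted).
parse = Tree.fromstring(
    "(S (S (NP (DT The) (JJ brown) (NN fox)) (VP (VBZ is) (ADJP (JJ quick))))"
    " (CC and)"
    " (S (NP (PRP it)) (VP (VBZ is) (VP (VBG jumping)"
    " (PP (IN over) (NP (DT the) (JJ lazy) (NN dog)))))))"
)


def syntactic_distances(tree):
    """Height of the lowest-common-ancestor subtree for consecutive leaves."""
    leaf_positions = [tree.leaf_treeposition(i) for i in range(len(tree.leaves()))]
    distances = [0]  # no word precedes the first one
    for prev, curr in zip(leaf_positions, leaf_positions[1:]):
        # Lowest common ancestor = longest common prefix of the two positions.
        lca = []
        for a, b in zip(prev, curr):
            if a != b:
                break
            lca.append(a)
        distances.append(tree[tuple(lca)].height())
    return distances


print(list(zip(parse.leaves(), syntactic_distances(parse))))
```

With this parse the values differ from the vector above, and ties appear inside flat phrases such as “the lazy dog”, but the largest value still falls at the clause boundary around “and”, which is exactly the kind of prosodic-reset signal the model exploits.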
We compared VAE models with linguistically informed acoustic-embedding selection against a VAE model that uses centroid selection on two tasks, sentence synthesis and long-form reading.

The sentence synthesis data set had four categories: complex utterances, sentences with compound nouns, and two types of questions with their characteristic prosody (the rising inflection at the end, for instance): questions beginning with “wh” words (who, what, why, etc.) and “or” questions, which present a choice.

The model that uses syntactic information alone improves on the baseline model across the board, while the addition of semantic information improves performance still further in some contexts.

On the “wh” questions, the combination of syntactic and semantic data delivered an 8% improvement over the baseline, and on the “or” questions, the improvement was 21%. This suggests that questions of the same type share closely related syntactic structures, information that can be exploited to achieve better prosody.

On long-form reading, the syntactic model alone delivered the best results, reducing the gap between the baseline and recorded speech by approximately 20%.

ABOUT THE AUTHORS

#### **[Shubhi Tyagi](https://www.amazon.science/author/shubhi-tyagi)**

Shubhi Tyagi is an applied scientist in the Amazon Text-to-Speech group.

#### **[Sri Karlapati](https://www.amazon.science/author/sri-karlapati)**

Sri Karlapati is an applied scientist in the Amazon Text-to-Speech group.