A simpler singing synthesis system


Singing synthesis, the use of computer models to synthesize a human singing voice, has been studied since the 1950s. Like the related field of text-to-speech (TTS), it initially revolved around two paradigms: statistical parametric synthesis, which uses statistical models to reproduce features of the voice, and unit selection, in which snippets of vocal recordings are recombined on the fly.

Recently, TTS has shifted toward neural text-to-speech (NTTS), or models based on deep neural networks, which increase the perceived quality of the generated speech. An important class of NTTS models, called attention-based sequence-to-sequence (AS2S), has become the industry standard.

In a [paper](https://www.amazon.science/publications/singing-synthesis-with-a-little-help-from-my-attention) at this year's [Interspeech](https://www.amazon.science/conferences-and-events/interspeech-2020), we are presenting a singing synthesis model called UTACO, which was built using AS2S. To our knowledge, we were the first to do this, in the fall of 2019, although several successful AS2S architectures have been introduced in the singing synthesis field since then.

![image.png](https://dev-media.amazoncloud.cn/d8a3cf27bf194ab5b83603a5fdca5604_image.png)

As input, the new singing synthesis system takes a musical score with lyrics, which it represents as a set of phonemes (the smallest units of sound from which spoken words are composed) labeled according to properties such as pitch and duration.

UTACO is simpler than previous models: it doesn't depend on sub-models that separately generate input features such as vibrato and note and phoneme durations; instead, it simply takes notated music with lyrics as input. It also autonomously sings in tune, which is not true of all neural models.

Most important, UTACO achieves a high level of naturalness. In our paper, we compare it to the most recent fully neural model in the literature, which obtained a naturalness score of 31 out of 100 in a test using the MUSHRA (multiple stimuli with hidden reference and anchor) methodology. UTACO's score was 60, while the training examples of humans singing scored 82.

Finally, because AS2S models are a very active field of research, UTACO can naturally take advantage of many improvements and extensions already reported in the literature.

#### **Simplifying singing synthesis**

When we began investigating singing synthesis, we noticed a stark contrast between it and NTTS. Singing synthesis models seemed to be conceptually more complex. Most singing models required a number of different inputs, such as the pitch pattern of the singing voice over time (called [F0](https://en.wikipedia.org/wiki/Fundamental_frequency)) or almost [imperceptible errors](https://www.isca-speech.org/archive/archive_papers/interspeech_2007/i07_4011.pdf) whose absence makes the singing sound unnatural. Producing each of these inputs required a separate sub-model.

The only input required by an AS2S TTS model, by contrast, is a sequence of [phonemes](https://en.wikipedia.org/wiki/Phoneme), the individual sounds from which any spoken word is made. Before AS2S, speech models also required specification of a number of other features, such as speed, rhythm, and intonation (collectively called [prosody](https://en.wikipedia.org/wiki/Prosody_(linguistics))). AS2S models learn all of this on their own, from training examples.

We wondered whether an AS2S model could learn everything necessary to synthesize a singing voice, too. A trained human can sing a song just by reading a score, so we built a simple AS2S speech architecture and fed it only the information contained in the score, showing it corresponding examples of how that score should be sung.

#### **Measurable advances**

In our paper, we compare UTACO to [WGANSing](https://arxiv.org/abs/1903.10729), which was at the time of submission the most recent fully neural singing synthesis model in the literature. In our [MUSHRA](https://en.wikipedia.org/wiki/MUSHRA) test, 40 listeners were asked to compare three versions of the same short song clip and score them from 0 to 100 in terms of perceived "naturalness". The versions were

- the audio produced by UTACO;
- the audio produced by WGANSing;
- the recording of the human voice used to train the model.

The listeners did not know which was which, so they were not biased. The results appear below. The mean differences in score are statistically significant (the p-value of all paired t-tests is below $10^{-16}$).

![image.png](https://dev-media.amazoncloud.cn/9d8a9ab794114080a515500086010e7f_image.png)

We take WGANSing as representative of the state of the art in neural singing synthesis circa fall 2019. WGANSing has a different architecture (it's not based on AS2S), and at synthesis time, it needs to be fed the pitch pattern and the duration of each phoneme, which are extracted from the original recording. UTACO generates all of these features on its own.

One interesting result is that UTACO is capable of reproducing a good vibrato autonomously, even "deciding" where to apply it: in the sample input below, note that there is no vibrato indication. Before UTACO, researchers created whole sub-[models](https://arxiv.org/pdf/1803.04030.pdf) devoted to representing vibrato.

UTACO is a jump forward for singing synthesis, but it does have its drawbacks. For example, rests in a score can sometimes cause it to break down (a [known](https://arxiv.org/abs/1710.07654) [problem](http://papers.nips.cc/paper/8580-fastspeech-fast-robust-and-controllable-text-to-speech) in AS2S architectures). And its timing is not quite perfect, something that a musician can detect instantly.

AS2S architectures, however, are being intensely researched in the text-to-speech field, and many of the resulting innovations may be directly applicable to our model.

#### **Model architecture**

![image.png](https://dev-media.amazoncloud.cn/1cf411d11c7e4b89924451c178abe696_image.png)

A diagram of the UTACO design.

In more detail, to turn a score into the input for UTACO, we use a representation we call note embeddings. We take a score (in [MusicXML](https://www.musicxml.com/) format) and perform linguistic analysis on the lyrics to figure out which phonemes must be pronounced on each note.

The sequence of phonemes is what a text-to-speech model would normally see as input. But to each phoneme, we add information about the note that contains it: octave (pitch range), step (which of the 12 notes in the pitch range it is), and duration in seconds. We also add a "progress" stream, which is 1 at the beginning of a note and 0 at the end, so UTACO knows where notes begin and end. (See illustration, above.)

As in a typical NTTS system, the model produces a [spectrogram](https://en.wikipedia.org/wiki/Spectrogram), which is turned into a waveform by a neural vocoder based on dilated causal convolutions.

We are pleased with the results of our experiments with UTACO. But this is just the beginning of a major change in the field of singing synthesis that will enhance its capabilities in ways that were unthinkable until just a few years ago.

ABOUT THE AUTHOR

#### **Orazio Angelini**

Orazio Angelini is an applied scientist with the Amazon Text-to-Speech group.
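
As a supplementary illustration of the note-embedding input described under "Model architecture", the following is a minimal sketch of how per-phoneme features might be assembled from a score. It is not the code used for UTACO. The `Note` class, the `note_embeddings` function, the use of beats plus a tempo to obtain seconds, and the linear ramp used for the "progress" value are all assumptions made for this example; the article only states that progress is 1 at the start of a note and 0 at the end.

```python
from dataclasses import dataclass

# Semitone offsets of the natural note names within one octave (C = 0).
SEMITONE = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


@dataclass
class Note:
    """Hypothetical container for one sung note (not from the paper)."""
    phonemes: list[str]  # phonemes pronounced on this note
    step: str            # note name, "C" through "B"
    alter: int           # accidental in semitones: -1 flat, 0 natural, 1 sharp
    octave: int          # octave number of the written note
    beats: float         # note length in quarter-note beats


def note_embeddings(notes, bpm):
    """Return one (phoneme, octave, step, duration_s, progress) row per phoneme."""
    rows = []
    for note in notes:
        # Fold the accidental into a 0-11 step index; B sharp and C flat
        # cross an octave boundary, so the octave is shifted to match.
        octave_shift, step_index = divmod(SEMITONE[note.step] + note.alter, 12)
        octave = note.octave + octave_shift
        duration_s = note.beats * 60.0 / bpm
        count = len(note.phonemes)
        for i, phoneme in enumerate(note.phonemes):
            # Assumed ramp: 1.0 on the first phoneme of a note, 0.0 on the last.
            progress = 1.0 if count == 1 else 1.0 - i / (count - 1)
            rows.append((phoneme, octave, step_index, duration_s, progress))
    return rows


if __name__ == "__main__":
    score = [
        Note(["h", "e"], "A", 0, 4, 1.0),
        Note(["l", "o"], "C", 1, 5, 2.0),
    ]
    for row in note_embeddings(score, bpm=120):
        print(row)
```

Each row pairs a phoneme with the pitch and length of its note, which is the kind of sequence an AS2S model could consume in place of plain phonemes. A real system would read these fields from a MusicXML file and would likely compute progress at a finer time resolution than one value per phoneme.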
