Using synthesized speech to train speech recognizers


In recent years, most commercial automatic-speech-recognition (ASR) systems have begun moving from hybrid systems, with separate acoustic models, dictionaries, and language models, to end-to-end neural-network models, which take an acoustic signal as input and output text.

End-to-end models have advantages in performance and flexibility, but they require more training data than hybrid systems. That can be a problem in situations where little training data is available: for example, when current events introduce new terminology (“coronavirus”) or when models are being adapted to new applications.

![image.png](https://dev-media.amazoncloud.cn/389e2e3aa9b94508851237fd1119bfe7_image.png)

An overview of the proposed approach, with a speech generation model (left) and an automatic-speech-recognition module (right).

In such situations, using synthetic speech as supplemental training data can be a viable solution. In a [paper](https://www.amazon.science/publications/synthasr-unlocking-synthetic-data-for-speech-recognition) we presented at this year’s [Interspeech](https://www.amazon.science/conferences-and-events/interspeech-2021), we adopt this approach, using synthetic voice data (like the output speech generated by Alexa’s text-to-speech (TTS) models) to update an ASR model.

In our experiments, we fine-tuned an existing ASR model to recognize the names of medications it hadn’t heard before and found that our approach reduced the model’s word error rate on the new vocabulary by 65%, relative to the original model.

It also left the model’s performance on the existing vocabulary unchanged, thanks to a continual-learning training procedure we used to avoid “catastrophic forgetting”. We describe that procedure in the paper, along with the steps we took to make the synthetic speech data look as much like real speech data as possible.

#### **Synthetic speech**

One key to building a robust ASR model is to train it on a range of different voices, so it can learn a variety of acoustic-frequency profiles and different ways of voicing phonemes, the shortest units of speech. We synthesized each utterance in our dataset 32 times, by randomly sampling 32 voice profiles from 500 we’d collected from volunteers in the lab.

Like most TTS models, ours has an encoder-decoder architecture: the encoder produces a vector representation of the input text, which the decoder translates into an output spectrogram, a series of snapshots of the frequency profile of the synthesized speech. The spectrogram passes to a neural vocoder, which adds the phase information necessary to convert it into a real speech signal.

For each speaker, we used a speaker identification system to produce a unique voice profile embedding, a vector representation of that speaker’s acoustic signature. This embedding is a late input to the TTS model, right before the decoding step.

We also train the TTS model to take a reference spectrogram as input, giving it a model of output prosody (the rhythm, emphasis, melody, duration, and loudness of the output speech). The architecture of the model allows us to vary both the voice profile embedding and the prosody embedding for the same input text, so we can produce multiple versions of the same utterance with different voices and prosodies.

![image.png](https://dev-media.amazoncloud.cn/b35d4e9ee8554ad293a4f33adefbb80b_image.png)

Our TTS model learns a voice- and prosody-agnostic phonetic encoder, whose output is conditioned on both a voice profile embedding and a prosodic embedding, allowing us to produce variations on the same text with different voices and prosodies.

Next, to make our synthesized speech look more like real speech, we manipulate it in a variety of ways: we apply different types of reverberation to it, based on chirp sound samples collected in the lab; we add noise; we attenuate certain frequency bands; and we mask parts of the signal to simulate interruptions. We apply these manipulations randomly according to certain probabilities (a 60% chance of added background noise, for instance) to ensure a good mix of different types of samples.

#### **Continual learning**

When a neural-network model is updated to reflect new data, it runs the risk of catastrophic forgetting: changing the model weights to handle new data can compromise the model’s ability to handle the types of data it was originally trained on. In our paper, we describe a few of the techniques we use to prevent this from happening when we fine-tune an existing ASR model on synthetic data.

Our baseline model is an ASR model trained on 50,000 hours of data. We update it for a new vocabulary of medication names in four stages. In the first stage, we add 5,000 hours of synthetic data to the original dataset and fine-tune the model on both, but we freeze the encoder weights, so that only the decoder’s weights change.

In stage two, we again fine-tune the model on the combined dataset, but this time, we allow updates to the encoder weights.

In stage three, we fine-tune the model on only the original data, but we add a new term to the loss function, which penalizes the model if the weights of its connections change too dramatically. Finally, in stage four, we fine-tune the model on only the original data, allowing all the weights to be updated in an unconstrained way.

In experiments, we found that, after the second stage of training (when the dataset includes the synthetic data, and all weights are free to change), the error rate on the new vocabulary dropped by more than 86% relative to baseline. The error rate on the existing vocabulary, however, rose slightly, by just under 1% relative to baseline.

In some cases, such as the adaptation of a model to a new application, that may be acceptable. But in an update to an already deployed model, it may not be. The third and fourth fine-tuning stages brought the error rate on the original vocabulary below baseline, while still cutting the error rate on the new vocabulary by 65%. Our experiments thus point toward a training methodology that can be adapted to different use cases.

ABOUT THE AUTHORS

#### **[Amin Fazel](https://www.amazon.science/author/amin-fazel)**

Amin Fazel is an applied scientist in the Alexa AI organization.

#### **[Yulan Liu](https://www.amazon.science/author/yulan-liu)**

Yulan Liu is a speech scientist in the Alexa AI organization.
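The voice sampling described under "Synthetic speech" (32 voice profiles drawn from a pool of 500 for each utterance) can be illustrated with a minimal sketch. This is not the authors' code: the function name `synthesis_plan`, the use of integer IDs for voice profiles, the example utterances, and the choice to sample without replacement are all assumptions made for illustration.

```python
import random


def synthesis_plan(utterances, num_profiles=500, per_utterance=32, seed=0):
    """Pair every utterance with a random subset of voice-profile IDs.

    Assumption: profiles are sampled without replacement, so each utterance
    gets 32 distinct voices. The article only says the sampling is random.
    """
    rng = random.Random(seed)
    plan = []
    for text in utterances:
        # Draw a fresh set of voice profiles for this utterance.
        for voice_id in rng.sample(range(num_profiles), per_utterance):
            plan.append((text, voice_id))
    return plan


# Hypothetical utterances; each (text, voice_id) pair would be one TTS request.
plan = synthesis_plan(["take aspirin", "take ibuprofen"])
print(len(plan))  # 2 utterances x 32 voices = 64 synthesis requests
```

In a real pipeline, each voice ID would presumably map to a stored voice profile embedding, and a prosody reference could be sampled in the same way.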
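The randomized manipulations described under "Synthetic speech" can be sketched as a small augmentation routine in which each step is applied with its own probability. Only the 60% chance of added noise comes from the article. The other probabilities, the noise scale, the masking width, and all function names are hypothetical, and frequency-band attenuation is left out to keep the sketch short. Signals are plain lists of floats here, and reverberation is modeled as convolution with an impulse response, which in the article's setting would be derived from recorded chirp samples.

```python
import random


def add_noise(signal, rng, scale=0.01):
    # Additive Gaussian noise; the scale is an illustrative value.
    return [x + rng.gauss(0.0, scale) for x in signal]


def mask_segment(signal, rng, max_fraction=0.1):
    # Zero out one random contiguous span to simulate an interruption.
    n = len(signal)
    width = rng.randint(0, int(n * max_fraction))
    start = rng.randint(0, n - width)
    return signal[:start] + [0.0] * width + signal[start + width:]


def reverberate(signal, impulse_response):
    # Direct convolution, truncated to the original signal length.
    out = [0.0] * len(signal)
    for i, x in enumerate(signal):
        for j, h in enumerate(impulse_response):
            if i + j < len(out):
                out[i + j] += x * h
    return out


def augment(signal, rng, impulse_response,
            p_noise=0.6, p_reverb=0.5, p_mask=0.3):
    """Apply each manipulation independently with its own probability.

    p_noise=0.6 follows the article; p_reverb and p_mask are made up.
    """
    applied = []
    if rng.random() < p_reverb:
        signal = reverberate(signal, impulse_response)
        applied.append("reverb")
    if rng.random() < p_noise:
        signal = add_noise(signal, rng)
        applied.append("noise")
    if rng.random() < p_mask:
        signal = mask_segment(signal, rng)
        applied.append("mask")
    return signal, applied
```

Because every sample draws its own set of manipulations, repeated calls over the same synthesized utterance yield a mix of clean, noisy, reverberant, and partially masked variants.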
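The four-stage schedule described under "Continual learning" can be summarized as a small configuration table plus a loss with an optional weight-change penalty. This is a sketch under stated assumptions, not the procedure from the paper: the article does not give the exact form of the penalty, so a squared-distance (L2) penalty toward a reference copy of the weights stands in for it. The `Stage` class, the `strength` parameter, and the assumption that the encoder is trainable in stages three and four are all illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    use_synthetic: bool    # train on original + synthetic data?
    train_encoder: bool    # are encoder weights allowed to change?
    weight_penalty: bool   # add the weight-change term to the loss?


STAGES = [
    Stage("stage 1", use_synthetic=True, train_encoder=False, weight_penalty=False),
    Stage("stage 2", use_synthetic=True, train_encoder=True, weight_penalty=False),
    # Assumption: all weights are trainable in stage 3, but penalized.
    Stage("stage 3", use_synthetic=False, train_encoder=True, weight_penalty=True),
    Stage("stage 4", use_synthetic=False, train_encoder=True, weight_penalty=False),
]


def weight_change_penalty(weights, reference, strength):
    # Assumed form: strength * sum of squared differences from the
    # reference weights (for example, the weights at the start of the stage).
    return strength * sum((w - r) ** 2 for w, r in zip(weights, reference))


def total_loss(task_loss, weights, reference, stage, strength=0.1):
    # The penalty is only active in stages that request it.
    if stage.weight_penalty:
        return task_loss + weight_change_penalty(weights, reference, strength)
    return task_loss


for stage in STAGES:
    data = "original + synthetic" if stage.use_synthetic else "original only"
    print(stage.name, "|", data, "| encoder trainable:", stage.train_encoder)
```

Keeping the schedule as data makes it straightforward to drop or reorder stages, for instance stopping after stage two when a small regression on the existing vocabulary is acceptable.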
