Using warped language models to correct speech recognition errors

自然语言处理

海外精选

海外精选的内容汇集了全球优质的亚马逊云科技相关技术内容。同时，内容中提到的“AWS” 是 “Amazon Web Services” 的缩写，在此网站不作为商标展示。

{"value":"Language-related machine learning applications have made great strides in recent years, thanks in part to masked language models such as BERT: during training, sentences are fed to the models with certain words either masked out or replaced with random substitutions, and the models learn to output complete and corrected sentences.\n\nThe success of masked language models has led to the development of warped language models, which add insertions and deletions to the menu of possible alterations. Warped language models were designed specifically to address the types of errors common in automatic speech recognition (ASR), so they could serve as the basis for ASR models.\n\nIn a ++[paper](https://www.amazon.science/publications/correcting-automated-and-manual-speech-transcription-errors-using-warped-language-models)++ we presented at this year’s ++[Interspeech](https://www.amazon.science/conferences-and-events/interspeech-2021)++, we describe how to use warped language models, not during ASR, but to correct ASR output — or to correct the human transcriptions of speech used to train ASR models.\n\nThis new use case required us to modify the design of warped language models so that they not only output text strings but also classify the errors in the input strings. From that information, we can produce a corrected text even when its word count differs from that of the input.\n\n\nSince ASR models output not a single transcript of input speech but a ranked list of hypotheses, we also experimented with using multiple hypotheses as inputs to our error correction model. With the human transcriptions, we generated the hypotheses by subjecting the transcribed speech to ASR. \n\nWe found that the multiple-hypothesis approach had particular benefits for the correction of human transcription errors. There, it was able to reduce the word error rate by about 11%. For ASR outputs, the same model reduced the word error rate by almost 6%.\n\n#### **Warped language models**\n\n![image.png](https://dev-media.amazoncloud.cn/f53a2267235b4a1081b51351caa0a269_image.png)\n\nThe traditional architecture of a warped language model, in which each output token corresponds to exactly one input token.\n\nA warped language model outputs a token — either a word or a special symbol, such as a blank when it detects a spurious insertion — for each word of its input. This means, however, that it can’t fully correct word deletions: it has to choose between outputting the dropped word or the input word at the current position.\n\nWe adapt the basic architecture of the warped language model so that, for each input token, it predicts both an output token and a warping operation.\n\nOur model still outputs a single token for each input token, but from the combination of the token and the warping operation, a simple correction algorithm can deduce the original input. \n\nThe figure below, for example, depicts our model’s handling of the input sentence “I saying that table I [mask] apples place oranges.” The middle row indicates our model’s output: first, an operation name, and second, an output token. When our model replaces the input “saying” with the output “was” and flags the operation as “drop”, the correction algorithm deduces that the sentence should have begun “I was saying”, not “I saying”.\n\n![image.png](https://dev-media.amazoncloud.cn/42936b6ff70643ecbc2140158e629cd1_image.png)\n\nOur modified architecture. For each input token, it predicts both an output token and a warping operation.\n\n![image.png](https://dev-media.amazoncloud.cn/aaeca4c82778498a8a0269c713d761a2_image.png)\n\nAn example of our model’s handling of all five operations used to train warped language models: keep (no alteration); drop; insert; mask; and rand (random substitution).\n\nThe great advantage of masked (and warped) language models is that they are unsupervised: the masking (and warping) operations can be performed automatically, enabling an effectively unlimited amount of training data. Our model is similarly unsupervised: we simply modify the warping algorithm so that, when it applies an operation, it also tags the output with the operation’s name.\n\n#### **Multiple hypotheses**\n\nAfter training our model on a corpus of English-language texts, we fine-tuned it on the output of an ASR model for a separate set of spoken-language utterances. For each utterance, we kept the top five ASR hypotheses.\n\nAlgorithms automatically aligned the tokens of the hypotheses and standardized their lengths, adding blank tokens where necessary. We treated hypotheses two through five as warped versions of the top hypothesis, automatically computing the minimum number of warping operations required to transform the top hypothesis into an alternate hypothesis and labeling the hypothesis tokens appropriately.\n\n![image.png](https://dev-media.amazoncloud.cn/35cfdc8d3df04fe684be4be5df09299b_image.png)\n\nThe multi-hypothesis version of the model.\n\nFor each input, our model combines all five hypotheses to produce a single vector representation (an embedding) that the model’s decoder uses to produce output strings. \n\nDuring training, the model outputs a separate set of predictions for each hypothesis. This ensures the fine-tuning of the operation predictor as well as the token predictor, as the operation classifications will differ for each hypothesis, even if the token strings are identical. At run time, however, we keep only the output corresponding to the top-ranked ASR hypothesis.\n\nWithout fine-tuning on the ASR hypotheses, our model reduced the word error rate of ASR model outputs by 5%. But it slightly increased the word error rate on the human transcriptions of speech. This is probably because the human-transcribed speech, even when erroneous, is still syntactically and semantically coherent, so errors are hard to identify. The addition of the alternate ASR hypotheses, however, enables the correction model to exploit additional information in the speech signal itself, leading to a pronounced reduction in word error rate.\n\nABOUT THE AUTHOR\n\n#### **[Mahdi Namazifar](https://www.amazon.science/author/mahdi-namazifar)**\n\nMahdi Namazifar is a senior applied scientist in the Alexa AI organization.\n\n\n\n\n\n\n\n\n\n\n\n\n","render":"Language-related machine learning applications have made great strides in recent years, thanks in part to masked language models such as BERT: during training, sentences are fed to the models with certain words either masked out or replaced with random substitutions, and the models learn to output complete and corrected sentences.\nThe success of masked language models has led to the development of warped language models, which add insertions and deletions to the menu of possible alterations. Warped language models were designed specifically to address the types of errors common in automatic speech recognition (ASR), so they could serve as the basis for ASR models.\nIn a <ins><a href=\"https://www.amazon.science/publications/correcting-automated-and-manual-speech-transcription-errors-using-warped-language-models\" target=\"_blank\">paper</a></ins> we presented at this year’s <ins><a href=\"https://www.amazon.science/conferences-and-events/interspeech-2021\" target=\"_blank\">Interspeech</a></ins>, we describe how to use warped language models, not during ASR, but to correct ASR output — or to correct the human transcriptions of speech used to train ASR models.\nThis new use case required us to modify the design of warped language models so that they not only output text strings but also classify the errors in the input strings. From that information, we can produce a corrected text even when its word count differs from that of the input.\nSince ASR models output not a single transcript of input speech but a ranked list of hypotheses, we also experimented with using multiple hypotheses as inputs to our error correction model. With the human transcriptions, we generated the hypotheses by subjecting the transcribed speech to ASR.\nWe found that the multiple-hypothesis approach had particular benefits for the correction of human transcription errors. There, it was able to reduce the word error rate by about 11%. For ASR outputs, the same model reduced the word error rate by almost 6%.\n<h4><a id=\"Warped_language_models_13\"></a>Warped language models</h4>\n<img src=\"https://dev-media.amazoncloud.cn/f53a2267235b4a1081b51351caa0a269_image.png\" alt=\"image.png\" />\nThe traditional architecture of a warped language model, in which each output token corresponds to exactly one input token.\nA warped language model outputs a token — either a word or a special symbol, such as a blank when it detects a spurious insertion — for each word of its input. This means, however, that it can’t fully correct word deletions: it has to choose between outputting the dropped word or the input word at the current position.\nWe adapt the basic architecture of the warped language model so that, for each input token, it predicts both an output token and a warping operation.\nOur model still outputs a single token for each input token, but from the combination of the token and the warping operation, a simple correction algorithm can deduce the original input.\nThe figure below, for example, depicts our model’s handling of the input sentence “I saying that table I [mask] apples place oranges.” The middle row indicates our model’s output: first, an operation name, and second, an output token. When our model replaces the input “saying” with the output “was” and flags the operation as “drop”, the correction algorithm deduces that the sentence should have begun “I was saying”, not “I saying”.\n<img src=\"https://dev-media.amazoncloud.cn/42936b6ff70643ecbc2140158e629cd1_image.png\" alt=\"image.png\" />\nOur modified architecture. For each input token, it predicts both an output token and a warping operation.\n<img src=\"https://dev-media.amazoncloud.cn/aaeca4c82778498a8a0269c713d761a2_image.png\" alt=\"image.png\" />\nAn example of our model’s handling of all five operations used to train warped language models: keep (no alteration); drop; insert; mask; and rand (random substitution).\nThe great advantage of masked (and warped) language models is that they are unsupervised: the masking (and warping) operations can be performed automatically, enabling an effectively unlimited amount of training data. Our model is similarly unsupervised: we simply modify the warping algorithm so that, when it applies an operation, it also tags the output with the operation’s name.\n<h4><a id=\"Multiple_hypotheses_37\"></a>Multiple hypotheses</h4>\nAfter training our model on a corpus of English-language texts, we fine-tuned it on the output of an ASR model for a separate set of spoken-language utterances. For each utterance, we kept the top five ASR hypotheses.\nAlgorithms automatically aligned the tokens of the hypotheses and standardized their lengths, adding blank tokens where necessary. We treated hypotheses two through five as warped versions of the top hypothesis, automatically computing the minimum number of warping operations required to transform the top hypothesis into an alternate hypothesis and labeling the hypothesis tokens appropriately.\n<img src=\"https://dev-media.amazoncloud.cn/35cfdc8d3df04fe684be4be5df09299b_image.png\" alt=\"image.png\" />\nThe multi-hypothesis version of the model.\nFor each input, our model combines all five hypotheses to produce a single vector representation (an embedding) that the model’s decoder uses to produce output strings.\nDuring training, the model outputs a separate set of predictions for each hypothesis. This ensures the fine-tuning of the operation predictor as well as the token predictor, as the operation classifications will differ for each hypothesis, even if the token strings are identical. At run time, however, we keep only the output corresponding to the top-ranked ASR hypothesis.\nWithout fine-tuning on the ASR hypotheses, our model reduced the word error rate of ASR model outputs by 5%. But it slightly increased the word error rate on the human transcriptions of speech. This is probably because the human-transcribed speech, even when erroneous, is still syntactically and semantically coherent, so errors are hard to identify. The addition of the alternate ASR hypotheses, however, enables the correction model to exploit additional information in the speech signal itself, leading to a pronounced reduction in word error rate.\nABOUT THE AUTHOR\n<h4><a id=\"Mahdi_Namazifarhttpswwwamazonscienceauthormahdinamazifar_55\"></a><a href=\"https://www.amazon.science/author/mahdi-namazifar\" target=\"_blank\">Mahdi Namazifar</a></h4>\nMahdi Namazifar is a senior applied scientist in the Alexa AI organization.\n"}

亚马逊云科技解决方案基于行业客户应用场景及技术领域的解决方案

联系亚马逊云科技专家