Alexa’s speech recognition research at ICASSP 2022

{"value":"This week, the [IEEE International Conference on Acoustics, Speech and Signal Processing ](https://www.amazon.science/conferences-and-events/icassp-2022)(ICASSP) got under way in virtual form, to be followed by an in-person meeting two weeks later (May 22-27) in Singapore. ICASSP is the flagship conference of the [IEEE Signal Processing Society ](https://signalprocessingsociety.org/)and, as such, one of the premier venues for publishing the latest advances in automatic speech recognition (ASR) and other speech-processing and speech-related fields, with strong participation from both industry and academia.\n\nThis year, the Alexa AI ASR organization is represented by [21 papers](https://www.amazon.science/blog/a-quick-guide-to-amazons-50-plus-icassp-papers), more than in any prior year, reflecting the growth of speech-related science in Alexa AI. Here we highlight a few of these papers, to give an idea of their breadth.\n\n##### **More ICASSP coverage on Amazon Science**\n- Andrew Breen, the senior manager of text-to-speech research in the Amazon Text-to-Speech group, [discusses the group's four ICASSP papers](https://www.amazon.science/blog/amazon-text-to-speech-groups-research-at-icassp-2022).\n- The 50-plus Amazon papers at ICASSP, [sorted by research topic](https://www.amazon.science/blog/a-quick-guide-to-amazons-50-plus-icassp-papers).\n\n#### **Multimodal pretraining for end-to-end ASR**\n\n\nDeep-learning methods have taken over as the method of choice in speech-based recognition and classification tasks, and increasingly, [self-supervised representation learning](https://www.amazon.science/tag/self-supervised-learning) is used to pretrain models on large unlabeled datasets, followed by “fine-tuning” on task-labeled data.\n\nIn their paper “[Multi-modal Pretraining for Automated Speech Recognition](https://www.amazon.science/publications/multi-modal-pre-training-for-automated-speech-recognition)”, David Chan and colleagues give a new twist to this approach by pretraining speech representations on audiovisual data. As the self-supervision task for both modalities, they adapt the masked language model, in which words of training sentences are randomly masked out, and the model learns to predict them. In their case, however, the masks are applied to features extracted from the video and audio stream.\n\n![下载.jpg](https://dev-media.amazoncloud.cn/3ef0119e4bc94aa4a6a91bade5b708f0_%E4%B8%8B%E8%BD%BD.jpg)\n\nIn \"[Multi-modal pre-training for automated speech recognition](https://www.amazon.science/publications/multi-modal-alignment-using-representation-codebook)\", Amazon researchers adapt the masked language model, which learns to predict masked-out words of training sentences, to features extracted from video and audio streams.\n\nOnce pretrained, the audio-only portion of the learned representation is fused with a more standard front-end representation to feed into an end-to-end speech recognition system. The researchers show that this approach yields more accurate ASR results than pretraining with only audio-based self-supervision, suggesting that the correlations between acoustic and visual signals are helpful in extracting higher-level structures relevant to the encoding of speech.\n\n\n#### **Signal-to-interpretation with multimodal embeddings**\n\n\nThe advantages of multimodality are not limited to unsupervised-learning settings. 
#### **Signal-to-interpretation with multimodal embeddings**

The advantages of multimodality are not limited to unsupervised-learning settings. In “[Tie your embeddings down: Cross-modal latent spaces for end-to-end spoken language understanding](https://www.amazon.science/publications/tie-your-embeddings-down-cross-modal-latent-spaces-for-end-to-end-spoken-language-understanding)”, Bhuvan Agrawal and coauthors study signal-to-interpretation (S2I) recognizers that map a sequential acoustic input to an embedding, from which the intent of an utterance is directly inferred.

This bypasses the need for explicit speech transcription but still uses supervision for utterance intents. Due to their compactness, S2I models are attractive for on-device deployment, which has multiple benefits. For example, Alexa AI has used on-device speech processing to [make Alexa faster and lower-bandwidth](https://www.amazon.science/blog/on-device-speech-processing-makes-alexa-faster-lower-bandwidth).

![下载.jpg](https://dev-media.amazoncloud.cn/07b54b1847804a2084d67169487f17d0_%E4%B8%8B%E8%BD%BD.jpg)

In "[Tie your embeddings down: Cross-modal latent spaces for end-to-end spoken language understanding](https://www.amazon.science/publications/tie-your-embeddings-down-cross-modal-latent-spaces-for-end-to-end-spoken-language-understanding)", Amazon researchers train encoders to generate acoustic and text embeddings in the same representational space, so that the origin of the embeddings becomes indistinguishable.

Agrawal and colleagues show that S2I recognizers give better results when their acoustic embeddings are constrained to be close to embeddings of the corresponding textual input produced by a pretrained language model (BERT). As in the earlier paper, this cross-modal signal is used during training only and is not required for inference (i.e., at runtime). It is a clever way to sneak linguistic structure back into the S2I system while also infusing it with knowledge gleaned from the vastly larger language-model training data.

The idea of matching embeddings derived from audio to those of corresponding text strings (i.e., transcripts) also has other applications. In their paper “[TinyS2I: A small-footprint utterance classification model with contextual support for on-device SLU](https://www.amazon.science/publications/tinys2i-a-small-footprint-utterance-classification-model-with-contextual-support-for-on-device-slu)”, Anastasios Alexandridis et al. show that extremely compact, low-latency speech-understanding models can be obtained for the utterances most frequently used to control certain applications, such as media playback.

![下载.jpg](https://dev-media.amazoncloud.cn/4c2d4c41c3614891b82183c1c9ff2c85_%E4%B8%8B%E8%BD%BD.jpg)

The TinyS2I architecture. From "TinyS2I: A small-footprint utterance classification model with contextual support for on-device SLU".

The most frequent control commands (“pause”, “volume up”, and the like) can be classified directly from an acoustic embedding. For commands involving an item from a contextual menu (“play [title]”), the acoustic embedding is matched to the media title’s textual embedding. In this paper, unlike the previous one, the textual embeddings are trained jointly with the acoustic ones, but the same triplet loss function can be used to align the cross-modal embeddings in a shared space.
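As a rough illustration of this kind of cross-modal alignment, a triplet loss can pull an utterance's acoustic embedding toward the embedding of its matching text (a transcript or a media title) and push it away from a mismatched one. The encoders, dimensions, and margin in the sketch below are placeholders, not the configurations used in either paper.

```python
# Generic sketch of cross-modal alignment with a triplet loss (not either
# paper's exact training setup).
import torch
import torch.nn as nn

embed_dim = 128
acoustic_encoder = nn.GRU(input_size=80, hidden_size=embed_dim, batch_first=True)
text_encoder = nn.EmbeddingBag(num_embeddings=10000, embedding_dim=embed_dim)  # stand-in for BERT
triplet = nn.TripletMarginLoss(margin=1.0)

def alignment_loss(features, pos_tokens, neg_tokens):
    # features: (B, T, 80) acoustic frames; *_tokens: (B, L) token IDs
    _, h = acoustic_encoder(features)      # h: (1, B, embed_dim)
    anchor = h.squeeze(0)                  # utterance-level acoustic embedding
    positive = text_encoder(pos_tokens)    # matching transcript / media title
    negative = text_encoder(neg_tokens)    # mismatched transcript / media title
    return triplet(anchor, positive, negative)

# Example usage with random stand-in data
loss = alignment_loss(torch.randn(4, 50, 80),
                      torch.randint(0, 10000, (4, 6)),
                      torch.randint(0, 10000, (4, 6)))
loss.backward()
```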
#### **ASR rescoring with BERT**

Deep encoders of text trained with the masked-language-model (MLM) paradigm, such as BERT, have been widely used as the basis for all sorts of natural-language tasks. As mentioned earlier, they can incorporate vast amounts of language data through self-supervised pretraining, followed by task-specific supervised fine-tuning.

So far, however, the practical impact of MLMs on ASR proper has been limited, in part because of unsatisfactory tradeoffs between computational overhead (latency) and achievable accuracy gains. This is now changing with the work of Liyan Xu et al., as described in “[RescoreBERT: Discriminative speech recognition rescoring with BERT](https://www.amazon.science/publications/rescorebert-discriminative-speech-recognition-rescoring-with-bert)”.

The researchers show how BERT-generated sentence encodings can be incorporated into a model that rescores the text strings output by an ASR model. Because BERT is trained on large corpora of (text-only) public data, it estimates the relative probabilities of different ASR hypotheses better than the ASR model can.

The researchers achieved their best results with a combined loss function based on both sentence pseudo-likelihood (a more computationally tractable estimate of sentence likelihood) and word error prediction. The resulting rescoring model is so effective compared with standard LSTM (long short-term memory) language models, while also exhibiting lower latency, that RescoreBERT has gone from internship project to Alexa production in less than a year.
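The pseudo-likelihood idea can be illustrated with an off-the-shelf masked language model: mask each token of a hypothesis in turn and sum the log-probabilities the model assigns to the true tokens. The sketch below shows only that scoring idea, combined with a simple first-pass/second-pass interpolation; it is not RescoreBERT itself, which trains a dedicated, lower-latency rescoring model on top of this kind of signal with a discriminative, word-error-based objective. The model choice and interpolation weight are illustrative assumptions.

```python
# Illustrative pseudo-likelihood (PLL) rescoring of ASR n-best hypotheses with
# an off-the-shelf masked LM; not the RescoreBERT model or training recipe.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum of log P(token | rest of sentence), masking one token at a time."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    with torch.no_grad():
        for i in range(1, ids.size(0) - 1):          # skip [CLS] and [SEP]
            masked = ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# Hypothetical n-best list: (hypothesis text, first-pass ASR log score)
nbest = [("play the beetles", -4.8), ("play the beatles", -5.0)]
weight = 0.5                                          # interpolation weight (assumed)
best = max(nbest, key=lambda h: h[1] + weight * pseudo_log_likelihood(h[0]))
print(best[0])
```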
#### **Ontological biasing for acoustic-event detection**

We round out this short selection of papers with one from an ASR-adjacent field. In “[Improved representation learning for acoustic event classification using tree-structured ontology](https://www.amazon.science/publications/improved-representation-learning-for-acoustic-event-classification-using-tree-structured-ontology)”, Arman Zharmagambetov and coauthors look at an alternative to self-supervised training for the task of [acoustic-event detection](https://www.amazon.science/tag/acoustic-event-detection) (AED). (AED is the technology behind Alexa’s ability to detect breaking glass, smoke alarms, and other noteworthy events around the house.)

They show that AED classifier training can be enhanced by forcing the resulting representations to identify not only the target event label (such as “dog barking”) but also supercategories (such as “domestic animal” and “animal sound”) drawn from an ontology, a hierarchical representation of relationships between concepts. The method can be further enhanced by requiring the classification to stay the same under distortions of the inputs. The researchers found that their method is more effective than purely self-supervised pretraining and, with only a fraction of the labeled data, comes close to the accuracy of fully supervised training.

![下载.jpg](https://dev-media.amazoncloud.cn/09dccd5f34384370b5a82dcd18586ab7_%E4%B8%8B%E8%BD%BD.jpg)

In "[Improved representation learning for acoustic event classification using tree-structured ontology](https://www.amazon.science/publications/improved-representation-learning-for-acoustic-event-classification-using-tree-structured-ontology)", Amazon researchers present a two-module joint model consisting of a representation neural network and a decision tree based on a predefined tree-structured ontology.
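The flavor of this ontological supervision can be sketched as a multi-task loss in which an audio embedding must predict both the leaf event label and all of its ancestors in the ontology. This is only a generic illustration of that idea, not the paper's method, which couples a representation network with a decision tree over the ontology; the mini-ontology, label names, and prediction heads below are hypothetical.

```python
# Sketch of ontology-aware AED training (not the paper's two-module model):
# predict the leaf event label plus every ancestor category in the ontology.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical mini-ontology: leaf label -> ancestor categories
ontology = {
    "dog_barking":    ["animal_sound", "domestic_animal"],
    "cat_meowing":    ["animal_sound", "domestic_animal"],
    "smoke_alarm":    ["alarm", "household_sound"],
    "glass_breaking": ["impact_sound", "household_sound"],
}
leaf_names = sorted(ontology)
ancestor_names = sorted({a for ancestors in ontology.values() for a in ancestors})

class OntologyAEDLoss(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        self.leaf_head = nn.Linear(embed_dim, len(leaf_names))
        self.ancestor_head = nn.Linear(embed_dim, len(ancestor_names))

    def forward(self, embedding, leaf_labels):
        # embedding: (B, embed_dim); leaf_labels: list of leaf-name strings
        leaf_idx = torch.tensor([leaf_names.index(l) for l in leaf_labels])
        leaf_loss = F.cross_entropy(self.leaf_head(embedding), leaf_idx)
        # Multi-label targets covering all ancestors of each leaf label
        anc = torch.zeros(embedding.size(0), len(ancestor_names))
        for b, leaf in enumerate(leaf_labels):
            for a in ontology[leaf]:
                anc[b, ancestor_names.index(a)] = 1.0
        anc_loss = F.binary_cross_entropy_with_logits(self.ancestor_head(embedding), anc)
        return leaf_loss + anc_loss

# Example usage with a stand-in audio embedding
criterion = OntologyAEDLoss()
loss = criterion(torch.randn(2, 128), ["dog_barking", "smoke_alarm"])
loss.backward()
```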
#### **Conclusion and outlook**

As we have seen, Alexa relies on a range of audio-based technologies built on deep-learning architectures. The need to train these models robustly, fairly, and with limited supervision, together with computational constraints at runtime, continues to drive research in Alexa Science. We have highlighted some of the results from that work as they are about to be presented to the wider science community, and we are excited to see the field as a whole come up with creative solutions and push toward ever more capable applications of speech-based AI.

ABOUT THE AUTHOR

#### **[Andreas Stolcke](https://www.amazon.science/author/andreas-stolcke)**

Andreas Stolcke is a senior principal scientist in the Alexa AI organization.