{"value":"The International Conference on Acoustics, Speech, and Signal Processing ([ICASSP](https://www.amazon.science/conferences-and-events/icassp-2021)) starts next week, and as Alexa principal research scientist Ariya Rastrow [explained last year](https://www.amazon.science/blog/icassp-what-signal-processing-has-come-to-mean), it casts a wide net. The topics of the 36 Amazon research papers at this year’s ICASSP range from the classic signal-processing problems of noise and echo cancellation to such far-flung problems as separating song vocals from instrumental tracks and regulating translation length.\n\nA plurality of the papers, however, concentrate on the core technology of **automatic speech recognition** (ASR), or converting an acoustic speech signal into text:\n\n- [**ASR n-best fusion nets**](https://www.amazon.science/publications/asr-n-best-fusion-nets)\n[Xinyue Liu](https://www.amazon.science/author/xinyue-liu), Mingda Li, [Luoxin Chen](https://www.amazon.science/author/luoxin-chen), [Prashan Wanigasekara](https://www.amazon.science/author/prashan-wanigasekara), [Weitong Ruan](https://www.amazon.science/author/wael-hamza), [Haidar Khan](https://www.amazon.science/author/haidar-khan), [Wael Hamza](https://www.amazon.science/author/wael-hamza), [Chengwei Su](https://www.amazon.science/author/chengwei-su)\n\n- [**Bifocal neural ASR: Exploiting keyword spotting for inference optimization**](https://www.amazon.science/publications/bifocal-neural-asr-exploiting-keyword-spotting-for-inference-optimization)\nJon Macoskey, [Grant P. Strimel](https://www.amazon.science/author/grant-p-strimel), [Ariya Rastrow](https://www.amazon.science/author/ariya-rastrow)\n\n- [**Domain-aware neural language models for speech recognition**](https://www.amazon.science/publications/domain-aware-neural-language-models-for-speech-recognition)\n[Linda Liu](https://www.amazon.science/author/linda-liu), [Yile Gu](https://www.amazon.science/author/yile-gu), [Aditya Gourav](https://www.amazon.science/author/aditya-gourav), [Ankur Gandhe](https://www.amazon.science/author/ankur-gandhe), [Shashank Kalmane](https://www.amazon.science/author/shashank-kalmane), [Denis Filimonov](https://www.amazon.science/author/denis-filiminov), [Ariya Rastrow](https://www.amazon.science/author/ariya-rastrow), [Ivan Bulyko](https://www.amazon.science/author/ivan-bulyko)\n\n- [**End-to-end multi-channel transformer for speech recognition**](https://www.amazon.science/publications/end-to-end-multi-channel-transformer-for-speech-recognition)\n[Feng-Ju Chang](https://www.amazon.science/author/feng-ju-chang), [Martin Radfar](https://www.amazon.science/author/martin-radfar), [Athanasios Mouchtaris](https://www.amazon.science/author/athanasios-mouchtaris), [Brian King](https://www.amazon.science/author/brian-king), [Siegfried Kunzmann](https://www.amazon.science/author/siegfried-kunzmann)\n\n- [**Improved robustness to disfluencies in RNN-transducer-based speech recognition**](https://www.amazon.science/publications/improved-robustness-to-disfluencies-in-rnn-transducer-based-speech-recognition)\n[Valentin Mendelev](https://www.amazon.science/author/valentin-mendelev), Tina Raissi, Guglielmo Camporese, [Manuel Giollo](https://www.amazon.science/author/manuel-giollo)\n\n- [**Personalization strategies for end-to-end speech recognition systems**](https://www.amazon.science/publications/personalization-strategies-for-end-to-end-speech-recognition-systems)\n[Aditya Gourav](https://www.amazon.science/author/aditya-gourav), [Linda Liu](https://www.amazon.science/author/linda-liu), [Ankur Gandhe](https://www.amazon.science/author/ankur-gandhe), [Yile Gu](https://www.amazon.science/author/yile-gu), [Guitang Lan](https://www.amazon.science/author/guitang-lan), [Xiangyang Huang](https://www.amazon.science/author/xiangyang-huang), [Shashank Kalmane](https://www.amazon.science/author/shashank-kalmane), [Gautam Tiwari](https://www.amazon.science/author/gautam-tiwari), [Denis Filimonov](https://www.amazon.science/author/denis-filiminov), [Ariya Rastrow](https://www.amazon.science/author/ariya-rastrow), [Andreas Stolcke](https://www.amazon.science/author/andreas-stolcke), [Ivan Bulyko](https://www.amazon.science/author/ivan-bulyko) \n\n- [**reDAT: Accent-invariant representation for end-to-end ASR by domain adversarial training with relabeling**](https://www.amazon.science/publications/redat-accent-invariant-representation-for-end-to-end-asr-by-domain-adversarial-training-with-relabeling)\nHu Hu, [Xuesong Yang](https://www.amazon.science/author/xuesong-yang), [Zeynab Raeesy](https://www.amazon.science/author/zeynab-raeesy), [Jinxi Guo](https://www.amazon.science/author/jinxi-guo), [Gokce Keskin](https://www.amazon.science/author/gokce-keskin), [Harish Arsikere](https://www.amazon.science/author/harish-arsikere), [Ariya Rastrow](https://www.amazon.science/author/ariya-rastrow), [Andreas Stolcke](https://www.amazon.science/author/andreas-stolcke), [Roland Maas](https://www.amazon.science/author/roland-maas) \n\n- [**Sparsification via compressed sensing for automatic speech recognition**](https://www.amazon.science/publications/sparsification-via-compressed-sensing-for-automatic-speech-recognition)\nKai Zhen, [Hieu Duy Nguyen](https://www.amazon.science/author/hieu-duy-nguyen), [Feng-Ju Chang](https://www.amazon.science/author/feng-ju-chang), [Athanasios Mouchtaris](https://www.amazon.science/author/athanasios-mouchtaris), [Ariya Rastrow](https://www.amazon.science/author/ariya-rastrow)\n\n- [**Streaming multi-speaker ASR with RNN-T**](https://www.amazon.science/publications/streaming-multi-speaker-asr-with-rnn-t)\n[Ilya Sklyar](https://www.amazon.science/author/ilya-sklyar), [Anna Piunova](https://www.amazon.science/author/anna-piunova), [Yulan Liu](https://www.amazon.science/author/yulan-liu) \n\n- [**Using synthetic audio to improve the recognition of out-of-vocabulary words in end-to-end ASR systems**](https://www.amazon.science/publications/using-synthetic-audio-to-improve-the-recognition-of-out-of-vocabulary-words-in-end-to-end-asr-systems)\nXianrui Zheng, [Yulan Liu](https://www.amazon.science/author/yulan-liu), [Deniz Gunceler](https://www.amazon.science/author/deniz-gunceler), [Daniel Willett](https://www.amazon.science/author/daniel-willett)\n\n![image.png](https://dev-media.amazoncloud.cn/15c4b6efdfa140a28a0c4a3a8e0006e2_image.png)\n\nTo enable personalization of end-to-end automatic-speech-recognition systems, Linda Liu, Aditya Gourav and their colleagues use a word-level biasing finite state transducer, or FST (left). A subword-level FST preserves the weights of the word-level FST. For instance, the weight between state 0 and 5 of the subword-level FST (representing the word “player”) is (-1.6) +(- 1.6)+(-4.8) = -8.\n\nFROM [\"PERSONALIZATION STRATEGIES FOR END-TO-END SPEECH RECOGNITION SYSTEMS\"](https://www.amazon.science/publications/personalization-strategies-for-end-to-end-speech-recognition-systems)\n\nTwo of the papers address **language (or code) switching**, a more complicated version of ASR in which the speech recognizer must also determine which of several possible languages is being spoken: \n\n- [**Joint ASR and language identification using RNN-T: An efficent approach to dynamic language switching**](https://www.amazon.science/publications/joint-asr-and-language-identification-using-rnn-t-an-efficent-approach-to-dynamic-language-switching)\n[Surabhi Punjabi](https://www.amazon.science/author/surabhi-punjabi), [Harish Arsikere](https://www.amazon.science/author/harish-arsikere), [Zeynab Raeesy](https://www.amazon.science/author/zeynab-raeesy), [Chander Chandak](https://www.amazon.science/author/chander-chandak), [Nikhil Bhave](https://www.amazon.science/author/nikhil-bhave), [Markus Mueller](https://www.amazon.science/author/markus-mueller), [Sergio Murillo](https://www.amazon.science/author/sergio-murillo), [Ariya Rastrow](https://www.amazon.science/author/ariya-rastrow), [Andreas Stolcke](https://www.amazon.science/author/andreas-stolcke), [Jasha Droppo](https://www.amazon.science/author/jasha-droppo), [Sri Garimella](https://www.amazon.science/author/sri-garimella), [Roland Maas](https://www.amazon.science/author/roland-maas), [Mat Hans](https://www.amazon.science/author/mat-hans), [Athanasios Mouchtaris](https://www.amazon.science/author/athanasios-mouchtaris), [Siegfried Kunzmann](https://www.amazon.science/author/siegfried-kunzmann)\n\n- [**Transformer-transducers for code-switched speech recognition**](https://www.amazon.science/publications/transformer-transducers-for-code-switched-speech-recognition)\nSiddharth Dalmia, [Yuzong Liu](https://www.amazon.science/author/yuzong-liu), [Srikanth Ronanki](https://www.amazon.science/author/srikanth-ronanki), [Katrin Kirchhoff ](https://www.amazon.science/author/katrin-kirchhoff)\n\nThe acoustic speech signal contains more information than just the speaker’s words; how the words are said can change their meaning. Such **paralinguistic signals** can be useful for a voice agent trying to determine how to interpret the raw text. Two of Amazon’s ICASSP papers focus on such signals:\n\n- [**Contrastive unsupervised learning for speech emotion recognition**](https://www.amazon.science/publications/contrastive-unsupervised-learning-for-speech-emotion-recognition)\nMao Li, [Bo Yang](https://www.amazon.science/author/bo-yang), [Joshua Levy](https://www.amazon.science/author/joshua-levy), [Andreas Stolcke](https://www.amazon.science/author/andreas-stolcke), [Viktor Rozgic](https://www.amazon.science/author/viktor-rozgic), [Spyros Matsoukas](https://www.amazon.science/author/spyros-matsoukas), [Constantinos Papayiannis](https://www.amazon.science/author/constantinos-papayiannis), [Daniel Bone](https://www.amazon.science/author/daniel-bone), [Chao Wang](https://www.amazon.science/author/chao-wang) \n\n- [**Disentanglement for audiovisual emotion recognition using multitask setup**](https://www.amazon.science/publications/disentanglement-for-audiovisual-emotion-recognition-using-multitask-setup)\nRaghuveer Peri, [Srinivas Parthasarathy](https://www.amazon.science/author/srinivas-parthasarathy), [Charles Bradshaw](https://www.amazon.science/author/charles-bradshaw), [Shiva Sundaram](https://www.amazon.science/blog/null) \n\nSeveral papers address other **extensions of ASR**, such as speaker diarization, or tracking which of several speakers issues each utterance; inverse text normalization, or converting the raw ASR output into a format useful to downstream applications; and acoustic event classification, or recognizing sounds other than human voices:\n\n- [**BW-EDA-EEND: Streaming end-to-end neural speaker diarization for a variable number of speakers**](https://www.amazon.science/publications/bw-eda-eend-streaming-end-to-end-neural-speaker-diarization-for-a-variable-number-of-speakers)\n[Eunjung Han](https://www.amazon.science/author/eunjung-han), [Chul Lee](https://www.amazon.science/author/chul-lee), [Andreas Stolcke](https://www.amazon.science/author/andreas-stolcke)\n\n- [**Neural inverse text normalization**](https://www.amazon.science/publications/neural-inverse-text-normalization)\n[Monica Sunkara](https://www.amazon.science/author/monica-sunkara), [Chaitanya Shivade](https://www.amazon.science/author/chaitanya-shivade), [Sravan Bodapati](https://www.amazon.science/author/sravan-bodapati), [Katrin Kirchhoff](https://www.amazon.science/author/katrin-kirchhoff)\n\n- [**Unsupervised and semi-supervised few-shot acoustic event classification**](https://www.amazon.science/publications/unsupervised-and-semi-supervised-few-shot-acoustic-event-classification)\nHsin-Ping Huang, [Krishna C. Puvvada](https://www.amazon.science/author/krishna-c-puvvada), [Ming Sun](https://www.amazon.science/author/ming-sun), [Chao Wang](https://www.amazon.science/author/chao-wang)\n\n**Speech enhancement**, or removing noise and echo from the speech signal, has been a prominent topic at ICASSP since the conference began in 1976. But more recent work on the topic — including Amazon’s two papers this year — uses deep-learning methods:\n\n![image.png](https://dev-media.amazoncloud.cn/8a67b6ead95941bf9b209f4c3804b6ce_image.png)\n\nThe structure of a joint echo control and noise suppression system from Amazon. A microphone (mic) captures the output of a loudspeaker, along with noise and echo. The echo is partially cancelled by an adaptive filter (ĥf), which uses the signal to the speaker. The microphone signal then passes to a residual-echo-suppression (RES) algorithm.\n\nFROM [\"LOW-COMPLEXITY, REAL-TIME JOINT NEURAL ECHO CONTROL AND SPEECH ENHANCEMENT BASED ON PERCEPNET\"](https://www.amazon.science/publications/low-complexity-real-time-joint-neural-echo-control-and-speech-enhancement-based-on-percepnet)\n\n- [**Enhancing into the codec: Noise robust speech coding with vector-quantized autoencoders**](https://www.amazon.science/publications/enhancing-into-the-codec-noise-robust-speech-coding-with-vector-quantized-autoencoders)\nJonah Casebeer, Vinjai Vale, [Umut Isik](https://www.amazon.science/author/umut-isik), [Jean-Marc Valin](https://www.amazon.science/author/jean-marc-valin), [Ritwik Giri](https://www.amazon.science/author/ritwik-giri), [Arvindh Krishnaswamy](https://www.amazon.science/author/arvindh-krishnaswamy)\n\n- [**Low-complexity, real-time joint neural echo control and speech enhancement based on Percepnet**](https://www.amazon.science/publications/low-complexity-real-time-joint-neural-echo-control-and-speech-enhancement-based-on-percepnet)\n[Jean-Marc Valin](https://www.amazon.science/author/jean-marc-valin), [Srikanth V. Tenneti](https://www.amazon.science/author/srikanth-v-tenneti), [Karim Helwani](https://www.amazon.science/author/karim-helwani), [Umut Isik](https://www.amazon.science/author/umut-isik), [Arvindh Krishnaswamy](https://www.amazon.science/author/arvindh-krishnaswamy)\n\nEvery interaction with Alexa begins with a wake word — usually “Alexa”, but sometimes “computer” or “Echo”. So at ICASSP, Amazon usually presents work on wake word detection — or **keyword spotting**, as it’s more generally known:\n\n- [**Exploring the application of synthetic audio in training keyword spotters**](https://www.amazon.science/publications/exploring-the-application-of-synthetic-audio-in-training-keyword-spotters)\n[Andrew Werchniak](https://www.amazon.science/author/andrew-werchniak), [Roberto Barra-Chicote](https://www.amazon.science/author/roberto-barra-chicote), [Yuriy Mishchenko](https://www.amazon.science/author/yuriy-mishchenko), [Jasha Droppo](https://www.amazon.science/author/jasha-droppo), [Jeff Condal](https://www.amazon.science/author/jeff-condal), [Peng Liu](https://www.amazon.science/author/peng-liu), [Anish Shah ](https://www.amazon.science/author/anish-shah)\n\nIn many spoken-language systems, the next step after ASR **is natural-language understanding** (NLU), or making sense of the text output from the ASR system:\n\n-[ **Introducing deep reinforcement learning to NLU ranking tasks**](https://www.amazon.science/publications/introducing-deep-reinforcement-learning-to-nlu-ranking-tasks)\n[Ge Yu](https://www.amazon.science/author/ge-yu), [Chengwei Su](https://www.amazon.science/author/chengwei-su), [Emre Barut](https://www.amazon.science/author/emre-barut) \n- [**Language model is all you need: Natural language understanding as question answering**](https://www.amazon.science/publications/language-model-is-all-you-need-natural-language-understanding-as-question-answering)\n[Mahdi Namazifar](https://www.amazon.science/author/mahdi-namazifar), [Alexandros Papangelis](https://www.amazon.science/author/alexandros-papangelis), [Gokhan Tur](https://www.amazon.science/author/gokhan-tur), [Dilek Hakkani-Tür ](https://www.amazon.science/author/dilek-hakkani-tur)\n\nIn some contexts, however, it’s possible to perform both ASR and NLU with a single model, in a task known as **spoken-language understanding:**\n\n- [**Do as I mean, not as I say: Sequence loss training for spoken language understanding**](https://www.amazon.science/publications/do-as-i-mean-not-as-i-say-sequence-loss-training-for-spoken-language-understanding)\n[Milind Rao](https://www.amazon.science/author/milind-rao), [Pranav Dheram](https://www.amazon.science/author/pranav-dheram), [Gautam Tiwari](https://www.amazon.science/author/gautam-tiwari), [Anirudh Raju](https://www.amazon.science/author/anirudh-raju), [Jasha Droppo](https://www.amazon.science/author/jasha-droppo), [Ariya Rastrow](https://www.amazon.science/author/ariya-rastrow), [Andreas Stolcke ](https://www.amazon.science/author/andreas-stolcke)\n\n- [**Graph enhanced query rewriting for spoken language understanding system**](https://www.amazon.science/publications/graph-enhanced-query-rewriting-for-spoken-language-understanding-system)\nSiyang Yuan, [Saurabh Gupta](https://www.amazon.science/author/saurabh-gupta), [Xing Fan](https://www.amazon.science/author/xing-fan), [Derek Liu](https://www.amazon.science/author/derek-liu), [Yang Liu](https://www.amazon.science/author/yang-liu), C[henlei (Edward) Guo](https://www.amazon.science/author/chenlei-guo) \n\n- [**Top-down attention in end-to-end spoken language understanding**](https://www.amazon.science/publications/top-down-attention-in-end-to-end-spoken-language-understanding)\nYixin Chen, [Weiyi Lu](https://www.amazon.science/author/weiyi-lu), [Alejandro Mottini ](https://www.amazon.science/author/alejandro-mottini), [Erran Li](https://www.amazon.science/author/erran-li), [Jasha Droppo](https://www.amazon.science/author/jasha-droppo), [Zheng Du](https://www.amazon.science/author/zheng-du), [Belinda Zeng](https://www.amazon.science/author/belinda-zeng)\n\n![image.png](https://dev-media.amazoncloud.cn/f753621dbe374383bf89768b087db820_image.png)\n\nA spoken-language-understanding system combines automatic speech recognition (ASR) and natural-language understanding (NLU) in a single model.\n\nFROM [\"DO AS I MEAN, NOT AS I SAY: SEQUENCE LOSS TRAINING FOR SPOKEN LANGUAGE UNDERSTANDING\"](https://www.amazon.science/publications/do-as-i-mean-not-as-i-say-sequence-loss-training-for-spoken-language-understanding)\n\nAn interaction with a voice service, which begins with keyword spotting, ASR, and NLU, often culminates with the agent’s use of synthesized speech to relay a response. The agent’s **text-to-speech** model converts the textual outputs of various NLU and dialogue systems into speech:\n\n- [**CAMP: A two-stage approach to modelling prosody in context**](https://www.amazon.science/publications/camp-a-two-stage-approach-to-modelling-prosody-in-context)\nZack Hodari, [Alexis Moinet](https://www.amazon.science/author/alexis-moinet), [Sri Karlapati](https://www.amazon.science/author/sri-karlapati), [Jaime Lorenzo-Trueba](https://www.amazon.science/author/jaime-lorenzo-trueba), [Thomas Merritt](https://www.amazon.science/author/thomas-merritt), [Arnaud Joly](https://www.amazon.science/author/arnaud-joly), [Ammar Abbas](https://www.amazon.science/author/ammar-abbas), [Penny Karanasou](https://www.amazon.science/author/penny-karanasou), [Thomas Drugman](https://www.amazon.science/author/thomas-drugman)\n\n- [**Low-resource expressive text-to-speech using data augmentation**](https://www.amazon.science/publications/low-resource-expressive-text-to-speech-using-data-augmentation)\n[Goeric Huybrechts](https://www.amazon.science/author/goeric-huybrechts), [Thomas Merritt](https://www.amazon.science/author/thomas-merritt), [Giulia Comini](https://www.amazon.science/author/giulia-comini), [Bartek Perz](https://www.amazon.science/author/bartek-perz), [Raahil Shah](https://www.amazon.science/author/raahil-shah), [Jaime Lorenzo-Trueba](https://www.amazon.science/author/jaime-lorenzo-trueba) \n\n- [**Prosodic representation learning and contextual sampling for neural text-to-speech**](https://www.amazon.science/publications/prosodic-representation-learning-and-contextual-sampling-for-neural-text-to-speech)\n[Sri Karlapati](https://www.amazon.science/author/sri-karlapati), [Ammar Abbas](), Zack Hodari], [Alexis Moinet](https://www.amazon.science/author/alexis-moinet), [Arnaud Joly](https://www.amazon.science/author/arnaud-joly), [Penny Karanasou](https://www.amazon.science/author/penny-karanasou), [Thomas Drugman](https://www.amazon.science/author/thomas-drugman)\n\n- [**Universal neural vocoding with Parallel WaveNet**](https://www.amazon.science/publications/universal-neural-vocoding-with-parallel-wavenet)\n[Yunlong Jiao](https://www.amazon.science/author/yunlong-jiao), [Adam Gabrys](https://www.amazon.science/author/adam-gabrys), Georgi Tinchev, [Bartosz Putrycz](https://www.amazon.science/author/bartosz-putrycz), [Daniel Korzekwa](https://www.amazon.science/author/daniel-korzekwa), [Viacheslav Klimkov](https://www.amazon.science/author/viacheslav-klimkov)\n\nAll of the preceding research topics have implications for voice services like Alexa, but Amazon has a range of other products and services that rely on audio-signal processing. Three of Amazon’s papers at this year’s ICASSP relate to **audio-video synchronization**: two deal with dubbing audio in one language onto video shot in another, and one describes how to detect synchronization errors in video — as when, for example, the sound of a tennis ball being struck and the shot of the racquet hitting the ball are misaligned:\n\n- [**Detection of audio-video synchronization errors via event detection**](https://www.amazon.science/publications/detection-of-audio-video-synchronization-errors-via-event-detection)\nJoshua P. Ebenezer, [Yongjun Wu](https://www.amazon.science/author/yongjun-wu), [Hai Wei](https://www.amazon.science/author/hai-wei), [Sriram Sethuraman](https://www.amazon.science/author/sriram-sethuraman), [Zongyi Liu ](https://www.amazon.science/author/zongyi-liu)\n\n- [**Improvements to prosodic alignment for automatic dubbing**](https://www.amazon.science/publications/improvements-to-prosodic-alignment-for-automatic-dubbing)\n[Yogesh Virkar](https://www.amazon.science/author/yogesh-virkar), [Marcello Federico](https://www.amazon.science/author/marcello-federico), [Robert Enyedi](https://www.amazon.science/author/robert-enyedi), [Roberto Barra-Chicote](https://www.amazon.science/author/roberto-barra-chicote)\n\n- [**Machine translation verbosity control for automatic dubbing**](https://www.amazon.science/publications/machine-translation-verbosity-control-for-automatic-dubbing)\n[Surafel Melaku Lakew](https://www.amazon.science/https:/www.amazon.science/author/surafel-melaku-lakew), [Marcello Federico](https://www.amazon.science/author/marcello-federico), [Yue Wang](https://www.amazon.science/author/yue-wang), [Cuong Hoang](https://www.amazon.science/author/cuong-hoang), [Yogesh Virkar](https://www.amazon.science/author/yogesh-virkar), [Roberto Barra-Chicote](https://www.amazon.science/author/roberto-barra-chicote), [Robert Enyedi ](https://www.amazon.science/author/robert-enyedi)\n\nAmazon’s Text-to-Speech team has an ICASSP paper on the unusual topic of **computer-assisted pronunciation training**, a feature of some language learning applications. The researchers’ method would enable language learning apps to accept a wider range of word pronunciations, to score pronunciations more accurately, and to provide more reliable feedback:\n\n- [**Mispronunciation detection in non-native (L2) English with uncertainty modeling**](https://www.amazon.science/publications/mispronunciation-detection-in-non-native-l2-english-with-uncertainty-modeling)\n[Daniel Korzekwa](https://www.amazon.science/author/daniel-korzekwa), [Jaime Lorenzo-Trueba](https://www.amazon.science/author/jaime-lorenzo-trueba), Szymon Zaporowski, [Shira Calamaro](https://www.amazon.science/author/shira-calamaro), [Thomas Drugman](https://www.amazon.science/author/thomas-drugman), Bozena Kostek \n\n![image.png](https://dev-media.amazoncloud.cn/8c27775d555241e6a158f61419d36e54_image.png)\n\nThe architecture of a new Amazon model for separating a recording's vocal tracks and instrumental tracks.\n\nFROM [\"SEMI-SUPERVISED SINGING VOICE SEPARATION WITH NOISE SELF-TRAINING\"](https://www.amazon.science/publications/semi-supervised-singing-voice-separation-with-noise-self-training)\n\nAnother paper investigates the topic of **singing voice separation**, or separating vocal tracks from instrumental tracks in song recordings:\n\n- [**Semi-supervised singing voice separation with noise self-training**](https://www.amazon.science/publications/semi-supervised-singing-voice-separation-with-noise-self-training)\nZhepei Wang, [Ritwik Giri](https://www.amazon.science/author/ritwik-giri), [Umut Isik](https://www.amazon.science/author/umut-isik), [Jean-Marc Valin](https://www.amazon.science/author/jean-marc-valin), [Arvindh Krishnaswamy ](https://www.amazon.science/author/arvindh-krishnaswamy)\n\nFinally, two of Amazon’s ICASSP papers, although they do evaluate applications in speech recognition and audio classification, present general **machine learning methodologies** that could apply to a range of problems. One paper investigates federated learning, a distributed-learning technique in which multiple servers, each with a different, local store of training data, collectively build a machine learning model without exchanging data. The other presents a new loss function for training classification models on synthetic data created by transforming real data — for instance, training a sound classification model with samples that have noise added to them artificially.\n\n- [**Cross-silo federated training in the cloud with diversity scaling and semi-supervised learning**](https://www.amazon.science/publications/cross-silo-federated-training-in-the-cloud-with-diversity-scaling-and-semi-supervised-learning)\n[Kishore Nandury](https://www.amazon.science/author/kishore-nandury), [Anand Mohan](https://www.amazon.science/author/anand-mohan), [Frederick Weber](https://www.amazon.science/author/frederick-weber) \n- [**Enhancing audio augmentation methods with consistency learning**](https://www.amazon.science/publications/enhancing-audio-augmentation-methods-with-consistency-learning)\nTurab Iqbal, [Karim Helwani](https://www.amazon.science/author/karim-helwani), [Arvindh Krishnaswamy](https://www.amazon.science/author/arvindh-krishnaswamy), Wenwu Wang\n\nAlso at ICASSP, on June 8, seven Amazon scientists will be participating in a half-hour live Q&A. Conference registrants may [submit questions to the panelists](https://amazon.qualtrics.com/jfe/form/SV_0uf9M3PTwoL0Rfw) online.\n\nABOUT THE AUTHOR\n\n#### [Larry Hardesty](https://www.amazon.science/author/larry-hardesty)\n\nLarry Hardesty is the editor of the Amazon Science blog. Previously, he was a senior editor at MIT Technology Review and the computer science writer at the MIT News Office.\n\n\n\n\n","render":"<p>The International Conference on Acoustics, Speech, and Signal Processing (<a href=\"https://www.amazon.science/conferences-and-events/icassp-2021\" target=\"_blank\">ICASSP</a>) starts next week, and as Alexa principal research scientist Ariya Rastrow <a href=\"https://www.amazon.science/blog/icassp-what-signal-processing-has-come-to-mean\" target=\"_blank\">explained last year</a>, it casts a wide net. The topics of the 36 Amazon research papers at this year’s ICASSP range from the classic signal-processing problems of noise and echo cancellation to such far-flung problems as separating song vocals from instrumental tracks and regulating translation length.</p>\n<p>A plurality of the papers, however, concentrate on the core technology of <strong>automatic speech recognition</strong> (ASR), or converting an acoustic speech signal into text:</p>\n<ul>\n<li>\n<p><a href=\"https://www.amazon.science/publications/asr-n-best-fusion-nets\" target=\"_blank\"><strong>ASR n-best fusion nets</strong></a><br />\n<a href=\"https://www.amazon.science/author/xinyue-liu\" target=\"_blank\">Xinyue Liu</a>, Mingda Li, <a href=\"https://www.amazon.science/author/luoxin-chen\" target=\"_blank\">Luoxin Chen</a>, <a href=\"https://www.amazon.science/author/prashan-wanigasekara\" target=\"_blank\">Prashan Wanigasekara</a>, <a href=\"https://www.amazon.science/author/wael-hamza\" target=\"_blank\">Weitong Ruan</a>, <a href=\"https://www.amazon.science/author/haidar-khan\" target=\"_blank\">Haidar Khan</a>, <a href=\"https://www.amazon.science/author/wael-hamza\" target=\"_blank\">Wael Hamza</a>, <a href=\"https://www.amazon.science/author/chengwei-su\" target=\"_blank\">Chengwei Su</a></p>\n</li>\n<li>\n<p><a href=\"https://www.amazon.science/publications/bifocal-neural-asr-exploiting-keyword-spotting-for-inference-optimization\" target=\"_blank\"><strong>Bifocal neural ASR: Exploiting keyword spotting for inference optimization</strong></a><br />\nJon Macoskey, <a href=\"https://www.amazon.science/author/grant-p-strimel\" target=\"_blank\">Grant P. Strimel</a>, <a href=\"https://www.amazon.science/author/ariya-rastrow\" target=\"_blank\">Ariya Rastrow</a></p>\n</li>\n<li>\n<p><a href=\"https://www.amazon.science/publications/domain-aware-neural-language-models-for-speech-recognition\" target=\"_blank\"><strong>Domain-aware neural language models for speech recognition</strong></a><br />\n<a href=\"https://www.amazon.science/author/linda-liu\" target=\"_blank\">Linda Liu</a>, <a href=\"https://www.amazon.science/author/yile-gu\" target=\"_blank\">Yile Gu</a>, <a href=\"https://www.amazon.science/author/aditya-gourav\" target=\"_blank\">Aditya Gourav</a>, <a href=\"https://www.amazon.science/author/ankur-gandhe\" target=\"_blank\">Ankur Gandhe</a>, <a href=\"https://www.amazon.science/author/shashank-kalmane\" target=\"_blank\">Shashank Kalmane</a>, <a href=\"https://www.amazon.science/author/denis-filiminov\" target=\"_blank\">Denis Filimonov</a>, <a href=\"https://www.amazon.science/author/ariya-rastrow\" target=\"_blank\">Ariya Rastrow</a>, <a href=\"https://www.amazon.science/author/ivan-bulyko\" target=\"_blank\">Ivan Bulyko</a></p>\n</li>\n<li>\n<p><a href=\"https://www.amazon.science/publications/end-to-end-multi-channel-transformer-for-speech-recognition\" target=\"_blank\"><strong>End-to-end multi-channel transformer for speech recognition</strong></a><br />\n<a href=\"https://www.amazon.science/author/feng-ju-chang\" target=\"_blank\">Feng-Ju Chang</a>, <a href=\"https://www.amazon.science/author/martin-radfar\" target=\"_blank\">Martin Radfar</a>, <a href=\"https://www.amazon.science/author/athanasios-mouchtaris\" target=\"_blank\">Athanasios Mouchtaris</a>, <a href=\"https://www.amazon.science/author/brian-king\" target=\"_blank\">Brian King</a>, <a href=\"https://www.amazon.science/author/siegfried-kunzmann\" target=\"_blank\">Siegfried Kunzmann</a></p>\n</li>\n<li>\n<p><a href=\"https://www.amazon.science/publications/improved-robustness-to-disfluencies-in-rnn-transducer-based-speech-recognition\" target=\"_blank\"><strong>Improved robustness to disfluencies in RNN-transducer-based speech recognition</strong></a><br />\n<a href=\"https://www.amazon.science/author/valentin-mendelev\" target=\"_blank\">Valentin Mendelev</a>, Tina Raissi, Guglielmo Camporese, <a href=\"https://www.amazon.science/author/manuel-giollo\" target=\"_blank\">Manuel Giollo</a></p>\n</li>\n<li>\n<p><a href=\"https://www.amazon.science/publications/personalization-strategies-for-end-to-end-speech-recognition-systems\" target=\"_blank\"><strong>Personalization strategies for end-to-end speech recognition systems</strong></a><br />\n<a href=\"https://www.amazon.science/author/aditya-gourav\" target=\"_blank\">Aditya Gourav</a>, <a href=\"https://www.amazon.science/author/linda-liu\" target=\"_blank\">Linda Liu</a>, <a href=\"https://www.amazon.science/author/ankur-gandhe\" target=\"_blank\">Ankur Gandhe</a>, <a href=\"https://www.amazon.science/author/yile-gu\" target=\"_blank\">Yile Gu</a>, <a href=\"https://www.amazon.science/author/guitang-lan\" target=\"_blank\">Guitang Lan</a>, <a href=\"https://www.amazon.science/author/xiangyang-huang\" target=\"_blank\">Xiangyang Huang</a>, <a href=\"https://www.amazon.science/author/shashank-kalmane\" target=\"_blank\">Shashank Kalmane</a>, <a href=\"https://www.amazon.science/author/gautam-tiwari\" target=\"_blank\">Gautam Tiwari</a>, <a href=\"https://www.amazon.science/author/denis-filiminov\" target=\"_blank\">Denis Filimonov</a>, <a href=\"https://www.amazon.science/author/ariya-rastrow\" target=\"_blank\">Ariya Rastrow</a>, <a href=\"https://www.amazon.science/author/andreas-stolcke\" target=\"_blank\">Andreas Stolcke</a>, <a href=\"https://www.amazon.science/author/ivan-bulyko\" target=\"_blank\">Ivan Bulyko</a></p>\n</li>\n<li>\n<p><a href=\"https://www.amazon.science/publications/redat-accent-invariant-representation-for-end-to-end-asr-by-domain-adversarial-training-with-relabeling\" target=\"_blank\"><strong>reDAT: Accent-invariant representation for end-to-end ASR by domain adversarial training with relabeling</strong></a><br />\nHu Hu, <a href=\"https://www.amazon.science/author/xuesong-yang\" target=\"_blank\">Xuesong Yang</a>, <a href=\"https://www.amazon.science/author/zeynab-raeesy\" target=\"_blank\">Zeynab Raeesy</a>, <a href=\"https://www.amazon.science/author/jinxi-guo\" target=\"_blank\">Jinxi Guo</a>, <a href=\"https://www.amazon.science/author/gokce-keskin\" target=\"_blank\">Gokce Keskin</a>, <a href=\"https://www.amazon.science/author/harish-arsikere\" target=\"_blank\">Harish Arsikere</a>, <a href=\"https://www.amazon.science/author/ariya-rastrow\" target=\"_blank\">Ariya Rastrow</a>, <a href=\"https://www.amazon.science/author/andreas-stolcke\" target=\"_blank\">Andreas Stolcke</a>, <a href=\"https://www.amazon.science/author/roland-maas\" target=\"_blank\">Roland Maas</a></p>\n</li>\n<li>\n<p><a href=\"https://www.amazon.science/publications/sparsification-via-compressed-sensing-for-automatic-speech-recognition\" target=\"_blank\"><strong>Sparsification via compressed sensing for automatic speech recognition</strong></a><br />\nKai Zhen, <a href=\"https://www.amazon.science/author/hieu-duy-nguyen\" target=\"_blank\">Hieu Duy Nguyen</a>, <a href=\"https://www.amazon.science/author/feng-ju-chang\" target=\"_blank\">Feng-Ju Chang</a>, <a href=\"https://www.amazon.science/author/athanasios-mouchtaris\" target=\"_blank\">Athanasios Mouchtaris</a>, <a href=\"https://www.amazon.science/author/ariya-rastrow\" target=\"_blank\">Ariya Rastrow</a></p>\n</li>\n<li>\n<p><a href=\"https://www.amazon.science/publications/streaming-multi-speaker-asr-with-rnn-t\" target=\"_blank\"><strong>Streaming multi-speaker ASR with RNN-T</strong></a><br />\n<a href=\"https://www.amazon.science/author/ilya-sklyar\" target=\"_blank\">Ilya Sklyar</a>, <a href=\"https://www.amazon.science/author/anna-piunova\" target=\"_blank\">Anna Piunova</a>, <a href=\"https://www.amazon.science/author/yulan-liu\" target=\"_blank\">Yulan Liu</a></p>\n</li>\n<li>\n<p><a href=\"https://www.amazon.science/publications/using-synthetic-audio-to-improve-the-recognition-of-out-of-vocabulary-words-in-end-to-end-asr-systems\" target=\"_blank\"><strong>Using synthetic audio to improve the recognition of out-of-vocabulary words in end-to-end ASR systems</strong></a><br />\nXianrui Zheng, <a href=\"https://www.amazon.science/author/yulan-liu\" target=\"_blank\">Yulan Liu</a>, <a href=\"https://www.amazon.science/author/deniz-gunceler\" target=\"_blank\">Deniz Gunceler</a>, <a href=\"https://www.amazon.science/author/daniel-willett\" target=\"_blank\">Daniel Willett</a></p>\n</li>\n</ul>\n<p><img src=\"https://dev-media.amazoncloud.cn/15c4b6efdfa140a28a0c4a3a8e0006e2_image.png\" alt=\"image.png\" /></p>\n<p>To enable personalization of end-to-end automatic-speech-recognition systems, Linda Liu, Aditya Gourav and their colleagues use a word-level biasing finite state transducer, or FST (left). A subword-level FST preserves the weights of the word-level FST. For instance, the weight between state 0 and 5 of the subword-level FST (representing the word “player”) is (-1.6) +(- 1.6)+(-4.8) = -8.</p>\n<p>FROM <a href=\"https://www.amazon.science/publications/personalization-strategies-for-end-to-end-speech-recognition-systems\" target=\"_blank\">“PERSONALIZATION STRATEGIES FOR END-TO-END SPEECH RECOGNITION SYSTEMS”</a></p>\n<p>Two of the papers address <strong>language (or code) switching</strong>, a more complicated version of ASR in which the speech recognizer must also determine which of several possible languages is being spoken:</p>\n<ul>\n<li>\n<p><a href=\"https://www.amazon.science/publications/joint-asr-and-language-identification-using-rnn-t-an-efficent-approach-to-dynamic-language-switching\" target=\"_blank\"><strong>Joint ASR and language identification using RNN-T: An efficent approach to dynamic language switching</strong></a><br />\n<a href=\"https://www.amazon.science/author/surabhi-punjabi\" target=\"_blank\">Surabhi Punjabi</a>, <a href=\"https://www.amazon.science/author/harish-arsikere\" target=\"_blank\">Harish Arsikere</a>, <a href=\"https://www.amazon.science/author/zeynab-raeesy\" target=\"_blank\">Zeynab Raeesy</a>, <a href=\"https://www.amazon.science/author/chander-chandak\" target=\"_blank\">Chander Chandak</a>, <a href=\"https://www.amazon.science/author/nikhil-bhave\" target=\"_blank\">Nikhil Bhave</a>, <a href=\"https://www.amazon.science/author/markus-mueller\" target=\"_blank\">Markus Mueller</a>, <a href=\"https://www.amazon.science/author/sergio-murillo\" target=\"_blank\">Sergio Murillo</a>, <a href=\"https://www.amazon.science/author/ariya-rastrow\" target=\"_blank\">Ariya Rastrow</a>, <a href=\"https://www.amazon.science/author/andreas-stolcke\" target=\"_blank\">Andreas Stolcke</a>, <a href=\"https://www.amazon.science/author/jasha-droppo\" target=\"_blank\">Jasha Droppo</a>, <a href=\"https://www.amazon.science/author/sri-garimella\" target=\"_blank\">Sri Garimella</a>, <a href=\"https://www.amazon.science/author/roland-maas\" target=\"_blank\">Roland Maas</a>, <a href=\"https://www.amazon.science/author/mat-hans\" target=\"_blank\">Mat Hans</a>, <a href=\"https://www.amazon.science/author/athanasios-mouchtaris\" target=\"_blank\">Athanasios Mouchtaris</a>, <a href=\"https://www.amazon.science/author/siegfried-kunzmann\" target=\"_blank\">Siegfried Kunzmann</a></p>\n</li>\n<li>\n<p><a href=\"https://www.amazon.science/publications/transformer-transducers-for-code-switched-speech-recognition\" target=\"_blank\"><strong>Transformer-transducers for code-switched speech recognition</strong></a><br />\nSiddharth Dalmia, <a href=\"https://www.amazon.science/author/yuzong-liu\" target=\"_blank\">Yuzong Liu</a>, <a href=\"https://www.amazon.science/author/srikanth-ronanki\" target=\"_blank\">Srikanth Ronanki</a>, <a href=\"https://www.amazon.science/author/katrin-kirchhoff\" target=\"_blank\">Katrin Kirchhoff </a></p>\n</li>\n</ul>\n<p>The acoustic speech signal contains more information than just the speaker’s words; how the words are said can change their meaning. Such <strong>paralinguistic signals</strong> can be useful for a voice agent trying to determine how to interpret the raw text. Two of Amazon’s ICASSP papers focus on such signals:</p>\n<ul>\n<li>\n<p><a href=\"https://www.amazon.science/publications/contrastive-unsupervised-learning-for-speech-emotion-recognition\" target=\"_blank\"><strong>Contrastive unsupervised learning for speech emotion recognition</strong></a><br />\nMao Li, <a href=\"https://www.amazon.science/author/bo-yang\" target=\"_blank\">Bo Yang</a>, <a href=\"https://www.amazon.science/author/joshua-levy\" target=\"_blank\">Joshua Levy</a>, <a href=\"https://www.amazon.science/author/andreas-stolcke\" target=\"_blank\">Andreas Stolcke</a>, <a href=\"https://www.amazon.science/author/viktor-rozgic\" target=\"_blank\">Viktor Rozgic</a>, <a href=\"https://www.amazon.science/author/spyros-matsoukas\" target=\"_blank\">Spyros Matsoukas</a>, <a href=\"https://www.amazon.science/author/constantinos-papayiannis\" target=\"_blank\">Constantinos Papayiannis</a>, <a href=\"https://www.amazon.science/author/daniel-bone\" target=\"_blank\">Daniel Bone</a>, <a href=\"https://www.amazon.science/author/chao-wang\" target=\"_blank\">Chao Wang</a></p>\n</li>\n<li>\n<p><a href=\"https://www.amazon.science/publications/disentanglement-for-audiovisual-emotion-recognition-using-multitask-setup\" target=\"_blank\"><strong>Disentanglement for audiovisual emotion recognition using multitask setup</strong></a><br />\nRaghuveer Peri, <a href=\"https://www.amazon.science/author/srinivas-parthasarathy\" target=\"_blank\">Srinivas Parthasarathy</a>, <a href=\"https://www.amazon.science/author/charles-bradshaw\" target=\"_blank\">Charles Bradshaw</a>, <a href=\"https://www.amazon.science/blog/null\" target=\"_blank\">Shiva Sundaram</a></p>\n</li>\n</ul>\n<p>Several papers address other <strong>extensions of ASR</strong>, such as speaker diarization, or tracking which of several speakers issues each utterance; inverse text normalization, or converting the raw ASR output into a format useful to downstream applications; and acoustic event classification, or recognizing sounds other than human voices:</p>\n<ul>\n<li>\n<p><a href=\"https://www.amazon.science/publications/bw-eda-eend-streaming-end-to-end-neural-speaker-diarization-for-a-variable-number-of-speakers\" target=\"_blank\"><strong>BW-EDA-EEND: Streaming end-to-end neural speaker diarization for a variable number of speakers</strong></a><br />\n<a href=\"https://www.amazon.science/author/eunjung-han\" target=\"_blank\">Eunjung Han</a>, <a href=\"https://www.amazon.science/author/chul-lee\" target=\"_blank\">Chul Lee</a>, <a href=\"https://www.amazon.science/author/andreas-stolcke\" target=\"_blank\">Andreas Stolcke</a></p>\n</li>\n<li>\n<p><a href=\"https://www.amazon.science/publications/neural-inverse-text-normalization\" target=\"_blank\"><strong>Neural inverse text normalization</strong></a><br />\n<a href=\"https://www.amazon.science/author/monica-sunkara\" target=\"_blank\">Monica Sunkara</a>, <a href=\"https://www.amazon.science/author/chaitanya-shivade\" target=\"_blank\">Chaitanya Shivade</a>, <a href=\"https://www.amazon.science/author/sravan-bodapati\" target=\"_blank\">Sravan Bodapati</a>, <a href=\"https://www.amazon.science/author/katrin-kirchhoff\" target=\"_blank\">Katrin Kirchhoff</a></p>\n</li>\n<li>\n<p><a href=\"https://www.amazon.science/publications/unsupervised-and-semi-supervised-few-shot-acoustic-event-classification\" target=\"_blank\"><strong>Unsupervised and semi-supervised few-shot acoustic event classification</strong></a><br />\nHsin-Ping Huang, <a href=\"https://www.amazon.science/author/krishna-c-puvvada\" target=\"_blank\">Krishna C. Puvvada</a>, <a href=\"https://www.amazon.science/author/ming-sun\" target=\"_blank\">Ming Sun</a>, <a href=\"https://www.amazon.science/author/chao-wang\" target=\"_blank\">Chao Wang</a></p>\n</li>\n</ul>\n<p><strong>Speech enhancement</strong>, or removing noise and echo from the speech signal, has been a prominent topic at ICASSP since the conference began in 1976. But more recent work on the topic — including Amazon’s two papers this year — uses deep-learning methods:</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/8a67b6ead95941bf9b209f4c3804b6ce_image.png\" alt=\"image.png\" /></p>\n<p>The structure of a joint echo control and noise suppression system from Amazon. A microphone (mic) captures the output of a loudspeaker, along with noise and echo. The echo is partially cancelled by an adaptive filter (ĥf), which uses the signal to the speaker. The microphone signal then passes to a residual-echo-suppression (RES) algorithm.</p>\n<p>FROM <a href=\"https://www.amazon.science/publications/low-complexity-real-time-joint-neural-echo-control-and-speech-enhancement-based-on-percepnet\" target=\"_blank\">“LOW-COMPLEXITY, REAL-TIME JOINT NEURAL ECHO CONTROL AND SPEECH ENHANCEMENT BASED ON PERCEPNET”</a></p>\n<ul>\n<li>\n<p><a href=\"https://www.amazon.science/publications/enhancing-into-the-codec-noise-robust-speech-coding-with-vector-quantized-autoencoders\" target=\"_blank\"><strong>Enhancing into the codec: Noise robust speech coding with vector-quantized autoencoders</strong></a><br />\nJonah Casebeer, Vinjai Vale, <a href=\"https://www.amazon.science/author/umut-isik\" target=\"_blank\">Umut Isik</a>, <a href=\"https://www.amazon.science/author/jean-marc-valin\" target=\"_blank\">Jean-Marc Valin</a>, <a href=\"https://www.amazon.science/author/ritwik-giri\" target=\"_blank\">Ritwik Giri</a>, <a href=\"https://www.amazon.science/author/arvindh-krishnaswamy\" target=\"_blank\">Arvindh Krishnaswamy</a></p>\n</li>\n<li>\n<p><a href=\"https://www.amazon.science/publications/low-complexity-real-time-joint-neural-echo-control-and-speech-enhancement-based-on-percepnet\" target=\"_blank\"><strong>Low-complexity, real-time joint neural echo control and speech enhancement based on Percepnet</strong></a><br />\n<a href=\"https://www.amazon.science/author/jean-marc-valin\" target=\"_blank\">Jean-Marc Valin</a>, <a href=\"https://www.amazon.science/author/srikanth-v-tenneti\" target=\"_blank\">Srikanth V. Tenneti</a>, <a href=\"https://www.amazon.science/author/karim-helwani\" target=\"_blank\">Karim Helwani</a>, <a href=\"https://www.amazon.science/author/umut-isik\" target=\"_blank\">Umut Isik</a>, <a href=\"https://www.amazon.science/author/arvindh-krishnaswamy\" target=\"_blank\">Arvindh Krishnaswamy</a></p>\n</li>\n</ul>\n<p>Every interaction with Alexa begins with a wake word — usually “Alexa”, but sometimes “computer” or “Echo”. So at ICASSP, Amazon usually presents work on wake word detection — or <strong>keyword spotting</strong>, as it’s more generally known:</p>\n<ul>\n<li><a href=\"https://www.amazon.science/publications/exploring-the-application-of-synthetic-audio-in-training-keyword-spotters\" target=\"_blank\"><strong>Exploring the application of synthetic audio in training keyword spotters</strong></a><br />\n<a href=\"https://www.amazon.science/author/andrew-werchniak\" target=\"_blank\">Andrew Werchniak</a>, <a href=\"https://www.amazon.science/author/roberto-barra-chicote\" target=\"_blank\">Roberto Barra-Chicote</a>, <a href=\"https://www.amazon.science/author/yuriy-mishchenko\" target=\"_blank\">Yuriy Mishchenko</a>, <a href=\"https://www.amazon.science/author/jasha-droppo\" target=\"_blank\">Jasha Droppo</a>, <a href=\"https://www.amazon.science/author/jeff-condal\" target=\"_blank\">Jeff Condal</a>, <a href=\"https://www.amazon.science/author/peng-liu\" target=\"_blank\">Peng Liu</a>, <a href=\"https://www.amazon.science/author/anish-shah\" target=\"_blank\">Anish Shah </a></li>\n</ul>\n<p>In many spoken-language systems, the next step after ASR <strong>is natural-language understanding</strong> (NLU), or making sense of the text output from the ASR system:</p>\n<p>-<a href=\"https://www.amazon.science/publications/introducing-deep-reinforcement-learning-to-nlu-ranking-tasks\" target=\"_blank\"> <strong>Introducing deep reinforcement learning to NLU ranking tasks</strong></a><br />\n<a href=\"https://www.amazon.science/author/ge-yu\" target=\"_blank\">Ge Yu</a>, <a href=\"https://www.amazon.science/author/chengwei-su\" target=\"_blank\">Chengwei Su</a>, <a href=\"https://www.amazon.science/author/emre-barut\" target=\"_blank\">Emre Barut</a></p>\n<ul>\n<li><a href=\"https://www.amazon.science/publications/language-model-is-all-you-need-natural-language-understanding-as-question-answering\" target=\"_blank\"><strong>Language model is all you need: Natural language understanding as question answering</strong></a><br />\n<a href=\"https://www.amazon.science/author/mahdi-namazifar\" target=\"_blank\">Mahdi Namazifar</a>, <a href=\"https://www.amazon.science/author/alexandros-papangelis\" target=\"_blank\">Alexandros Papangelis</a>, <a href=\"https://www.amazon.science/author/gokhan-tur\" target=\"_blank\">Gokhan Tur</a>, <a href=\"https://www.amazon.science/author/dilek-hakkani-tur\" target=\"_blank\">Dilek Hakkani-Tür </a></li>\n</ul>\n<p>In some contexts, however, it’s possible to perform both ASR and NLU with a single model, in a task known as <strong>spoken-language understanding:</strong></p>\n<ul>\n<li>\n<p><a href=\"https://www.amazon.science/publications/do-as-i-mean-not-as-i-say-sequence-loss-training-for-spoken-language-understanding\" target=\"_blank\"><strong>Do as I mean, not as I say: Sequence loss training for spoken language understanding</strong></a><br />\n<a href=\"https://www.amazon.science/author/milind-rao\" target=\"_blank\">Milind Rao</a>, <a href=\"https://www.amazon.science/author/pranav-dheram\" target=\"_blank\">Pranav Dheram</a>, <a href=\"https://www.amazon.science/author/gautam-tiwari\" target=\"_blank\">Gautam Tiwari</a>, <a href=\"https://www.amazon.science/author/anirudh-raju\" target=\"_blank\">Anirudh Raju</a>, <a href=\"https://www.amazon.science/author/jasha-droppo\" target=\"_blank\">Jasha Droppo</a>, <a href=\"https://www.amazon.science/author/ariya-rastrow\" target=\"_blank\">Ariya Rastrow</a>, <a href=\"https://www.amazon.science/author/andreas-stolcke\" target=\"_blank\">Andreas Stolcke </a></p>\n</li>\n<li>\n<p><a href=\"https://www.amazon.science/publications/graph-enhanced-query-rewriting-for-spoken-language-understanding-system\" target=\"_blank\"><strong>Graph enhanced query rewriting for spoken language understanding system</strong></a><br />\nSiyang Yuan, <a href=\"https://www.amazon.science/author/saurabh-gupta\" target=\"_blank\">Saurabh Gupta</a>, <a href=\"https://www.amazon.science/author/xing-fan\" target=\"_blank\">Xing Fan</a>, <a href=\"https://www.amazon.science/author/derek-liu\" target=\"_blank\">Derek Liu</a>, <a href=\"https://www.amazon.science/author/yang-liu\" target=\"_blank\">Yang Liu</a>, C<a href=\"https://www.amazon.science/author/chenlei-guo\" target=\"_blank\">henlei (Edward) Guo</a></p>\n</li>\n<li>\n<p><a href=\"https://www.amazon.science/publications/top-down-attention-in-end-to-end-spoken-language-understanding\" target=\"_blank\"><strong>Top-down attention in end-to-end spoken language understanding</strong></a><br />\nYixin Chen, <a href=\"https://www.amazon.science/author/weiyi-lu\" target=\"_blank\">Weiyi Lu</a>, <a href=\"https://www.amazon.science/author/alejandro-mottini\" target=\"_blank\">Alejandro Mottini </a>, <a href=\"https://www.amazon.science/author/erran-li\" target=\"_blank\">Erran Li</a>, <a href=\"https://www.amazon.science/author/jasha-droppo\" target=\"_blank\">Jasha Droppo</a>, <a href=\"https://www.amazon.science/author/zheng-du\" target=\"_blank\">Zheng Du</a>, <a href=\"https://www.amazon.science/author/belinda-zeng\" target=\"_blank\">Belinda Zeng</a></p>\n</li>\n</ul>\n<p><img src=\"https://dev-media.amazoncloud.cn/f753621dbe374383bf89768b087db820_image.png\" alt=\"image.png\" /></p>\n<p>A spoken-language-understanding system combines automatic speech recognition (ASR) and natural-language understanding (NLU) in a single model.</p>\n<p>FROM <a href=\"https://www.amazon.science/publications/do-as-i-mean-not-as-i-say-sequence-loss-training-for-spoken-language-understanding\" target=\"_blank\">“DO AS I MEAN, NOT AS I SAY: SEQUENCE LOSS TRAINING FOR SPOKEN LANGUAGE UNDERSTANDING”</a></p>\n<p>An interaction with a voice service, which begins with keyword spotting, ASR, and NLU, often culminates with the agent’s use of synthesized speech to relay a response. The agent’s <strong>text-to-speech</strong> model converts the textual outputs of various NLU and dialogue systems into speech:</p>\n<ul>\n<li>\n<p><a href=\"https://www.amazon.science/publications/camp-a-two-stage-approach-to-modelling-prosody-in-context\" target=\"_blank\"><strong>CAMP: A two-stage approach to modelling prosody in context</strong></a><br />\nZack Hodari, <a href=\"https://www.amazon.science/author/alexis-moinet\" target=\"_blank\">Alexis Moinet</a>, <a href=\"https://www.amazon.science/author/sri-karlapati\" target=\"_blank\">Sri Karlapati</a>, <a href=\"https://www.amazon.science/author/jaime-lorenzo-trueba\" target=\"_blank\">Jaime Lorenzo-Trueba</a>, <a href=\"https://www.amazon.science/author/thomas-merritt\" target=\"_blank\">Thomas Merritt</a>, <a href=\"https://www.amazon.science/author/arnaud-joly\" target=\"_blank\">Arnaud Joly</a>, <a href=\"https://www.amazon.science/author/ammar-abbas\" target=\"_blank\">Ammar Abbas</a>, <a href=\"https://www.amazon.science/author/penny-karanasou\" target=\"_blank\">Penny Karanasou</a>, <a href=\"https://www.amazon.science/author/thomas-drugman\" target=\"_blank\">Thomas Drugman</a></p>\n</li>\n<li>\n<p><a href=\"https://www.amazon.science/publications/low-resource-expressive-text-to-speech-using-data-augmentation\" target=\"_blank\"><strong>Low-resource expressive text-to-speech using data augmentation</strong></a><br />\n<a href=\"https://www.amazon.science/author/goeric-huybrechts\" target=\"_blank\">Goeric Huybrechts</a>, <a href=\"https://www.amazon.science/author/thomas-merritt\" target=\"_blank\">Thomas Merritt</a>, <a href=\"https://www.amazon.science/author/giulia-comini\" target=\"_blank\">Giulia Comini</a>, <a href=\"https://www.amazon.science/author/bartek-perz\" target=\"_blank\">Bartek Perz</a>, <a href=\"https://www.amazon.science/author/raahil-shah\" target=\"_blank\">Raahil Shah</a>, <a href=\"https://www.amazon.science/author/jaime-lorenzo-trueba\" target=\"_blank\">Jaime Lorenzo-Trueba</a></p>\n</li>\n<li>\n<p><a href=\"https://www.amazon.science/publications/prosodic-representation-learning-and-contextual-sampling-for-neural-text-to-speech\" target=\"_blank\"><strong>Prosodic representation learning and contextual sampling for neural text-to-speech</strong></a><br />\n<a href=\"https://www.amazon.science/author/sri-karlapati\" target=\"_blank\">Sri Karlapati</a>, <a href=\"\" target=\"_blank\">Ammar Abbas</a>, Zack Hodari], <a href=\"https://www.amazon.science/author/alexis-moinet\" target=\"_blank\">Alexis Moinet</a>, <a href=\"https://www.amazon.science/author/arnaud-joly\" target=\"_blank\">Arnaud Joly</a>, <a href=\"https://www.amazon.science/author/penny-karanasou\" target=\"_blank\">Penny Karanasou</a>, <a href=\"https://www.amazon.science/author/thomas-drugman\" target=\"_blank\">Thomas Drugman</a></p>\n</li>\n<li>\n<p><a href=\"https://www.amazon.science/publications/universal-neural-vocoding-with-parallel-wavenet\" target=\"_blank\"><strong>Universal neural vocoding with Parallel WaveNet</strong></a><br />\n<a href=\"https://www.amazon.science/author/yunlong-jiao\" target=\"_blank\">Yunlong Jiao</a>, <a href=\"https://www.amazon.science/author/adam-gabrys\" target=\"_blank\">Adam Gabrys</a>, Georgi Tinchev, <a href=\"https://www.amazon.science/author/bartosz-putrycz\" target=\"_blank\">Bartosz Putrycz</a>, <a href=\"https://www.amazon.science/author/daniel-korzekwa\" target=\"_blank\">Daniel Korzekwa</a>, <a href=\"https://www.amazon.science/author/viacheslav-klimkov\" target=\"_blank\">Viacheslav Klimkov</a></p>\n</li>\n</ul>\n<p>All of the preceding research topics have implications for voice services like Alexa, but Amazon has a range of other products and services that rely on audio-signal processing. Three of Amazon’s papers at this year’s ICASSP relate to <strong>audio-video synchronization</strong>: two deal with dubbing audio in one language onto video shot in another, and one describes how to detect synchronization errors in video — as when, for example, the sound of a tennis ball being struck and the shot of the racquet hitting the ball are misaligned:</p>\n<ul>\n<li>\n<p><a href=\"https://www.amazon.science/publications/detection-of-audio-video-synchronization-errors-via-event-detection\" target=\"_blank\"><strong>Detection of audio-video synchronization errors via event detection</strong></a><br />\nJoshua P. Ebenezer, <a href=\"https://www.amazon.science/author/yongjun-wu\" target=\"_blank\">Yongjun Wu</a>, <a href=\"https://www.amazon.science/author/hai-wei\" target=\"_blank\">Hai Wei</a>, <a href=\"https://www.amazon.science/author/sriram-sethuraman\" target=\"_blank\">Sriram Sethuraman</a>, <a href=\"https://www.amazon.science/author/zongyi-liu\" target=\"_blank\">Zongyi Liu </a></p>\n</li>\n<li>\n<p><a href=\"https://www.amazon.science/publications/improvements-to-prosodic-alignment-for-automatic-dubbing\" target=\"_blank\"><strong>Improvements to prosodic alignment for automatic dubbing</strong></a><br />\n<a href=\"https://www.amazon.science/author/yogesh-virkar\" target=\"_blank\">Yogesh Virkar</a>, <a href=\"https://www.amazon.science/author/marcello-federico\" target=\"_blank\">Marcello Federico</a>, <a href=\"https://www.amazon.science/author/robert-enyedi\" target=\"_blank\">Robert Enyedi</a>, <a href=\"https://www.amazon.science/author/roberto-barra-chicote\" target=\"_blank\">Roberto Barra-Chicote</a></p>\n</li>\n<li>\n<p><a href=\"https://www.amazon.science/publications/machine-translation-verbosity-control-for-automatic-dubbing\" target=\"_blank\"><strong>Machine translation verbosity control for automatic dubbing</strong></a><br />\n<a href=\"https://www.amazon.science/https:/www.amazon.science/author/surafel-melaku-lakew\" target=\"_blank\">Surafel Melaku Lakew</a>, <a href=\"https://www.amazon.science/author/marcello-federico\" target=\"_blank\">Marcello Federico</a>, <a href=\"https://www.amazon.science/author/yue-wang\" target=\"_blank\">Yue Wang</a>, <a href=\"https://www.amazon.science/author/cuong-hoang\" target=\"_blank\">Cuong Hoang</a>, <a href=\"https://www.amazon.science/author/yogesh-virkar\" target=\"_blank\">Yogesh Virkar</a>, <a href=\"https://www.amazon.science/author/roberto-barra-chicote\" target=\"_blank\">Roberto Barra-Chicote</a>, <a href=\"https://www.amazon.science/author/robert-enyedi\" target=\"_blank\">Robert Enyedi </a></p>\n</li>\n</ul>\n<p>Amazon’s Text-to-Speech team has an ICASSP paper on the unusual topic of <strong>computer-assisted pronunciation training</strong>, a feature of some language learning applications. The researchers’ method would enable language learning apps to accept a wider range of word pronunciations, to score pronunciations more accurately, and to provide more reliable feedback:</p>\n<ul>\n<li><a href=\"https://www.amazon.science/publications/mispronunciation-detection-in-non-native-l2-english-with-uncertainty-modeling\" target=\"_blank\"><strong>Mispronunciation detection in non-native (L2) English with uncertainty modeling</strong></a><br />\n<a href=\"https://www.amazon.science/author/daniel-korzekwa\" target=\"_blank\">Daniel Korzekwa</a>, <a href=\"https://www.amazon.science/author/jaime-lorenzo-trueba\" target=\"_blank\">Jaime Lorenzo-Trueba</a>, Szymon Zaporowski, <a href=\"https://www.amazon.science/author/shira-calamaro\" target=\"_blank\">Shira Calamaro</a>, <a href=\"https://www.amazon.science/author/thomas-drugman\" target=\"_blank\">Thomas Drugman</a>, Bozena Kostek</li>\n</ul>\n<p><img src=\"https://dev-media.amazoncloud.cn/8c27775d555241e6a158f61419d36e54_image.png\" alt=\"image.png\" /></p>\n<p>The architecture of a new Amazon model for separating a recording’s vocal tracks and instrumental tracks.</p>\n<p>FROM <a href=\"https://www.amazon.science/publications/semi-supervised-singing-voice-separation-with-noise-self-training\" target=\"_blank\">“SEMI-SUPERVISED SINGING VOICE SEPARATION WITH NOISE SELF-TRAINING”</a></p>\n<p>Another paper investigates the topic of <strong>singing voice separation</strong>, or separating vocal tracks from instrumental tracks in song recordings:</p>\n<ul>\n<li><a href=\"https://www.amazon.science/publications/semi-supervised-singing-voice-separation-with-noise-self-training\" target=\"_blank\"><strong>Semi-supervised singing voice separation with noise self-training</strong></a><br />\nZhepei Wang, <a href=\"https://www.amazon.science/author/ritwik-giri\" target=\"_blank\">Ritwik Giri</a>, <a href=\"https://www.amazon.science/author/umut-isik\" target=\"_blank\">Umut Isik</a>, <a href=\"https://www.amazon.science/author/jean-marc-valin\" target=\"_blank\">Jean-Marc Valin</a>, <a href=\"https://www.amazon.science/author/arvindh-krishnaswamy\" target=\"_blank\">Arvindh Krishnaswamy </a></li>\n</ul>\n<p>Finally, two of Amazon’s ICASSP papers, although they do evaluate applications in speech recognition and audio classification, present general <strong>machine learning methodologies</strong> that could apply to a range of problems. One paper investigates federated learning, a distributed-learning technique in which multiple servers, each with a different, local store of training data, collectively build a machine learning model without exchanging data. The other presents a new loss function for training classification models on synthetic data created by transforming real data — for instance, training a sound classification model with samples that have noise added to them artificially.</p>\n<ul>\n<li><a href=\"https://www.amazon.science/publications/cross-silo-federated-training-in-the-cloud-with-diversity-scaling-and-semi-supervised-learning\" target=\"_blank\"><strong>Cross-silo federated training in the cloud with diversity scaling and semi-supervised learning</strong></a><br />\n<a href=\"https://www.amazon.science/author/kishore-nandury\" target=\"_blank\">Kishore Nandury</a>, <a href=\"https://www.amazon.science/author/anand-mohan\" target=\"_blank\">Anand Mohan</a>, <a href=\"https://www.amazon.science/author/frederick-weber\" target=\"_blank\">Frederick Weber</a></li>\n<li><a href=\"https://www.amazon.science/publications/enhancing-audio-augmentation-methods-with-consistency-learning\" target=\"_blank\"><strong>Enhancing audio augmentation methods with consistency learning</strong></a><br />\nTurab Iqbal, <a href=\"https://www.amazon.science/author/karim-helwani\" target=\"_blank\">Karim Helwani</a>, <a href=\"https://www.amazon.science/author/arvindh-krishnaswamy\" target=\"_blank\">Arvindh Krishnaswamy</a>, Wenwu Wang</li>\n</ul>\n<p>Also at ICASSP, on June 8, seven Amazon scientists will be participating in a half-hour live Q&A. Conference registrants may <a href=\"https://amazon.qualtrics.com/jfe/form/SV_0uf9M3PTwoL0Rfw\" target=\"_blank\">submit questions to the panelists</a> online.</p>\n<p>ABOUT THE AUTHOR</p>\n<h4><a id=\"Larry_Hardestyhttpswwwamazonscienceauthorlarryhardesty_162\"></a><a href=\"https://www.amazon.science/author/larry-hardesty\" target=\"_blank\">Larry Hardesty</a></h4>\n<p>Larry Hardesty is the editor of the Amazon Science blog. Previously, he was a senior editor at MIT Technology Review and the computer science writer at the MIT News Office.</p>\n"}