Voiceitt extends the voice revolution to people with nonstandard speech

*(Editor’s note: This is the third in a series of articles Amazon Science is publishing related to the science behind products and services from companies in which Amazon has invested. In 2018, the Alexa Fund provided Voiceitt with an accelerator investment.)*

Approximately 7.5 million people in the U.S. have trouble using their voices, according to the [National Institute on Deafness and Other Communication Disorders](https://www.nidcd.nih.gov/health/statistics/statistics-voice-speech-and-language). As computer technology moves from text-based to voice-based interfaces, people with nonstandard speech are in danger of being left behind.

[Voiceitt](https://voiceitt.com/), a startup in Ramat-Gan, Israel, says it is committed to ensuring that doesn’t happen. With Voiceitt, customers train their own personalized speech recognition models, adapted to their speech patterns, that let them communicate with voice-controlled devices or with other people.

Last week, Voiceitt announced the [official public release](https://www.prnewswire.com/news-releases/voiceitt-launches-state-of-the-art-speech-recognition-app-to-empower-communication-for-individuals-with-speech-impairments-301307887.html) of its app.

<video src="https://dev-media.amazoncloud.cn/8ad1e11e4fbf42e3b48812cf5e78e8bd_Rec%200004.mp4" controls="controls"></video>

**Voiceitt moments**

Voiceitt customers react to the control over their environments offered by Voiceitt's Alexa integration.

The [Alexa Fund](https://developer.amazon.com/en-US/alexa/alexa-startups/alexa-fund) — an Amazon venture capital investment program — was an early investor in Voiceitt, and integration with Alexa is built into the Voiceitt app.

“We do see users who get Voiceitt specifically to use Alexa,” says Roy Weiss, Voiceitt’s vice president of products. “They see direct and quite immediate results.
From the first command, they unlock capabilities they couldn’t access before.”

“Now that I don’t have to call my mom and dad in, or my aide or my assistant in, and tell them ‘Hey, I need this; I need that’, I can do it independently,” says one Voiceitt user with cerebral palsy. “I use it all the time. … I use it to do everything.”

“After now being speech motor disabled for over three years, including three years of speech dysfunction and two years of no intelligible speech, Voiceitt is a key part of my getting my voice back,” writes another Voiceitt user.

#### **The app**

Voiceitt’s interface is an iOS mobile app with two modes: conversation mode lets customers communicate with other people, using synthetic speech and the phone’s speaker; smart-home mode lets customers interact with Alexa.

Each mode has a set of speech categories. For conversation mode, the categories are scenarios such as transportation, shopping, and medical visits; for smart home, they’re Alexa functions such as lights, music, and TV control.

Each category includes a set of common, predefined phrases. In smart-home mode, those phrases are Alexa commands, such as “Lights on” to turn on lights. A command can be configured to trigger a particular action; for instance, “Lights on” might be configured to activate a specific light in a specific room. Customers repeat each phrase multiple times to train a personal speech recognition model.

![image.png](https://dev-media.amazoncloud.cn/c7c16a0879f342229fb1f338773e984f_image.png)

An example of a Voiceitt scenario.
Each scenario has a set of associated predefined phrases.

CREDIT: COURTESY OF VOICEITT

#### **Modeling nonstandard speech**

Recognizing nonstandard speech differs from ordinary speech recognition in some fundamental ways, says Filip Jurcicek, speech recognition team lead at Voiceitt.

When training data is sparse — as it is in Voiceitt’s case, since customers generate it on the fly — the common approach to automatic speech recognition (ASR) is a pipelined method. In that method, an acoustic model converts acoustic data into phonemes, the shortest units of speech; a “dictionary” provides candidate word-level interpretations of the phonemes; and a language model adjudicates among possible word-level interpretations by considering the probability of each.

But with nonstandard speech, explains Matt Gibson, Voiceitt’s lead algorithm researcher, “we need to look farther than those phoneme-level features. We often see divergence from a normative pronunciation. For instance, if a word starts with a plosive like ‘b’ or ‘p’, the speaker might consistently precede it with an ‘n’ or ‘m’ sound — ‘mp’ or ‘mb’.”

This can cause problems for conventional mappings from sounds to phonemes and phonemes to words. Consequently, Jurcicek says, “We have to look at the phrase as a whole.”

#### **Now that I don’t have to call my mom and dad in, or my aide or my assistant in, and tell them ‘Hey, I need this; I need that’, I can do it independently.**

Voiceitt customer

In recent years, most commercial ASR services have moved from the pipelined approach to end-to-end models, in which a single neural network takes an acoustic signal as input and outputs text. This approach can improve accuracy, but it requires a large body of training data.

Typically, end-to-end ASR models use recurrent neural networks, which process sequential inputs in order.
An acoustic signal would be divided into a sequence of “frames”, each of which lasts just a few milliseconds, before passing to the recurrent neural net.

In order to “look at the phrase as a whole”, Jurcicek says, Voiceitt instead uses a convolutional neural network, which takes a much larger chunk of the acoustic signal as input. Originally designed to look for specific patterns of pixels wherever in an image they occur, convolutional neural networks can, similarly, look for telltale acoustic patterns wherever in a signal they occur.

“As long as customers are consistent in their pronunciation, this gives us the opportunity to exploit that consistency,” Jurcicek says. “This is where I believe Voiceitt really adds value for the user. Pronunciation doesn’t have to follow a standard dictionary.”

As customers train their custom models, Voiceitt uses their recorded speech both for training and testing. Once the output confidence of the model crosses some threshold, the phrase is “unlocked”, and the customer may begin using it to control a voice agent or communicate with others.

But the training doesn’t stop there. Every time the customer uses a phrase, it provides more training data for the model, which Voiceitt says it continuously updates to improve performance.

#### **The road ahead**

<video src="https://dev-media.amazoncloud.cn/ab81dfc44ce64794bf01c54cbd2b74e4_Rec%200005.mp4" controls="controls"></video>

**First day of Voiceitt pilot study**
In July 2019, residents of the Inglis residential-care facility in Philadelphia participated in a pilot study of Voiceitt's Alexa integration. Within minutes of beginning to train their customized speech recognition models, they were able to use verbal cues to turn lights off and on, change the TV channel, and more.

At present, Voiceitt’s finite menu of actions means that it’s possible to learn and store separate models for each customer.
But Voiceitt plans to scale the service up significantly, so Voiceitt researchers are investigating more efficient ways to train and store models.

“We’re looking at ways to aggregate existing models to come up with a more general background model, which would then act as a starting point from which we could adapt to new users,” Gibson says. “It may be possible to find commonalities among users and cluster them together into groups.”

In the meantime, however, Voiceitt is already making a difference in its customers’ lives. Many people with difficulty using their voices also have difficulty using their limbs and hands. For them, Voiceitt doesn’t just offer the ability to interact with voice agents; it offers the ability to exert sometimes unprecedented control over their environments. In the videos above, the reactions of customers using Voiceitt for the first time testify to how transformative that ability can be.

“It’s really inspiring to see,” Weiss says. “We all feel very privileged to create a product that really has a part in changing users’ lives.”

ABOUT THE AUTHOR

#### **[Larry Hardesty](https://www.amazon.science/author/larry-hardesty)**

Larry Hardesty is the editor of the Amazon Science blog. Previously, he was a senior editor at MIT Technology Review and the computer science writer at the MIT News Office.
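The train-then-unlock loop described in "Modeling nonstandard speech" can be made concrete with a toy sketch. Everything below is hypothetical: the `PhraseModel` class, the 2-D "feature vectors", and the nearest-centroid scoring are stand-ins invented for illustration, not Voiceitt's actual CNN-based system. Only the behavior mirrors the article's description: each repetition of a phrase adds training data, and a phrase becomes usable once recognition confidence on held-out repetitions crosses a threshold.

```python
# Illustrative sketch only -- not Voiceitt's actual model. A per-user phrase
# recognizer: nearest-centroid classification over pre-extracted feature
# vectors, with a softmax-style confidence and a threshold-based "unlock".
import math

class PhraseModel:
    def __init__(self, unlock_threshold=0.8):
        self.unlock_threshold = unlock_threshold
        self.examples = {}   # phrase -> list of training feature vectors
        self.centroids = {}  # phrase -> mean feature vector
        self.unlocked = set()

    def add_recording(self, phrase, features):
        """Store one repetition of a phrase and refit its centroid."""
        vecs = self.examples.setdefault(phrase, [])
        vecs.append(list(features))
        dim = len(features)
        self.centroids[phrase] = [
            sum(v[i] for v in vecs) / len(vecs) for i in range(dim)
        ]

    def recognize(self, features):
        """Return (best_phrase, confidence), where confidence is a
        softmax over negative distances to each phrase centroid."""
        exps = {p: math.exp(-math.dist(c, features))
                for p, c in self.centroids.items()}
        total = sum(exps.values())
        best = max(exps, key=exps.get)
        return best, exps[best] / total

    def try_unlock(self, phrase, heldout_recordings):
        """Unlock a phrase when every held-out repetition is recognized
        as that phrase with confidence above the threshold."""
        for f in heldout_recordings:
            best, conf = self.recognize(f)
            if best != phrase or conf < self.unlock_threshold:
                return False
        self.unlocked.add(phrase)
        return True

model = PhraseModel(unlock_threshold=0.8)
# Hypothetical 2-D features standing in for acoustic embeddings.
for f in ([2.0, 0.1], [2.0, -0.1]):
    model.add_recording("lights on", f)
for f in ([0.0, 2.0], [0.1, 2.0]):
    model.add_recording("play music", f)

best, conf = model.recognize([2.0, 0.0])
unlocked = model.try_unlock("lights on", [[1.9, 0.0], [2.1, 0.1]])
```

The "training doesn't stop there" step from the article maps onto calling `add_recording` again after each successful real-world use, so the centroid keeps tracking the speaker's pronunciation over time.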