The science behind ultrasonic motion sensing for Echo

{"value":"Reducing false positives for rare events, adapting Echo hardware to ultrasound sensing, and enabling concurrent ultrasound sensing and music playback are just a few challenges Amazon researchers addressed.\n\nLast fall, Amazon introduced [ultrasound-based motion detection](https://www.amazon.com/gp/help/customer/display.html?nodeId=GSR22RYDWS3KBUYW), to enable Alexa customers to initiate [Routines](https://www.amazon.com/gp/help/customer/display.html?ref_=hp_left_v4_sib&nodeId=GJJGMCNB6PFEYJYX), or prespecified sequences of actions, when certain types of motion are detected (or not detected). For instance, Routines could be configured to automatically turn the lights on, play music, or announce weather or traffic when motion is detected near a customer’s Echo device, indicating that someone has entered the room.\n\nThere are many different motion detection technologies, but we selected ultrasound because it works in low-light conditions or even in the dark and, unlike radio waves, ultrasound waves do not travel through drywall, so there's less risk of detecting motion in other rooms.\n\nGetting the technology to work on existing Echo hardware required innovation on a number of fronts — among other things, reducing false alarms by adequately sampling long-tail data; devising a self-calibration feature to adjust to variations in commodity hardware; and filtering out distortion during concurrent ultrasound detection and music playback. 
We describe the details below.\n\n![下载.gif](https://dev-media.amazoncloud.cn/93966a0270f54d2d8ec011605e926946_%E4%B8%8B%E8%BD%BD.gif)\n\nAn example of how to use the Alexa app to configure Routines with ultrasound-based motion-detection triggers.\n\n\n#### **Ultrasound-based presence detection**\n\n\nWith ultrasound-based presence detection (USPD), an ultrasonic signal (>=32 kHz) is transmitted via onboard loudspeakers, and changes in the signal received at the microphones are monitored to detect motion.\n\nUltrasound sensors can be broadly categorized as using Doppler sensing or time-of-flight sensing. In Doppler sensing, once the signal is transmitted, the system detects motion by looking for frequency shifts in the recorded spectrum of the signal, which are caused by its reflection from moving objects. This frequency shift is similar to the shift in sound frequencies you hear in a police car siren as it approaches you or moves away from you.\n\n![下载.gif](https://dev-media.amazoncloud.cn/ede5a9540c7146d0b25441d8beb372fb_%E4%B8%8B%E8%BD%BD.gif)\n\nDoppler sensing detects motion by looking for frequency shifts in the recorded spectrum of a transmitted signal, which are caused by reflection from moving objects.\n\nIn time-of-flight sensing, variations in the arrival time of the reflected signal are monitored to detect changes in the environment. We use Doppler sensing due to the robustness of its motion detection signal and because it generalizes well across the cases when Alexa is or is not playing audio simultaneously.\n\nThe magnitude of the Doppler-shifted signal depends on factors such as distance from target to source, the size and absorption coefficient of the target, the absorption coefficient of the room, and even the humidity and temperature in the room. 
In addition, when a person moves through a closed space, not only do we observe multiple Doppler components due to various parts of the body moving in different directions with different speeds, but we also observe repetitions of those components due to reflections.\n\nBecause of all these complexities, the signal received at the source is not at all as clean as a single tone with a frequency shift. In practice, what we observe looks more like this:\n\n![下载.jpg](https://dev-media.amazoncloud.cn/7ad955c4869548209a1bbfa6c9eb094f_%E4%B8%8B%E8%BD%BD.jpg)\n\nSpectrogram of the signal received at device microphones when there is motion near the device.\n\nFurther, moving objects such as fans and curtains introduce their own Doppler shifts, which have to be rejected since they do not necessarily indicate people’s presence. Below are two spectrograms, one of a room with no motion other than a rotating floor fan and another with both a fan and human motion near a device. As can be seen, they are difficult to tell apart.\n\nThese complications mean that conventional signal processing is insufficient to recognize human motion from Doppler-shifted signals. So we instead use deep learning, which should be able to recognize more heterogeneous patterns in the signal.\n\n![下载.gif](https://dev-media.amazoncloud.cn/f64252c32507412bb96f74d60dc1fb87_%E4%B8%8B%E8%BD%BD.gif)\n\nSpectrogram of the signal received at device microphones when there is no motion in the room other than a rotating floor fan.\n\nBelow is a high-level block diagram of our USPD algorithm. On the signal transmitter side, a device- and environment-dependent optimal ultrasound signal is transmitted through the onboard loudspeaker. This signal gets reflected from a moving object and is then captured by the onboard microphone array. 
The signal is preprocessed and then passed to a neural-network-based classifier to detect motion.\n\n![下载.gif](https://dev-media.amazoncloud.cn/047217be0b984000a3189815f0c9f7a6_%E4%B8%8B%E8%BD%BD.gif)\n\nSpectrogram of the signal with both the rotating floor fan and human motion.\n\n![下载.jpg](https://dev-media.amazoncloud.cn/1f34f608f0d14cc8bd9e758a95d2842f_%E4%B8%8B%E8%BD%BD.jpg)\n\nHigh-level block diagram of USPD algorithm.\n\n\n#### **False alarms**\n\n\nThe biggest algorithmic challenge we faced was achieving high detection accuracy while keeping the false-alarm rate low. Reducing false-alarm rates is especially challenging because of the well-known long-tail problem in AI: there are a multitude of rare events that could fool a detector, but their rarity means that they’re usually underrepresented in training data.\n\nTo address this problem, we started by training a seed model on a relatively small amount of data. First, we used the seed model to sort through large amounts of data and extract infrequent events. Second, we used a model trained on that rare-event data to automatically capture infrequent events during our internal data collection process. Data captured by these methods eventually helped us address the long-tail problem and achieve extremely low false-alarm rates.\n\n\n#### **Deployment challenges**\n\n\nDeploying the trained model brought its own challenges. We wanted to enable USPD with the lowest possible emission level, while still retaining a sufficient detection range, and do all of this with no additional hardware costs (i.e., using the available microphones and loudspeakers on Echo devices instead of dedicated ultrasound transmitters). Further, we decided to support always-on motion detection. This meant being able to detect motion even when a user is playing music from the device speakers. 
Finally, we added algorithms to improve the user experience in the presence of only minor motion and spent a considerable amount of effort to support Amazon’s goal of reducing our devices’ power consumption. We describe these in more detail below.\n\n\n#### **Hardware variations and environmental conditions**\n\n\nUsing onboard loudspeakers and microphones for ultrasound transmission and sensing meant that we had to manage variable acoustic characteristics. Mass-produced devices are known to have a certain variation in amplitude and phase response, and it is very difficult to control the response of loudspeakers in the ultrasonic frequency range without affecting yield rates. To manage these hardware variations and environmental variations, we designed automatic device calibration modules to tailor emission frequencies and levels to both the devices’ hardware idiosyncrasies and the acoustic properties of the rooms in which they are used. This helped us provide a consistent user experience across devices without increasing device costs.\n\n\n#### **Sensing with concurrent music playback**\n\n\nMusic playback is a key use case for Echo devices, which poses challenges, since we use device loudspeakers to simultaneously play music and emit ultrasound. Specifically, when low-frequency music content (such as bass sounds) is played together with an ultrasonic signal, the distortion shows up as noise in the ultrasound region. 
This noise is inaudible to listeners, but it interferes with the frequencies we use for sensing.\n\n![下载.gif](https://dev-media.amazoncloud.cn/0f77835359364b2697f57e2a2ccf2213_%E4%B8%8B%E8%BD%BD.gif)\n\nSignal spectrum observed in an empty room with concurrent music playback.\n\nIn order to enhance the ultrasound signal and get reasonable range performance in the presence of concurrent music, we developed an adaptive algorithm that uses the different magnitude and phase of distortion and motion features at different microphones to identify and remove distortion.\n\n![下载.gif](https://dev-media.amazoncloud.cn/ca311ec7db0749a3a02f54c8768dd98f_%E4%B8%8B%E8%BD%BD.gif)\n\nSignal spectrum observed with both concurrent music playback and motion near device.\n\n\n#### **Major and minor motion**\n\n\nHuman movements can be broadly [categorized as either major](https://www.nema.org/standards/view/occupancy-motion-sensors-standard) or minor. Major movements include walking into or through an area, while minor movements include reaching for a telephone while seated, turning the pages in a book, opening a file folder, and picking up a coffee cup. Detecting minor movements is difficult, as their ultrasound spectra have very low signal-to-noise ratios (SNRs) compared to major movements, and detecting low-SNR events often means high false-positive rates. 
At the same time, detecting minor movements is extremely important for recognizing a user’s continued presence after walking into the room.","render":"Reducing false positives for rare events, adapting Echo hardware to ultrasound sensing, and enabling concurrent ultrasound sensing and music playback are just a few challenges Amazon researchers addressed.\nLast fall, Amazon introduced <a href=\"https://www.amazon.com/gp/help/customer/display.html?nodeId=GSR22RYDWS3KBUYW\" target=\"_blank\">ultrasound-based motion detection</a>, to enable Alexa customers to initiate <a href=\"https://www.amazon.com/gp/help/customer/display.html?ref_=hp_left_v4_sib&nodeId=GJJGMCNB6PFEYJYX\" target=\"_blank\">Routines</a>, or prespecified sequences of actions, when certain types of motion are detected (or not detected). For instance, Routines could be configured to automatically turn the lights on, play music, or announce weather or traffic when motion is detected near a customer’s Echo device, indicating that someone has entered the room.\nThere are many different motion detection technologies, but we selected ultrasound because it works in low-light conditions or even in the dark and, unlike radio waves, ultrasound waves do not travel through drywall, so there’s less risk of detecting motion in other rooms.\nGetting the technology to work on existing Echo hardware required innovation on a number of fronts — among other things, reducing false alarms by adequately sampling long-tail data; devising a self-calibration feature to adjust to variations in commodity hardware; and filtering out distortion during concurrent ultrasound detection and music playback. 
We describe the details below.\n<img src=\"https://dev-media.amazoncloud.cn/93966a0270f54d2d8ec011605e926946_%E4%B8%8B%E8%BD%BD.gif\" alt=\"下载.gif\" />\nAn example of how to use the Alexa app to configure Routines with ultrasound-based motion-detection triggers.\n<h4><a id=\"Ultrasoundbased_presence_detection_13\"></a>Ultrasound-based presence detection</h4>\nWith ultrasound-based presence detection (USPD), an ultrasonic signal (>=32 kHz) is transmitted via onboard loudspeakers, and changes in the signal received at the microphones are monitored to detect motion.\nUltrasound sensors can be broadly categorized as using Doppler sensing or time-of-flight sensing. In Doppler sensing, once the signal is transmitted, the system detects motion by looking for frequency shifts in the recorded spectrum of the signal, which are caused by its reflection from moving objects. This frequency shift is similar to the shift in sound frequencies you hear in a police car siren as it approaches you or moves away from you.\n<img src=\"https://dev-media.amazoncloud.cn/ede5a9540c7146d0b25441d8beb372fb_%E4%B8%8B%E8%BD%BD.gif\" alt=\"下载.gif\" />\nDoppler sensing detects motion by looking for frequency shifts in the recorded spectrum of a transmitted signal, which are caused by reflection from moving objects.\nIn time-of-flight sensing, variations in the arrival time of the reflected signal are monitored to detect changes in the environment. We use Doppler sensing due to the robustness of its motion detection signal and because it generalizes well across the cases when Alexa is or is not playing audio simultaneously.\nThe magnitude of the Doppler-shifted signal depends on factors such as distance from target to source, the size and absorption coefficient of the target, the absorption coefficient of the room, and even the humidity and temperature in the room. 
In addition, when a person moves through a closed space, not only do we observe multiple Doppler components due to various parts of the body moving in different directions with different speeds, but we also observe repetitions of those components due to reflections.\nBecause of all these complexities, the signal received at the source is not at all as clean as a single tone with a frequency shift. In practice, what we observe looks more like this:\n<img src=\"https://dev-media.amazoncloud.cn/7ad955c4869548209a1bbfa6c9eb094f_%E4%B8%8B%E8%BD%BD.jpg\" alt=\"下载.jpg\" />\nSpectrogram of the signal received at device microphones when there is motion near the device.\nFurther, moving objects such as fans and curtains introduce their own Doppler shifts, which have to be rejected since they do not necessarily indicate people’s presence. Below are two spectrograms, one of a room with no motion other than a rotating floor fan and another with both a fan and human motion near a device. As can be seen, they are difficult to tell apart.\nThese complications mean that conventional signal processing is insufficient to recognize human motion from Doppler-shifted signals. So we instead use deep learning, which should be able to recognize more heterogeneous patterns in the signal.\n<img src=\"https://dev-media.amazoncloud.cn/f64252c32507412bb96f74d60dc1fb87_%E4%B8%8B%E8%BD%BD.gif\" alt=\"下载.gif\" />\nSpectrogram of the signal received at device microphones when there is no motion in the room other than a rotating floor fan.\nBelow is a high-level block diagram of our USPD algorithm. On the signal transmitter side, a device- and environment-dependent optimal ultrasound signal is transmitted through the onboard loudspeaker. This signal gets reflected from a moving object and is then captured by the onboard microphone array. 
The signal is preprocessed and then passed to a neural-network-based classifier to detect motion.\n<img src=\"https://dev-media.amazoncloud.cn/047217be0b984000a3189815f0c9f7a6_%E4%B8%8B%E8%BD%BD.gif\" alt=\"下载.gif\" />\nSpectrogram of the signal with both the rotating floor fan and human motion.\n<img src=\"https://dev-media.amazoncloud.cn/1f34f608f0d14cc8bd9e758a95d2842f_%E4%B8%8B%E8%BD%BD.jpg\" alt=\"下载.jpg\" />\nHigh-level block diagram of USPD algorithm.\n<h4><a id=\"False_alarms_53\"></a>False alarms</h4>\nThe biggest algorithmic challenge we faced was achieving high detection accuracy while keeping the false-alarm rate low. Reducing false-alarm rates is especially challenging because of the well-known long-tail problem in AI: there are a multitude of rare events that could fool a detector, but their rarity means that they’re usually underrepresented in training data.\nTo address this problem, we started by training a seed model on a relatively small amount of data. First, we used the seed model to sort through large amounts of data and extract infrequent events. Second, we used a model trained on that rare-event data to automatically capture infrequent events during our internal data collection process. Data captured by these methods eventually helped us address the long-tail problem and achieve extremely low false-alarm rates.\n<h4><a id=\"Deployment_challenges_61\"></a>Deployment challenges</h4>\nDeploying the trained model brought its own challenges. We wanted to enable USPD with the lowest possible emission level, while still retaining a sufficient detection range, and do all of this with no additional hardware costs (i.e., using the available microphones and loudspeakers on Echo devices instead of dedicated ultrasound transmitters). Further, we decided to support always-on motion detection. This meant being able to detect motion even when a user is playing music from the device speakers. 
Finally, we added algorithms to improve the user experience in the presence of only minor motion and spent a considerable amount of effort to support Amazon’s goal of reducing our devices’ power consumption. We describe these in more detail below.\n<h4><a id=\"Hardware_variations_and_environmental_conditions_67\"></a>Hardware variations and environmental conditions</h4>\nUsing onboard loudspeakers and microphones for ultrasound transmission and sensing meant that we had to manage variable acoustic characteristics. Mass-produced devices are known to have a certain variation in amplitude and phase response, and it is very difficult to control the response of loudspeakers in the ultrasonic frequency range without affecting yield rates. To manage these hardware variations and environmental variations, we designed automatic device calibration modules to tailor emission frequencies and levels to both the devices’ hardware idiosyncrasies and the acoustic properties of the rooms in which they are used. This helped us provide a consistent user experience across devices without increasing device costs.\n<h4><a id=\"Sensing_with_concurrent_music_playback_73\"></a>Sensing with concurrent music playback</h4>\nMusic playback is a key use case for Echo devices, which poses challenges, since we use device loudspeakers to simultaneously play music and emit ultrasound. Specifically, when low-frequency music content (such as bass sounds) is played together with an ultrasonic signal, the distortion shows up as noise in the ultrasound region. 
This noise is inaudible to listeners, but it interferes with the frequencies we use for sensing.\n<img src=\"https://dev-media.amazoncloud.cn/0f77835359364b2697f57e2a2ccf2213_%E4%B8%8B%E8%BD%BD.gif\" alt=\"下载.gif\" />\nSignal spectrum observed in an empty room with concurrent music playback.\nIn order to enhance the ultrasound signal and get reasonable range performance in the presence of concurrent music, we developed an adaptive algorithm that uses the different magnitude and phase of distortion and motion features at different microphones to identify and remove distortion.\n<img src=\"https://dev-media.amazoncloud.cn/ca311ec7db0749a3a02f54c8768dd98f_%E4%B8%8B%E8%BD%BD.gif\" alt=\"下载.gif\" />\nSignal spectrum observed with both concurrent music playback and motion near device.\n<h4><a id=\"Major_and_minor_motion_89\"></a>Major and minor motion</h4>\nHuman movements can be broadly <a href=\"https://www.nema.org/standards/view/occupancy-motion-sensors-standard\" target=\"_blank\">categorized as either major</a> or minor. Major movements include walking into or through an area, while minor movements include reaching for a telephone while seated, turning the pages in a book, opening a file folder, and picking up a coffee cup. Detecting minor movements is difficult, as their ultrasound spectra have very low signal-to-noise ratios (SNRs) compared to major movements, and detecting low-SNR events often means high false-positive rates. At the same time, detecting minor movements is extremely important for recognizing a user’s continued presence after walking into the room.\n"}
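
To give a sense of scale for the Doppler shifts the article describes, here is a minimal sketch of the textbook two-way Doppler relation for a stationary emitter and microphone with a moving reflector. It is an illustration only, not the USPD algorithm. The function names, the 343 m/s speed of sound (dry air at roughly 20 °C), and the 1 m/s walking speed are assumptions introduced here; only the 32 kHz carrier comes from the article.

```python
# Illustrative only: ideal two-way Doppler shift for a single moving reflector.
# All names and default values below are assumptions made for this sketch.

SPEED_OF_SOUND_M_S = 343.0  # assumed: dry air at about 20 degrees Celsius


def reflected_frequency_hz(carrier_hz: float,
                           radial_speed_m_s: float,
                           speed_of_sound_m_s: float = SPEED_OF_SOUND_M_S) -> float:
    """Frequency heard back at a stationary device from a moving reflector.

    radial_speed_m_s is positive when the reflector approaches the device
    and negative when it moves away.
    """
    if abs(radial_speed_m_s) >= speed_of_sound_m_s:
        raise ValueError("reflector speed must be below the speed of sound")
    c, v = speed_of_sound_m_s, radial_speed_m_s
    # The reflector first hears a shifted tone (moving observer), then
    # re-radiates it as a moving source, which gives (c + v) / (c - v).
    return carrier_hz * (c + v) / (c - v)


def doppler_shift_hz(carrier_hz: float,
                     radial_speed_m_s: float,
                     speed_of_sound_m_s: float = SPEED_OF_SOUND_M_S) -> float:
    """Difference between the reflected frequency and the emitted carrier."""
    return reflected_frequency_hz(
        carrier_hz, radial_speed_m_s, speed_of_sound_m_s) - carrier_hz


if __name__ == "__main__":
    carrier = 32_000.0  # Hz, the lower bound quoted in the article
    for speed in (1.0, -1.0, 0.1):  # hypothetical radial speeds in m/s
        shift = doppler_shift_hz(carrier, speed)
        print(f"speed {speed:+.1f} m/s -> shift {shift:+.1f} Hz")
```

Under these assumptions, a person walking toward the device at about 1 m/s shifts a 32 kHz tone by a little under 200 Hz, while a slow gesture at 0.1 m/s shifts it by under 20 Hz. That is consistent with the article's point that minor motion is much harder to separate from noise than major motion. Real rooms add multiple body parts, reflections, and interfering movers such as fans, which this single-reflector formula does not model.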
