Prime Video's work on sports field registration, recap/intro detection

海外精选

海外精选的内容汇集了全球优质的亚马逊云科技相关技术内容。同时，内容中提到的“AWS” 是 “Amazon Web Services” 的缩写，在此网站不作为商标展示。

{"value":"Like all of Amazon’s major technology groups, Amazon Prime Video has a dedicated team of scientists who are working constantly to find new ways to delight our customers and improve our products.\n\nOur work was on display at this year’s IEEE ++[Winter Conference on Applications of Computer Vision](https://www.amazon.science/conferences-and-events/amazon-wacv-2021)++, where we presented two papers. ++[One](https://www.amazon.science/publications/a-robust-and-efficient-framework-for-sports-field-registration)++ was on sports field registration, or understanding the spatial relationships between objects depicted in sports videos. The ++[other](https://www.amazon.science/publications/intro-and-recap-detection-for-movies-and-tv-series)++ was on recap and intro detection, or automatically identifying the recaps and intros at the beginnings of TV shows, so viewers can skip them if they want.\n\nSports field registration involves mapping video images onto a topographical model of the field, to enable enhancement of the video feed. It’s the technology behind the virtual first-down lines in American-football broadcasts or the virtual world-record lines in swimming broadcasts.\n\n<video src=\"https://dev-media.amazoncloud.cn/7ece907991d44eefbe09e53baacf9543_astro-navigation-video.mp4\" class=\"manvaVedio\" controls=\"controls\" style=\"width:160px;height:160px\"></video>\n\n**American football, with dense features**\n\nAt top is video of an American football play; bottom left is a visualization of our grid keypoints; bottom right is a visualization of our dense features.\n\nUsually, sports field registration requires onsite cameras equipped with sensors and calibrated to reference points on the field. Combining the sensor output with the cameras’ video yields very accurate field registration.\n\nWe address the problem of sports field registration in the absence of instrumentation, using video from a single camera capable of pan, tilt, and zoom (PTZ) motion. This could enable the addition of cutting-edge graphics to broadcasts of minor-league or amateur sporting events, broadcasts of less-popular sports, or even video signals from uninstrumented secondary cameras at major sporting events.\n\nWhere previous work on this problem modeled field topography using only a few keypoints — usually, intersections of lines laid down on the field — we model the field using a dense grid of keypoints.\n\n![image.png](https://dev-media.amazoncloud.cn/97c3c42a3dee49b2a5c80cdefe46c17a_image.png)\n\nA traditional model of a soccer field (left), with a few keypoints at the intersections of lines, and our model (right), with a dense grid of keypoints.\n\nUsing video annotated according to our modeling scheme, we train a neural network to correlate image pixels with specific keypoints in our model of the field.\n\nThe dense grid increases the precision of our registration — provided that we correctly identify the keypoints. But of course, keypoints that don’t lie at the intersections of field lines are harder to identify.\n\nConsequently, we use a second source of information to improve our mapping. This is a set of dense field features that represent the standard distances between lines on the field and between other identifiable regions of the field.\n\nIn the figure below, for instance, the black-and-white model at left illustrates the lines of an American-football field, while the black-and-white model at right illustrates the numbers marking the yard lines.\n\n![image.png](https://dev-media.amazoncloud.cn/d49a65e92c904f5794ed5a5cfaa35434_image.png)\n\nAn American-football field (top); maps of linear and regional features of the field (second row); and representations of those features using only the distance from each black pixel to the nearest white pixel in the feature map.\n\nThe glowing green elements of the bottom images are meant to indicate that features of the black-and-white models are being represented, not according to their absolute location on the field, but according to normalized distances between black pixels and white pixels. \n\nThat is, whereas the keypoints represent absolute field positions, the dense feature set represents field position relative to recurring visual elements of the field. It’s thus a complementary feature set that improves the mapping between a video frame and the sports field.\n\nUsing the dense features to verify keypoints adds computational overhead, however, and our system needs to work in real time. Our network architecture therefore incorporates several properties meant to reduce this overhead.\n\nThe first is that it is a multitask network: from the input data, it produces a single vector representation that passes to both the keypoint estimator and the dense-feature extractor.\n\n![image.png](https://dev-media.amazoncloud.cn/5ee9fdb6b60245609551561c1b233434_image.png)\n\nOur network architecture. A shared encoder produces a vector representation of the input data that passes to both the keypoint detector and the dense-feature extractor.\n\nThe second is that the network uses the dense features for verification only if it has reason to believe that the keypoint estimates are inaccurate. Specifically, given the initial keypoint estimate for a frame of video, the network takes several different samples of keypoints and determines whether they align with each other. If they don’t, it uses the dense features to refine its estimate (the self-verification and online-refinement modules in the diagram above).\n\nBy combining these techniques, we were able to get our sports field registration system to work in real time. In tests, we compared it to multiple state-of-the-art sports field registration systems on five data sets: soccer, American football, ice hockey, basketball, and tennis.\n\nOn different sports, our system’s performance ranged from comparable to baseline to much better. For American football, for instance, according to the standard version of the intersection-over-union measure, our system was 2.5 times as accurate as the best-performing baseline.\n\n<video src=\"https://dev-media.amazoncloud.cn/cbd75fb781374d62b3e8468754e662ac_astro-navigation-video.mp4\" class=\"manvaVedio\" controls=\"controls\" style=\"width:160px;height:160px\"></video>\n\n**Five sports**\n\nAt left are grid keypoints and the projection of field templates onto the videos of five different sports; at right are mappings of the camera’s field of view onto models of the fields.\n\n\n#### **Intro and recap detection**\n\n\nFans of Prime Video’s hit shows, such as The Marvelous Mrs. Maisel, are familiar with the option of skipping the introductions — which usually feature credits and theme music — and recaps — quick summaries of the narrative to date — at the beginning of individual episodes.\n\nWith existing content, however, providing the option to skip intros and recaps requires hand coding. We’d like to extend that option to other Prime Video programming through automatic detection of intros and recaps.\n\nBoth intros and recaps have distinguishing features that should make them detectable. Intros tend to involve text (credits) superimposed on the screen, often with extended musical performances in the background, while recaps usually involve unusually quick cuts between scenes. Frequently, they’re also introduced by text.\n\nOur detector is a neural network, with an architecture chosen to maximize responsiveness to these elements of intros and recaps. Unlike alternative approaches that require an entire video series to find intro and recap timestamps, our approach can work on each episode independently, which makes it more practical.\n\nWith our system, a given frame of video passes first to a convolutional neural network (CNN). CNNs are designed to step through input images, applying the same filters to successive blocks of pixels. They can thus learn to identify text regardless of what region of the screen it falls in. We also pass the input audio to the same CNN, which learns a fused representation of audio and video.\n\nThe output of the CNN then passes to a bidirectional long-short-term-memory (Bi-LSTM) network. An LSTM is a type of neural network that processes sequential inputs in order, so that each output reflects both the inputs and outputs that preceded it. A Bi-LSTM passes through the same sequence both forward and backward. This allows our network to recognize longer-term dependencies — such as the cutting rates in particular video sequences.\n\n![image.png](https://dev-media.amazoncloud.cn/4f8f34d9524041b8b7d8979b8a33947f_image.png)\n\nThe architecture of our intro and recap detector. The blue lines at the bottom represent individual frames of input video. The outputs of the conditional random field (CRF) are “R” for recap, “I” for intro, and “C” for content.\n\nFinally, the output of the LSTM passes to a conditional random field, which essentially performs curve smoothing. Smoother contours within a segment of video enable clearer identification of the boundaries between segments — between, say, intros and recaps, or between either and the new content of an episode.\n\nIn tests, we compared the performance of our system to baselines that used the same CNN but different methods to process the CNN’s output: a single-layer LSTM; a two-layer LSTM; a Bi-LSTM; and a Bi-LSTM that uses Viterbi decoding, rather than a CRF, for smoothing. We find that our system dramatically outperforms all four baselines. \n\nABOUT THE AUTHOR\n\n#### **[Raffay Hamid](https://www.amazon.science/author/raffay-hamid)**\n\nRaffay Hamid is a principal scientist with Prime Video.","render":"Like all of Amazon’s major technology groups, Amazon Prime Video has a dedicated team of scientists who are working constantly to find new ways to delight our customers and improve our products.\nOur work was on display at this year’s IEEE <ins><a href=\"https://www.amazon.science/conferences-and-events/amazon-wacv-2021\" target=\"_blank\">Winter Conference on Applications of Computer Vision</a></ins>, where we presented two papers. <ins><a href=\"https://www.amazon.science/publications/a-robust-and-efficient-framework-for-sports-field-registration\" target=\"_blank\">One</a></ins> was on sports field registration, or understanding the spatial relationships between objects depicted in sports videos. The <ins><a href=\"https://www.amazon.science/publications/intro-and-recap-detection-for-movies-and-tv-series\" target=\"_blank\">other</a></ins> was on recap and intro detection, or automatically identifying the recaps and intros at the beginnings of TV shows, so viewers can skip them if they want.\nSports field registration involves mapping video images onto a topographical model of the field, to enable enhancement of the video feed. It’s the technology behind the virtual first-down lines in American-football broadcasts or the virtual world-record lines in swimming broadcasts.\n<video src=\"https://dev-media.amazoncloud.cn/7ece907991d44eefbe09e53baacf9543_astro-navigation-video.mp4\" controls=\"controls\"></video>\nAmerican football, with dense features\nAt top is video of an American football play; bottom left is a visualization of our grid keypoints; bottom right is a visualization of our dense features.\nUsually, sports field registration requires onsite cameras equipped with sensors and calibrated to reference points on the field. Combining the sensor output with the cameras’ video yields very accurate field registration.\nWe address the problem of sports field registration in the absence of instrumentation, using video from a single camera capable of pan, tilt, and zoom (PTZ) motion. This could enable the addition of cutting-edge graphics to broadcasts of minor-league or amateur sporting events, broadcasts of less-popular sports, or even video signals from uninstrumented secondary cameras at major sporting events.\nWhere previous work on this problem modeled field topography using only a few keypoints — usually, intersections of lines laid down on the field — we model the field using a dense grid of keypoints.\n<img src=\"https://dev-media.amazoncloud.cn/97c3c42a3dee49b2a5c80cdefe46c17a_image.png\" alt=\"image.png\" />\nA traditional model of a soccer field (left), with a few keypoints at the intersections of lines, and our model (right), with a dense grid of keypoints.\nUsing video annotated according to our modeling scheme, we train a neural network to correlate image pixels with specific keypoints in our model of the field.\nThe dense grid increases the precision of our registration — provided that we correctly identify the keypoints. But of course, keypoints that don’t lie at the intersections of field lines are harder to identify.\nConsequently, we use a second source of information to improve our mapping. This is a set of dense field features that represent the standard distances between lines on the field and between other identifiable regions of the field.\nIn the figure below, for instance, the black-and-white model at left illustrates the lines of an American-football field, while the black-and-white model at right illustrates the numbers marking the yard lines.\n<img src=\"https://dev-media.amazoncloud.cn/d49a65e92c904f5794ed5a5cfaa35434_image.png\" alt=\"image.png\" />\nAn American-football field (top); maps of linear and regional features of the field (second row); and representations of those features using only the distance from each black pixel to the nearest white pixel in the feature map.\nThe glowing green elements of the bottom images are meant to indicate that features of the black-and-white models are being represented, not according to their absolute location on the field, but according to normalized distances between black pixels and white pixels.\nThat is, whereas the keypoints represent absolute field positions, the dense feature set represents field position relative to recurring visual elements of the field. It’s thus a complementary feature set that improves the mapping between a video frame and the sports field.\nUsing the dense features to verify keypoints adds computational overhead, however, and our system needs to work in real time. Our network architecture therefore incorporates several properties meant to reduce this overhead.\nThe first is that it is a multitask network: from the input data, it produces a single vector representation that passes to both the keypoint estimator and the dense-feature extractor.\n<img src=\"https://dev-media.amazoncloud.cn/5ee9fdb6b60245609551561c1b233434_image.png\" alt=\"image.png\" />\nOur network architecture. A shared encoder produces a vector representation of the input data that passes to both the keypoint detector and the dense-feature extractor.\nThe second is that the network uses the dense features for verification only if it has reason to believe that the keypoint estimates are inaccurate. Specifically, given the initial keypoint estimate for a frame of video, the network takes several different samples of keypoints and determines whether they align with each other. If they don’t, it uses the dense features to refine its estimate (the self-verification and online-refinement modules in the diagram above).\nBy combining these techniques, we were able to get our sports field registration system to work in real time. In tests, we compared it to multiple state-of-the-art sports field registration systems on five data sets: soccer, American football, ice hockey, basketball, and tennis.\nOn different sports, our system’s performance ranged from comparable to baseline to much better. For American football, for instance, according to the standard version of the intersection-over-union measure, our system was 2.5 times as accurate as the best-performing baseline.\n<video src=\"https://dev-media.amazoncloud.cn/cbd75fb781374d62b3e8468754e662ac_astro-navigation-video.mp4\" controls=\"controls\"></video>\nFive sports\nAt left are grid keypoints and the projection of field templates onto the videos of five different sports; at right are mappings of the camera’s field of view onto models of the fields.\n<h4><a id=\"Intro_and_recap_detection_59\"></a>Intro and recap detection</h4>\nFans of Prime Video’s hit shows, such as The Marvelous Mrs. Maisel, are familiar with the option of skipping the introductions — which usually feature credits and theme music — and recaps — quick summaries of the narrative to date — at the beginning of individual episodes.\nWith existing content, however, providing the option to skip intros and recaps requires hand coding. We’d like to extend that option to other Prime Video programming through automatic detection of intros and recaps.\nBoth intros and recaps have distinguishing features that should make them detectable. Intros tend to involve text (credits) superimposed on the screen, often with extended musical performances in the background, while recaps usually involve unusually quick cuts between scenes. Frequently, they’re also introduced by text.\nOur detector is a neural network, with an architecture chosen to maximize responsiveness to these elements of intros and recaps. Unlike alternative approaches that require an entire video series to find intro and recap timestamps, our approach can work on each episode independently, which makes it more practical.\nWith our system, a given frame of video passes first to a convolutional neural network (CNN). CNNs are designed to step through input images, applying the same filters to successive blocks of pixels. They can thus learn to identify text regardless of what region of the screen it falls in. We also pass the input audio to the same CNN, which learns a fused representation of audio and video.\nThe output of the CNN then passes to a bidirectional long-short-term-memory (Bi-LSTM) network. An LSTM is a type of neural network that processes sequential inputs in order, so that each output reflects both the inputs and outputs that preceded it. A Bi-LSTM passes through the same sequence both forward and backward. This allows our network to recognize longer-term dependencies — such as the cutting rates in particular video sequences.\n<img src=\"https://dev-media.amazoncloud.cn/4f8f34d9524041b8b7d8979b8a33947f_image.png\" alt=\"image.png\" />\nThe architecture of our intro and recap detector. The blue lines at the bottom represent individual frames of input video. The outputs of the conditional random field (CRF) are “R” for recap, “I” for intro, and “C” for content.\nFinally, the output of the LSTM passes to a conditional random field, which essentially performs curve smoothing. Smoother contours within a segment of video enable clearer identification of the boundaries between segments — between, say, intros and recaps, or between either and the new content of an episode.\nIn tests, we compared the performance of our system to baselines that used the same CNN but different methods to process the CNN’s output: a single-layer LSTM; a two-layer LSTM; a Bi-LSTM; and a Bi-LSTM that uses Viterbi decoding, rather than a CRF, for smoothing. We find that our system dramatically outperforms all four baselines.\nABOUT THE AUTHOR\n<h4><a id=\"Raffay_Hamidhttpswwwamazonscienceauthorraffayhamid_84\"></a><a href=\"https://www.amazon.science/author/raffay-hamid\" target=\"_blank\">Raffay Hamid</a></h4>\nRaffay Hamid is a principal scientist with Prime Video.\n"}

亚马逊云科技解决方案基于行业客户应用场景及技术领域的解决方案

联系亚马逊云科技专家