Paper on translating images into maps wins ICRA best-paper award

Today at the International Conference on Robotics and Automation ([ICRA](https://www.amazon.science/conferences-and-events/icra-2022)), a paper by my colleagues at the University of Surrey and me, “[Translating images into maps](https://www.amazon.science/publications/translating-images-into-maps)”, [won](https://www.icra2022.org/program/awards) the conference’s overall outstanding-paper award.

Our paper addresses the problem of constructing a top-down “bird’s-eye” view of a scene on the basis of standard sideways-on photographs. This is an important problem for autonomous vehicles, which need to build maps of their immediate environments to decide where it is safe to drive.

Our approach exploits the well-known fact that every column of pixels in a digital image corresponds to a single ray extending across a 2-D map of the field of view. Each pixel in the column, in turn, corresponds to a point along that ray.

![download.gif](https://dev-media.amazoncloud.cn/bd4a16f014e94fa3abad348f3ad83ab2_%E4%B8%8B%E8%BD%BD.gif)

Every column of pixels in a digital image corresponds to a ray extending across a 2-D map of the field of view. The ray’s origin is the location of the camera on the map.

Our insight was that, because of the one-to-one correspondence between pixels and points along a ray, the problem of translating images into maps has the same structure as sequence-to-sequence problems in natural-language processing ([NLP](https://www.amazon.science/tag/nlp)), such as machine translation, which converts a sequence of words in one language into a sequence of words in another.

We exploited this idea, using the established machinery for sequence-to-sequence processing, particularly Transformer-based models, to convert images to maps by directly translating each individual column of pixels into one ray along the map. In experiments we report in the paper, we compared this approach to a range of existing approaches on three different datasets, and it strongly outperformed all of them on all three.


#### **Focused attention**


The key to Transformers’ success is their use of attention mechanisms, which determine which elements of the input matter most to which elements of the output. So, for instance, if the input is a sentence in Hindi, and the output is a sentence in Spanish, the attention mechanism determines which words of the input are most relevant when determining each word of the output.

In general, however, Transformers require much more data for computer vision applications than for NLP applications. That’s because, in a large 2-D image, unlike a short 1-D sequence of words, there are so many candidates for attention: any given pixel might contain information that alters how other pixels should be interpreted.

By constraining our use of Transformers to individual columns of pixels and individual rays, we avoid this combinatorial explosion and can efficiently train on existing, smaller datasets.


#### **Semantic content**


The analogy between our task and the sequence-to-sequence NLP tasks is quite precise. Many languages share a common structure, which means that words often (though not always) occur in similar places in source texts and their translations. In the same way, pixels farther down an image column often (though not always) correspond to points closer to the camera along the associated ray. In both cases, Transformers can exploit this structure.

A significant hurdle in the computer vision case, however, is that single pixels along a ray contain little information. In a street scene, for example, a single black pixel on a ray could correspond to the asphalt, a tire, or the shoe of a pedestrian. To help resolve such ambiguities, we generate features that capture local context by preprocessing input images with a convolutional neural network (CNN).

CNNs step through an image one block of pixels at a time, looking in each block for distinctive patterns, such as color gradations with particular orientations. Low-level patterns found by the bottom layers of the CNN are aggregated by higher layers until they acquire semantic content: the curve of a dark tire, the parallel edges of a shiny signpost.

The inputs fed to our Transformer network, then, are not raw color values but pixel embeddings produced by a CNN. Those embeddings factor in information from pixels in other columns and include cues that can help determine depth along a ray: for instance, that a given pixel probably belongs to a car tire rather than a shoe.

We use a CNN pretrained on a standard image classification task, so it has already learned to recognize image features useful for computer vision tasks. But then we train our entire integrated model (CNN and Transformer) end to end, so that the CNN produces embeddings useful for image mapping.

In our experiments, we considered scenarios in which we were constructing maps from single images and from sequences of images (i.e., video). Our video-based maps were more accurate than those produced by a benchmark video-based model, and in general, they were more accurate than the maps our method produced from still images. But the margin of improvement was small, about 3% on average across all 14 classes.

![download.jpg](https://dev-media.amazoncloud.cn/b034bca01c5f4d88b0ec5f8c9bb96ca3_%E4%B8%8B%E8%BD%BD.jpg)

Maps constructed from still images (leftmost column) by our still-image method (“our spatial”), our video method (“our spatiotemporal”), and three benchmarks (VPN, PON, and STA-S). The first column of maps is the ground-truth bird’s-eye-view map.

An intriguing topic for future research is whether we can better leverage perspectival information in the video stream to extract greater improvements in map accuracy relative to still images. We have also improved on this work by using novel graph-based methods to integrate 3-D object detection into our mapping algorithms. We describe these results in a paper we’re presenting at this year’s CVPR, “[‘The pedestrian next to the lamppost’: Adaptive object graphs for better instantaneous mapping](https://www.amazon.science/publications/the-pedestrian-next-to-the-lamppost-adaptive-object-graphs-for-better-instantaneous-mapping)”.

ABOUT THE AUTHOR

#### **[Chris Russell](https://www.amazon.science/author/chris-russell)**

Chris Russell is a senior applied scientist with Amazon Web Services.
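
To make the per-column idea from the “Focused attention” section concrete, here is a minimal sketch in plain Python. It is not the model from the paper, which is a full Transformer trained end to end on CNN embeddings. It only shows single-head dot-product attention applied to one image column at a time, with random vectors standing in for pixel features. The names `translate_column`, `attention_pairs`, and `ray_queries`, and the toy sizes, are assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def translate_column(pixel_features, ray_queries):
    """Cross-attention from one image column to one map ray.

    pixel_features: H feature vectors, one per pixel in the column.
    ray_queries: R query vectors, one per position along the ray
        (hypothetical stand-ins; a trained model would learn these).
    Returns (ray_features, weights), where weights[r][h] is the
    attention that ray position r pays to pixel h.
    """
    dim = len(pixel_features[0])
    ray_features, weights = [], []
    for query in ray_queries:
        # Scaled dot-product score between this ray position and each pixel.
        scores = [
            sum(q * p for q, p in zip(query, pixel)) / math.sqrt(dim)
            for pixel in pixel_features
        ]
        attn = softmax(scores)
        # Weighted mix of the pixels in this column only.
        mixed = [
            sum(a * pixel[d] for a, pixel in zip(attn, pixel_features))
            for d in range(dim)
        ]
        ray_features.append(mixed)
        weights.append(attn)
    return ray_features, weights


def attention_pairs(height, width, ray_len):
    """Count (pixel, map cell) pairs scored with and without the column constraint."""
    per_column = width * (height * ray_len)
    unconstrained = (height * width) * (ray_len * width)
    return per_column, unconstrained


rng = random.Random(0)
W, H, R, D = 8, 6, 5, 4  # toy sizes, chosen only for illustration


def random_vectors(count):
    return [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(count)]


image_columns = [random_vectors(H) for _ in range(W)]
ray_queries = random_vectors(R)

# Each column is translated into its own ray, independently of the others.
results = [translate_column(column, ray_queries) for column in image_columns]
rays = [ray for ray, _ in results]
all_weights = [weights for _, weights in results]

per_column, unconstrained = attention_pairs(H, W, R)
print(len(rays), len(rays[0]), len(rays[0][0]))  # 8 5 4
print(per_column, unconstrained)  # 240 1920
```

In this toy setting, restricting attention to matching column and ray pairs scores `W` times fewer pixel-to-cell pairs than letting every pixel attend to every map cell (240 versus 1,920 here). That is one way to read the “combinatorial explosion” the article says the column constraint avoids.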
