Using computer vision to weed out product catalogue errors

{"value":"A product page in the Amazon Store will often include links to product variants, which differ by color, size, style, and so on. Sometimes, however, errors can creep into the product catalogue, resulting in links to unrelated products or duplicate listings, which can compromise customers’ shopping experiences.\n\nAt this year’s Winter Conference on Applications of Computer Vision (++[WACV](https://www.amazon.science/conferences-and-events/wacv-2022)++), we presented a new method for automatically identifying errors in product variation listings, which uses computer vision to determine whether the products depicted in different images are identical or different.\n\nWe frame the problem as a metric-learning problem, meaning that our machine learning model learns the function for measuring distances between vector representations of products in an embedding space. Embeddings of instances of the same products should be similar, while embeddings of different products should be dissimilar. Since this learned feature embedding typically generalizes well, the model can be applied to products unseen during training.\n\n![image.png](https://dev-media.amazoncloud.cn/a7dd895162b54e77a7d5cfaf1299cde7_image.png)\n\nTop left: a normal variation listing; top right: an erroneous listing (the image is of the wrong product); bottom row: duplicate variations (two separate detail pages for the same product)\n\nOur model is multimodal, in that its inputs include a product image and the product title. The only supervision signal is the overarching product descriptor that encompasses all the variants.\n\nIn experiments, we compared our model to a similarly multimodal benchmark model and found that it increased the area under the precision-recall curve (or PR-AUC, which evaluates the tradeoff between false positives and false negatives) by 5.2%.\n\n#### **The approach**\n\nThe purpose of using the product title is to guide the model toward learning more robust and relevant representations. For instance, the title provides context that helps the model focus on the relevant regions of the image, making it more robust to noisy backgrounds. It also helps resolve ambiguities that arise due to multiple objects appearing in the image.\n\n![image.png](https://dev-media.amazoncloud.cn/9a69165d34ac447292bc4141b8f18a8e_image.png)\n\nThe product title helps the model identify regions of interest in the product image.\n\n#### **The architecture**\n\nOur model has two branches, one global and one local. The global network takes the whole image as input, and based on the product title, it determines which portion of the image to focus on. That information is used to crop the input image, and the cropped image passes to the local branch.\n\nThe backbone of each branch is a convolutional neural network (CNN), a type of network commonly used in computer vision that applies a series of identical filters to portions of the image representation.\n\n![image.png](https://dev-media.amazoncloud.cn/559a812ea1bf4de1861f1c19d62a7075_image.png)\n\nThe MAPS (multimodal attention for product similarity) network architecture\n\nFeatures extracted by the CNN are augmented by a self-attention mechanism, to better capture spatial dependencies. The augmented features then pass to spatial and channel attention layers. The spatial attention — i.e., “where to attend\" — uses the title to attend over the relevant regions of the image. 
![image.png](https://dev-media.amazoncloud.cn/a7dd895162b54e77a7d5cfaf1299cde7_image.png)

Top left: a normal variation listing; top right: an erroneous listing (the image is of the wrong product); bottom row: duplicate variations (two separate detail pages for the same product)

Our model is multimodal, in that its inputs include a product image and the product title. The only supervision signal is the overarching product descriptor that encompasses all the variants.

In experiments, we compared our model to a similarly multimodal benchmark model and found that it increased the area under the precision-recall curve (PR-AUC, which summarizes the tradeoff between precision and recall across decision thresholds) by 5.2%.

#### **The approach**

The purpose of using the product title is to guide the model toward learning more robust and relevant representations. For instance, the title provides context that helps the model focus on the relevant regions of the image, making it more robust to noisy backgrounds. It also helps resolve ambiguities that arise when multiple objects appear in the image.

![image.png](https://dev-media.amazoncloud.cn/9a69165d34ac447292bc4141b8f18a8e_image.png)

The product title helps the model identify regions of interest in the product image.

#### **The architecture**

Our model has two branches, one global and one local. The global branch takes the whole image as input and, based on the product title, determines which portion of the image to focus on. That information is used to crop the input image, and the cropped image passes to the local branch.

The backbone of each branch is a convolutional neural network (CNN), a type of network commonly used in computer vision that applies a series of identical filters to portions of the image representation.

![image.png](https://dev-media.amazoncloud.cn/559a812ea1bf4de1861f1c19d62a7075_image.png)

The MAPS (multimodal attention for product similarity) network architecture

Features extracted by the CNN are augmented by a self-attention mechanism to better capture spatial dependencies. The augmented features then pass to spatial and channel attention layers. The spatial attention — i.e., “where to attend” — uses the title to attend over the relevant regions of the image. The channel attention — i.e., “what to attend” — emphasizes the relevant features of the image representation.

Both the spatial attention and the channel attention are based on a self-attentive embedding of the title information — that is, an embedding that weighs each word of the title in light of the other words.
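To make the “where” and “what” attention concrete, here is a minimal sketch of a title-conditioned attention block over CNN feature maps. The layer choices, dimensions, and class name are illustrative assumptions, and the title embedding is assumed to be produced elsewhere (in the paper, by a self-attentive encoder over the title’s words); the actual MAPS architecture is more elaborate.

```python
import torch
import torch.nn as nn

class TitleConditionedAttention(nn.Module):
    """Illustrative spatial ("where to attend") and channel ("what to attend")
    attention over CNN features, conditioned on a product-title embedding."""

    def __init__(self, feat_channels=512, title_dim=256):
        super().__init__()
        self.title_proj = nn.Linear(title_dim, feat_channels)
        self.spatial_score = nn.Conv2d(feat_channels, 1, kernel_size=1)
        self.channel_gate = nn.Sequential(nn.Linear(title_dim, feat_channels), nn.Sigmoid())

    def forward(self, feats, title_emb):
        # feats: (B, C, H, W) CNN feature map; title_emb: (B, title_dim)
        b, c, h, w = feats.shape
        t = self.title_proj(title_emb).view(b, c, 1, 1)
        # Spatial attention: score each location using title-image interaction.
        scores = self.spatial_score(feats * t).view(b, 1, -1)
        spatial = torch.softmax(scores, dim=-1).view(b, 1, h, w)
        attended = feats * spatial
        # Channel attention: gate each feature channel from the title embedding.
        channel = self.channel_gate(title_emb).view(b, c, 1, 1)
        return attended * channel

# Example shapes: TitleConditionedAttention()(torch.randn(2, 512, 14, 14), torch.randn(2, 256))
```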
We train using both positive and negative examples. For positive examples, we simply pair instances of the same overarching product descriptor.

![image.png](https://dev-media.amazoncloud.cn/7940649466864a9bbcfb7bef4f7986e7_image.png)

Examples of successful predictions by the model. Left column: Same products predicted as same. Right column: Different products predicted as different.

In order for the model to learn efficiently, the negative examples have to be difficult: teaching the model to discriminate between, say, a shoe and a garden rake won’t help it distinguish between similar types of shoes. So for the negative examples, we pair products in the same subcategories. This results in a significant improvement in performance.
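This pairing scheme can be sketched in a few lines; the field names (`product_id`, `subcategory`) and the sampling policy below are illustrative assumptions rather than the paper’s exact procedure.

```python
import random
from collections import defaultdict

def build_training_pairs(listings, negatives_per_item=1, seed=0):
    """Build (listing_a, listing_b, label) pairs: positives share a product
    descriptor; hard negatives are different products from the same subcategory."""
    rng = random.Random(seed)
    by_product, by_subcategory = defaultdict(list), defaultdict(list)
    for item in listings:
        by_product[item["product_id"]].append(item)
        by_subcategory[item["subcategory"]].append(item)

    pairs = []
    # Positive pairs: two instances of the same overarching product.
    for group in by_product.values():
        pairs.extend((a, b, 1) for a, b in zip(group, group[1:]))
    # Hard negative pairs: same subcategory, different product.
    for item in listings:
        candidates = [o for o in by_subcategory[item["subcategory"]]
                      if o["product_id"] != item["product_id"]]
        for other in rng.sample(candidates, min(negatives_per_item, len(candidates))):
            pairs.append((item, other, 0))
    return pairs
```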
To test our approach, we created a dataset consisting of images and titles from three different product categories. As baselines in our experiments, we used image-only models and a recent multimodal approach that uses product attributes to attend over images.

Compared to the image-only models, our approach yields an increase in PR-AUC of up to 17%. Compared to the multimodal benchmark, the improvement is 5.2%.
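For reference, PR-AUC for this kind of pairwise same/different evaluation can be computed with scikit-learn as sketched below (the scores are toy numbers, not results from the paper).

```python
from sklearn.metrics import auc, precision_recall_curve

def pr_auc(labels, scores):
    """Area under the precision-recall curve for pairwise predictions:
    labels are 1 for same-product pairs, 0 otherwise; scores are model
    similarity scores (higher means more likely to be the same product)."""
    precision, recall, _ = precision_recall_curve(labels, scores)
    return auc(recall, precision)

print(pr_auc([1, 0, 1, 0, 1], [0.9, 0.2, 0.7, 0.6, 0.8]))  # toy example
```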
ABOUT THE AUTHOR

#### **[Nilotpal Das](https://www.amazon.science/author/nilotpal-das)**

Nilotpal Das is an applied scientist at Amazon.