{"value":"When looking for cooking ideas, people often find inspiration on social media and in restaurants, saving screenshots or taking pictures of food they liked. At Amazon, we have built technology that lets people use those images to find the corresponding cooking recipes.\n\nAt the 2021 Conference of Computer Vision and Pattern Recognition ([CVPR](https://www.amazon.science/conferences-and-events/cvpr-2021)), my colleagues and I are presenting [a new method](https://www.amazon.science/publications/revamping-cross-modal-recipe-retrieval-with-hierarchical-transformers-and-self-supervised-learning) for doing cross-modal image-to-recipe retrieval that achieves state-of-the-art performance by using Transformer-based architectures and self-supervised learning.\n\nSelf-supervised learning is a paradigm in which automatic manipulation of unannotated data provides supplemental training examples for a machine learning model. In our case, in addition to supervised training using images annotated with the corresponding recipes, we do self-supervised learning using recipe data alone.\n\nOur method uses two separate encoder functions, one for the recipe text and one for the image (left and right, respectively, in the figure below). These functions extract representations that will be used for indexing and search at inference time. To encode recipe components, we use Transformer-based architectures, which are hierarchical for multi-sentence inputs (such as ingredients and instructions) and non-hierarchical for single-sentence inputs (recipe titles). For image inputs, we use the well-established image encoders ResNet and Vision Transformers.\n\n![image.png](https://dev-media.amazoncloud.cn/d4e42c807f8a4eeaa9bf1ea839cf8612_image.png)\n\nThe researchers train their recipe retrieval model using two different loss functions, a self-supervised loss function, Lrec, and a supervised loss function, Lpair, which measures the distance between representations of recipe texts and food images in a shared space.\n\nOur model is trained with two loss functions, Lpair and Lrec (see figure above). The supervised loss, Lpair, is computed between representations extracted from the recipe (left) and the image (right). This loss will ensure that text and image representations are close to each other in a common high-dimensional space if they belong to the same training example (e.g., the image of a chocolate chip cookie and its corresponding recipe text) and far apart otherwise (e.g., the same chocolate chip cookie image and the text from a lasagna recipe).\n\nOur novel self-supervised loss, Lrec, is computed between the representations of individual recipe components. This loss ensures that representations of recipe components (e.g., title and ingredients) will be close to each other in the representation space if they belong to the same recipe and far apart otherwise (see figure below). Intuitively, the title of a mac and cheese recipe and the names of its ingredients (macaroni, onion, parmesan cheese, etc.) 
Our novel self-supervised loss, Lrec, is computed between the representations of individual recipe components. This loss ensures that representations of recipe components (e.g., title and ingredients) are close to each other in the representation space if they belong to the same recipe and far apart otherwise (see figure below). Intuitively, the title of a mac and cheese recipe and the names of its ingredients (macaroni, onion, parmesan cheese, etc.) share semantic cues that can enable a model to learn better recipe representations.

![image.png](https://dev-media.amazoncloud.cn/c5c3ad3247b24a439f1d36c3a5c42dbf_image.png)

During training, matched images and recipes serve as positive examples, while mismatched images and recipes serve as negative examples.

Since this loss does not require an image as input, it can be computed for training examples without images, which are very common in web recipe data; in practice, 66% of our training set is composed of text-only recipe samples. Our experiments show that both the new self-supervised loss term (even when applied only to image-recipe training pairs) and the additional training data contribute to an improvement in retrieval performance.

![image.png](https://dev-media.amazoncloud.cn/50397d991c0a48c795881b629172a487_image.png)

The researchers’ self-supervised loss function pushes together representations of components of the same recipe and pulls apart representations of components from different recipes.
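To make the role of the two terms concrete, here is a hedged sketch of what a training step on such a mixed batch could look like. It is an illustration under our own assumptions, not the paper's training code: `recipe_encoder`, `loss_pair`, `loss_rec`, and the weighting factor `alpha` are hypothetical stand-ins for the corresponding pieces of the model.

```python
import torch

def training_step(recipe_components, image_emb, has_image,
                  recipe_encoder, loss_pair, loss_rec, alpha=1.0):
    """One illustrative training step on a batch that mixes image-recipe
    pairs with text-only recipes.

    recipe_components: batched title / ingredient / instruction inputs.
    image_emb: (batch, dim) image embeddings; rows without an image are unused.
    has_image: (batch,) boolean mask marking recipes that have a paired image.
    """
    # Encode the recipe: a fused embedding for retrieval plus
    # per-component embeddings (title, ingredients, instructions).
    recipe_emb, component_embs = recipe_encoder(recipe_components)

    # Self-supervised term: uses recipe text only, so every sample in the
    # batch contributes, including the text-only ones.
    l_rec = loss_rec(component_embs)

    # Supervised term: computed only for recipes that come with an image.
    if has_image.any():
        l_pair = loss_pair(recipe_emb[has_image], image_emb[has_image])
    else:
        l_pair = torch.zeros((), device=recipe_emb.device)

    return l_pair + alpha * l_rec
```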
In our experiments, we performed cross-modal retrieval in both directions: finding recipes that match images and images that match recipes. Our method demonstrated state-of-the-art performance on the Recipe1M dataset, a common benchmark in the field. In the image-to-recipe retrieval task, our method achieved a Recall@10 of 92.9% when searching a recipe database of 1,000 elements. This means that given a database of 1,000 recipes and 1,000 food image queries, our method is able to find the correct recipe within the top 10 retrieved results for 92.9% of the image queries.
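For readers who want to see exactly what that metric measures, here is a minimal sketch of how Recall@K can be computed from precomputed image and recipe embeddings; it assumes that query i's correct recipe sits at index i of the database, which is how matched pairs are typically arranged.

```python
import numpy as np

def recall_at_k(image_emb, recipe_emb, k=10):
    """Fraction of image queries whose matched recipe appears among the
    top-k retrieved recipes. Row i of each array is assumed to be a
    matched image-recipe pair, and embeddings are L2-normalized.
    """
    sim = image_emb @ recipe_emb.T                 # (queries, database)
    # Rank recipes for each image query by descending similarity.
    top_k = np.argsort(-sim, axis=1)[:, :k]
    hits = (top_k == np.arange(len(image_emb))[:, None]).any(axis=1)
    return hits.mean()
```

With a database of 1,000 recipes and k=10, a value of 0.929 corresponds to the 92.9% figure reported above.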
We show some qualitative results in the figure below, which reveal that our method is able to encode semantics in image and recipe representations and can find recipes that match the query at a fine-grained ingredient level (e.g., “bread”, “garlic”, and “loaf” in row one, or “salmon” and “asparagus” in row six).

![image.png](https://dev-media.amazoncloud.cn/c40ad5509e3e4f739cbff94fe8b977df_image.png)

Results for image-to-recipe (odd rows) and recipe-to-image (even rows) retrieval modes. The query image/recipe is highlighted in blue, followed by the top five retrieved items. The correct item is highlighted in green. Recipes are displayed as word clouds (the word size being proportional to the word frequency in the recipe).

Check out [our paper](https://www.amazon.science/publications/revamping-cross-modal-recipe-retrieval-with-hierarchical-transformers-and-self-supervised-learning) to learn the details. Our code and model weights are also [publicly available](https://github.com/amzn/image-to-recipe-transformers).