Build taxonomy-based contextual targeting using Amazon Media Intelligence and Hugging Face BERT

海外精选
海外精选的内容汇集了全球优质的亚马逊云科技相关技术内容。同时,内容中提到的“AWS” 是 “Amazon Web Services” 的缩写,在此网站不作为商标展示。
0
0
{"value":"As new data privacy regulations like GDPR (General Data Protection Regulation, 2017) have come into effect, customers are under increased pressure to monetize media assets while abiding by the new rules. Monetizing media while respecting privacy regulations requires the ability to automatically extract granular metadata from assets like text, images, video, and audio files at internet scale. It also requires a scalable way to map media assets to industry taxonomies that facilitate discovery and monetization of content. This use case is particularly significant for the advertising industry as data privacy rules cause a shift from behavioral targeting using third-party cookies.\n\n[Third-party cookies](https://cookie-script.com/all-you-need-to-know-about-third-party-cookies.html) help enable personalized ads for web users, and allow advertisers to reach their intended audience. A traditional solution to serve ads without third-party cookies is contextual advertising, which places ads on webpages based on the content published on the pages. However, contextual advertising poses the challenge of extracting context from media assets at scale, and likewise using that context to monetize the assets.\n\nIn this post, we discuss how you can build a machine learning (ML) solution that we call Contextual Intelligence Taxonomy Mapper (CITM) to extract context from digital content and map it to standard taxonomies in order to generate value. Although we apply this solution to contextual advertising, you can use it to solve other use cases. For example, education technology companies can use it to map their content to industry taxonomies in order to facilitate adaptive learning that delivers personalized learning experiences based on students’ individual needs.\n\n### **Solution overview**\n\nThe solution comprises two components: [AWS Media Intelligence](https://aws.amazon.com/machine-learning/ml-use-cases/media-intelligence/) (AWS MI) capabilities for context extraction from content on web pages, and CITM for intelligent mapping of content to an industry taxonomy. You can access the solution’s [code repository](https://github.com/aws-samples/contextual-ad-intelligence-on-aws/tree/main/contextual-intelligence-taxonomy-mapper) for a detailed view of how we implement its components.\n\n#### **AWS Media Intelligence**\n\nAWS MI capabilities enable automatic extraction of metadata that provides contextual understanding of a webpage’s content. You can combine ML techniques like computer vision, speech to text, and natural language processing (NLP) to automatically generate metadata from text, videos, images, and audio files for use in downstream processing. Managed AI services such as [Amazon Rekognition](https://aws.amazon.com/rekognition/), [Amazon Transcribe](https://aws.amazon.com/transcribe/), [Amazon Comprehend](https://aws.amazon.com/comprehend/), and [Amazon Textract](https://aws.amazon.com/textract/) make these ML techniques accessible using API calls. This eliminates the overhead needed to train and build ML models from scratch. In this post, you see how using Amazon Comprehend and Amazon Rekognition for media intelligence enables metadata extraction at scale.\n\n#### **Contextual Intelligence Taxonomy Mapper**\n\nAfter you extract metadata from media content, you need a way to map that metadata to an industry taxonomy in order to facilitate contextual targeting. To do this, you build Contextual Intelligence Taxonomy Mapper (CITM), which is powered by a BERT sentence transformer from Hugging Face.\n\nThe BERT sentence transformer enables CITM to categorize web content with contextually related keywords. For example, it can categorize a web article about healthy living with keywords from the industry taxonomy, such as “Healthy Cooking and Eating,” “Running and Jogging,” and more, based on the text written and the images used within the article. CITM also provides the ability to choose the mapped taxonomy terms to use for your ad bidding process based on your criteria.\n\nThe following diagram illustrates the conceptual view of the architecture with CITM.\n\n![image.png](https://dev-media.amazoncloud.cn/a5464b60d117432681b0aabc58018a77_image.png)\n\n#### **The IAB (Interactive Advertising Bureau) Content Taxonomy**\n\nFor this post, we use the [IAB Tech Lab’s Content Taxonomy](https://iabtechlab.com/standards/content-taxonomy/) as the industry standard taxonomy for the contextual advertising use case. By design, the IAB taxonomy helps content creators more accurately describe their content, and it provides a common language for all parties in the programmatic advertising process. The use of a common terminology is crucial because the selection of ads for a webpage a user visits has to happen within milliseconds. The IAB taxonomy serves as a standardized way to categorize content from various sources while also being an industry protocol that real-time bidding platforms use for ad selection. It has a hierarchical structure, which provides granularity of taxonomy terms and enhanced context for advertisers.\n\n#### **Solution workflow**\n\nThe following diagram illustrates the solution workflow.\n\n![image.png](https://dev-media.amazoncloud.cn/70f9c3cfbdd24c86a68ebca64ba6e0be_image.png)\n\nThe steps are as follows:\n\n1. [Amazon Simple Storage Service](http://aws.amazon.com/s3) (Amazon S3) stores the IAB content taxonomy and extracted web content.\n2. Amazon Comprehend performs topic modeling to extract common themes from the collection of articles.\n3. The Amazon Rekognition [object label API](https://docs.aws.amazon.com/rekognition/latest/dg/labels.html) detects labels in images.\n4. CITM maps content to a standard taxonomy.\n5. Optionally, you can store content to taxonomy mapping in a metadata store.\n\n\nIn the following sections, we walk through each step in detail.\n\n### **Amazon S3 stores the IAB content taxonomy and extracted web content**\n\nWe store extracted text and images from a collection of web articles in an S3 bucket. We also store the IAB content taxonomy. As a first step, we concatenate different tiers on the taxonomy to create combined taxonomy terms. This approach helps maintain the taxonomy’s hierarchical structure when the BERT sentence transformer creates embeddings for each keyword. See the following code:\n\n```\ndef prepare_taxonomy(taxonomy_df):\n \n \"\"\"\n Concatenate IAB Tech Lab content taxonomy tiers and prepare keywords for BERT embedding. \n Use this function as-is if using the IAB Content Taxonomy\n \n Parameters (input):\n ----------\n taxonomy_df : Content taxonomy dataframe\n\n Returns (output):\n -------\n df_clean : Content taxonomy with tiers in the taxonomy concatenated\n keyword_list: List of concatenated content taxonomy keywords\n ids: List of ids for the content taxonomy keywords\n \"\"\"\n \n df = taxonomy_df[['Unique ID ','Parent','Name','Tier 1','Tier 2','Tier 3']] \n df_str = df.astype({\"Unique ID \": 'str', \"Parent\": 'str', \"Tier 1\": 'str', \"Tier 2\": 'str', \"Tier 3\": 'str'})\n df_clean = df_str.replace('nan','')\n \n #create a column that concatenates all tiers for each taxonomy keyword\n df_clean['combined']=df_clean[df_clean.columns[2:6]].apply(lambda x: ' '.join(x.dropna().astype(str)),axis=1)\n \n #turn taxonomy keyords to list of strings a prep for encoding with BERT sentence transformer\n keyword_list=df_clean['combined'].to_list()\n \n #get list of taxonomy ids\n ids = df_clean['Unique ID '].to_list() \n \n return df_clean, keyword_list, ids\n\ntaxonomy_df, taxonomy_terms, taxonomy_ids = prepare_taxonomy(read_taxonomy)\n\n```\n\nThe following diagram illustrates the IAB context taxonomy with combined tiers.\n\n![image.png](https://dev-media.amazoncloud.cn/ad78d2402dab45a092add935541a7400_image.png)\n\n\n### **Amazon Comprehend performs topic modeling to extract common themes from the collection of articles**\n\nWith the Amazon Comprehend topic modeling API, you analyze all the article texts using the Latent Dirichlet Allocation (LDA) model. The model examines each article in the corpus and groups keywords into the same topic based on the context and frequency in which they appear across the entire collection of articles. To ensure the LDA model detects highly coherent topics, you perform a preprocessing step prior to calling the Amazon Comprehend API. You can use the [gensim library’s](https://radimrehurek.com/gensim/) CoherenceModel to determine the optimal number of topics to detect from the collection of articles or text files. See the following code:\n\n```\ndef compute_coherence_scores(dictionary, corpus, texts, limit, start=2, step=3):\n \"\"\"\n Compute coherence scores for various number of topics for your topic model. \n Adjust the parameters below based on your data\n\n Parameters (input):\n ----------\n dictionary : Gensim dictionary created earlier from input texts\n corpus : Gensim corpus created earlier from input texts\n texts : List of input texts\n limit : The maximum number of topics to test. Amazon Comprehend can detect up to 100 topics in a collection\n\n Returns (output):\n -------\n models : List of LDA topic models\n coherence_scores : Coherence values corresponding to the LDA model with respective number of topics\n \"\"\"\n coherence_scores = []\n models = []\n for num_topics in range(start, limit, step):\n model = gensim.models.LdaMulticore(corpus=corpus, num_topics=num_topics, id2word=id2word)\n models.append(model)\n coherencemodel = CoherenceModel(model=model, texts=corpus_words, dictionary=id2word, coherence='c_v')\n coherence_scores.append(coherencemodel.get_coherence())\n\n return models, coherence_scores\n\nmodels, coherence_scores = compute_coherence_scores(dictionary=id2word, corpus=corpus_tdf, texts=corpus_words, start=2, limit=100, step=3)\n```\n\nAfter you get the optimal number of topics, you use that value for the Amazon Comprehend topic modeling job. Providing different values for the NumberOfTopics parameter in the Amazon Comprehend [StartTopicsDetectionJob operation](https://docs.aws.amazon.com/comprehend/latest/dg/API_StartTopicsDetectionJob.html#API_StartTopicsDetectionJob_RequestSyntax) results in a variation in the distribution of keywords placed in each topic group. An optimized value for the NumberOfTopics parameter represents the number of topics that provide the most coherent grouping of keywords with higher contextual relevance. You can store the topic modeling output from Amazon Comprehend in its raw format in Amazon S3.\n\n### **The Amazon Rekognition object label API detects labels in images**\n\nYou analyze each image extracted from all webpages using the [Amazon Rekognition DetectLabels operation](https://docs.aws.amazon.com/rekognition/latest/dg/labels-detect-labels-image.html). For each image, the operation provides a JSON response with all labels detected within the image, coupled with a confidence score for each. For our use case, we arbitrarily select a confidence score of 60% or higher as the threshold for object labels to use in the next step. You store object labels in their raw format in Amazon S3. See the following code:\n\n```\n\"\"\"\nCreate a function to extract object labels from a given image using Amazon Rekognition\n\"\"\"\n\ndef get_image_labels(image_loc):\n labels = []\n with fs.open(image_loc, \"rb\") as im:\n response = rekognition_client.detect_labels(Image={\"Bytes\": im.read()})\n \n for label in response[\"Labels\"]:\n if label[\"Confidence\"] >= 60: #change to desired confidence score threshold, value between [0,100]:\n object_label = label[\"Name\"]\n labels.append(object_label)\n return labels\n\n```\n### **CITM maps content to a standard taxonomy**\n\n\nCITM compares extracted content metadata (topics from text and labels from images) with keywords on the IAB taxonomy, and then maps the content metadata to keywords from the taxonomy that are semantically related. For this task, CITM completes the following three steps:\n\n1. Generate neural embeddings for the content taxonomy, topic keywords, and image labels using Hugging Face’s BERT sentence transformer. We access the sentence transformer model from [Amazon SageMaker](https://aws.amazon.com/sagemaker/). In this post, we use the [paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2) model, which maps keywords and labels to a 384 dimensional dense vector space.\n2. Compute the cosine similarity score between taxonomy keywords and topic keywords using their embeddings. It also computes cosine similarity between the taxonomy keywords and the image object labels. We use cosine similarity as a scoring mechanism to find semantically similar matches between the content metadata and the taxonomy. See the following code:\n\n\n```\ndef compute_similarity(entity_embeddings, entity_terms, taxonomy_embeddings, taxonomy_terms):\n \"\"\"\n Compute cosine scores between entity embeddings and taxonomy embeddings\n \n Parameters (input):\n ----------\n entity_embeddings : Embeddings for either topic keywords from Amazon Comprehend or image labels from Amazon Rekognition\n entity_terms : Terms for topic keywords or image labels\n taxonomy_embeddings : Embeddings for the content taxonomy\n taxonomy_terms : Terms for the taxonomy keywords\n\n Returns (output):\n -------\n mapping_df : Dataframe that matches each entity keyword to each taxonomy keyword and their cosine similarity score\n \"\"\"\n \n #calculate cosine score, pairing each entity embedding with each taxonomy keyword embedding\n cosine_scores = util.pytorch_cos_sim(entity_embeddings, taxonomy_embeddings)\n pairs = []\n for i in range(len(cosine_scores)-1):\n for j in range(0, cosine_scores.shape[1]):\n pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})\n \n #Sort cosine similarity scores in decreasing order\n pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)\n rows = []\n for pair in pairs:\n i, j = pair['index']\n rows.append([entity_terms[i], taxonomy_terms[j], pair['score']])\n \n #move sorted values to a dataframe\n mapping_df= pd.DataFrame(rows, columns=[\"term\", \"taxonomy_keyword\",\"cosine_similarity\"])\n mapping_df['cosine_similarity'] = mapping_df['cosine_similarity'].astype('float')\n mapping_df= mapping_df.sort_values(by=['term','cosine_similarity'], ascending=False)\n drop_dups= mapping_df.drop_duplicates(subset=['term'], keep='first')\n mapping_df = drop_dups.sort_values(by=['cosine_similarity'], ascending=False).reset_index(drop=True)\n return mapping_df\n \n#compute cosine_similairty score between topic keywords and content taxonomy keywords using BERT embeddings \ntext_taxonomy_mapping=compute_similarity(keyword_embeddings, topic_keywords, taxonomy_embeddings, taxonomy_terms)\n\n```\n\n3. Identify pairings with similarity scores that are above a user-defined threshold and use them to map the content to semantically related keywords on the content taxonomy. In our test, we select all keywords from pairings that have a cosine similarity score of 0.5 or higher. See the following code:\n```\n#merge text and image keywords mapped to content taxonomy\nrtb_keywords=pd.concat([text_taxonomy_mapping[[\"term\",\"taxonomy_keyword\",\"cosine_similarity\"]],image_taxonomy_mapping]).sort_values(by='cosine_similarity',ascending=False).reset_index(drop=True)\n\n#select keywords with a cosine_similarity score greater than your desired threshold ( the value should be from 0 to 1)\nrtb_keywords[rtb_keywords[\"cosine_similarity\"]> 50] # change to desired threshold for cosine score, value between [0,100]:\n\n```\n\nA common challenge when working with internet-scale language representation (such as in this use case) is that you need a model that can fit most of the content—in this case, words in the English language. Hugging Face’s BERT transformer has been pre-trained using a large corpus of Wikipedia posts in the English language to represent the semantic meaning of words in relation to one another. You fine-tune the pre-trained model using your specific dataset of topic keywords, image labels, and taxonomy keywords. When you place all embeddings in the same feature space and visualize them, you see that BERT logically represents semantic similarity between terms.\n\nThe following example visualizes IAB content taxonomy keywords for the class Automotive represented as vectors using BERT. BERT places Automotive keywords from the taxonomy close to semantically similar terms.\n\n![image.png](https://dev-media.amazoncloud.cn/e894c0b81b734d5abd526f4998052c3e_image.png)\n\n\nThe feature vectors allow CITM to compare the metadata labels and taxonomy keywords in the same feature space. In this feature space, CITM calculates cosine similarity between each feature vector for taxonomy keywords and each feature vector for topic keywords. In a separate step, CITM compares taxonomy feature vectors and feature vectors for image labels. Pairings with cosine scores closest to 1 are identified as semantically similar. Note that a pairing can either be a topic keyword and a taxonomy keyword, or an object label and a taxonomy keyword.\n\nThe following screenshot shows example pairings of topic keywords and taxonomy keywords using cosine similarity calculated with BERT embeddings.\n\n![image.png](https://dev-media.amazoncloud.cn/d22f26888e8544c0ac69cf7a0041a4bc_image.png)\n\n\nTo map content to taxonomy keywords, CITM selects keywords from pairings with cosine scores that meet a user-defined threshold. These are the keywords that will be used on real-time bidding platforms to select ads for the webpage’s inventory. The result is a rich mapping of online content to the taxonomy.\n\n### **Optionally store content to taxonomy mapping in a metadata store**\n\nAfter you identify contextually similar taxonomy terms from CITM, you need a way for low-latency APIs to access this information. In programmatic bidding for advertisements, low response time and high concurrency play an important role in monetizing the content. The schema for the data store needs to be flexible to accommodate additional metadata when needed to enrich bid requests. [Amazon DynamoDB](https://aws.amazon.com/dynamodb/) can match the data access patterns and operational requirements for such a service.\n\n### **Conclusion**\n\nIn this post, you learned how to build a taxonomy-based contextual targeting solution using Contextual Intelligence Taxonomy Mapper (CITM). You learned how to use Amazon Comprehend and Amazon Rekognition to extract granular metadata from your media assets. Then, using CITM you mapped the assets to an industry standard taxonomy to facilitate programmatic ad bidding for contextually related ads. You can apply this framework to other use cases that require use of a standard taxonomy to enhance the value of existing media assets.\n\nTo experiment with CITM, you can access its [code repository](https://github.com/aws-samples/contextual-ad-intelligence-on-aws/tree/main/contextual-intelligence-taxonomy-mapper) and use it with a text and image dataset of your choice.\n\nWe recommend learning more about the solution components introduced in this post. Discover more about [AWS Media Intelligence](https://aws.amazon.com/machine-learning/ml-use-cases/media-intelligence/) to extract metadata from media content. Also, learn more about how to use [Hugging Face models for NLP using Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/hugging-face.html).\n\n#### **About the Authors**\n\n![image.png](https://dev-media.amazoncloud.cn/05b301010d2c4279ac90c4119f11c648_image.png)\n\n**Aramide Kehinde** is a Sr. Partner Solution Architect at AWS in Machine Learning and AI. Her career journey has spanned the areas of Business Intelligence and Advanced Analytics across multiple industries. She works to enable partners to build solutions with AWS AI/ML services that serve customers needs for innovation. She also enjoys building the intersection of AI and creative arenas and spending time with her family.\n\n![image.png](https://dev-media.amazoncloud.cn/38fd2f4f873245d2804a12bea0dceb45_image.png)\n\n**Anuj Gupta** is a Principal Solutions Architect working with hyper-growth companies on their cloud native journey. He is passionate about using technology to solve challenging problems and has worked with customers to build highly distributed and low latency applications. He contributes to open-source Serverless and Machine Learning solutions. Outside of work, he loves traveling with his family and writing poems and philosophical blogs.","render":"<p>As new data privacy regulations like GDPR (General Data Protection Regulation, 2017) have come into effect, customers are under increased pressure to monetize media assets while abiding by the new rules. Monetizing media while respecting privacy regulations requires the ability to automatically extract granular metadata from assets like text, images, video, and audio files at internet scale. It also requires a scalable way to map media assets to industry taxonomies that facilitate discovery and monetization of content. This use case is particularly significant for the advertising industry as data privacy rules cause a shift from behavioral targeting using third-party cookies.</p>\n<p><a href=\"https://cookie-script.com/all-you-need-to-know-about-third-party-cookies.html\" target=\"_blank\">Third-party cookies</a> help enable personalized ads for web users, and allow advertisers to reach their intended audience. A traditional solution to serve ads without third-party cookies is contextual advertising, which places ads on webpages based on the content published on the pages. However, contextual advertising poses the challenge of extracting context from media assets at scale, and likewise using that context to monetize the assets.</p>\n<p>In this post, we discuss how you can build a machine learning (ML) solution that we call Contextual Intelligence Taxonomy Mapper (CITM) to extract context from digital content and map it to standard taxonomies in order to generate value. Although we apply this solution to contextual advertising, you can use it to solve other use cases. For example, education technology companies can use it to map their content to industry taxonomies in order to facilitate adaptive learning that delivers personalized learning experiences based on students’ individual needs.</p>\n<h3><a id=\"Solution_overview_6\"></a><strong>Solution overview</strong></h3>\n<p>The solution comprises two components: <a href=\"https://aws.amazon.com/machine-learning/ml-use-cases/media-intelligence/\" target=\"_blank\">AWS Media Intelligence</a> (AWS MI) capabilities for context extraction from content on web pages, and CITM for intelligent mapping of content to an industry taxonomy. You can access the solution’s <a href=\"https://github.com/aws-samples/contextual-ad-intelligence-on-aws/tree/main/contextual-intelligence-taxonomy-mapper\" target=\"_blank\">code repository</a> for a detailed view of how we implement its components.</p>\n<h4><a id=\"AWS_Media_Intelligence_10\"></a><strong>AWS Media Intelligence</strong></h4>\n<p>AWS MI capabilities enable automatic extraction of metadata that provides contextual understanding of a webpage’s content. You can combine ML techniques like computer vision, speech to text, and natural language processing (NLP) to automatically generate metadata from text, videos, images, and audio files for use in downstream processing. Managed AI services such as <a href=\"https://aws.amazon.com/rekognition/\" target=\"_blank\">Amazon Rekognition</a>, <a href=\"https://aws.amazon.com/transcribe/\" target=\"_blank\">Amazon Transcribe</a>, <a href=\"https://aws.amazon.com/comprehend/\" target=\"_blank\">Amazon Comprehend</a>, and <a href=\"https://aws.amazon.com/textract/\" target=\"_blank\">Amazon Textract</a> make these ML techniques accessible using API calls. This eliminates the overhead needed to train and build ML models from scratch. In this post, you see how using Amazon Comprehend and Amazon Rekognition for media intelligence enables metadata extraction at scale.</p>\n<h4><a id=\"Contextual_Intelligence_Taxonomy_Mapper_14\"></a><strong>Contextual Intelligence Taxonomy Mapper</strong></h4>\n<p>After you extract metadata from media content, you need a way to map that metadata to an industry taxonomy in order to facilitate contextual targeting. To do this, you build Contextual Intelligence Taxonomy Mapper (CITM), which is powered by a BERT sentence transformer from Hugging Face.</p>\n<p>The BERT sentence transformer enables CITM to categorize web content with contextually related keywords. For example, it can categorize a web article about healthy living with keywords from the industry taxonomy, such as “Healthy Cooking and Eating,” “Running and Jogging,” and more, based on the text written and the images used within the article. CITM also provides the ability to choose the mapped taxonomy terms to use for your ad bidding process based on your criteria.</p>\n<p>The following diagram illustrates the conceptual view of the architecture with CITM.</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/a5464b60d117432681b0aabc58018a77_image.png\" alt=\"image.png\" /></p>\n<h4><a id=\"The_IAB_Interactive_Advertising_Bureau_Content_Taxonomy_24\"></a><strong>The IAB (Interactive Advertising Bureau) Content Taxonomy</strong></h4>\n<p>For this post, we use the <a href=\"https://iabtechlab.com/standards/content-taxonomy/\" target=\"_blank\">IAB Tech Lab’s Content Taxonomy</a> as the industry standard taxonomy for the contextual advertising use case. By design, the IAB taxonomy helps content creators more accurately describe their content, and it provides a common language for all parties in the programmatic advertising process. The use of a common terminology is crucial because the selection of ads for a webpage a user visits has to happen within milliseconds. The IAB taxonomy serves as a standardized way to categorize content from various sources while also being an industry protocol that real-time bidding platforms use for ad selection. It has a hierarchical structure, which provides granularity of taxonomy terms and enhanced context for advertisers.</p>\n<h4><a id=\"Solution_workflow_28\"></a><strong>Solution workflow</strong></h4>\n<p>The following diagram illustrates the solution workflow.</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/70f9c3cfbdd24c86a68ebca64ba6e0be_image.png\" alt=\"image.png\" /></p>\n<p>The steps are as follows:</p>\n<ol>\n<li><a href=\"http://aws.amazon.com/s3\" target=\"_blank\">Amazon Simple Storage Service</a> (Amazon S3) stores the IAB content taxonomy and extracted web content.</li>\n<li>Amazon Comprehend performs topic modeling to extract common themes from the collection of articles.</li>\n<li>The Amazon Rekognition <a href=\"https://docs.aws.amazon.com/rekognition/latest/dg/labels.html\" target=\"_blank\">object label API</a> detects labels in images.</li>\n<li>CITM maps content to a standard taxonomy.</li>\n<li>Optionally, you can store content to taxonomy mapping in a metadata store.</li>\n</ol>\n<p>In the following sections, we walk through each step in detail.</p>\n<h3><a id=\"Amazon_S3_stores_the_IAB_content_taxonomy_and_extracted_web_content_45\"></a><strong>Amazon S3 stores the IAB content taxonomy and extracted web content</strong></h3>\n<p>We store extracted text and images from a collection of web articles in an S3 bucket. We also store the IAB content taxonomy. As a first step, we concatenate different tiers on the taxonomy to create combined taxonomy terms. This approach helps maintain the taxonomy’s hierarchical structure when the BERT sentence transformer creates embeddings for each keyword. See the following code:</p>\n<pre><code class=\"lang-\">def prepare_taxonomy(taxonomy_df):\n \n &quot;&quot;&quot;\n Concatenate IAB Tech Lab content taxonomy tiers and prepare keywords for BERT embedding. \n Use this function as-is if using the IAB Content Taxonomy\n \n Parameters (input):\n ----------\n taxonomy_df : Content taxonomy dataframe\n\n Returns (output):\n -------\n df_clean : Content taxonomy with tiers in the taxonomy concatenated\n keyword_list: List of concatenated content taxonomy keywords\n ids: List of ids for the content taxonomy keywords\n &quot;&quot;&quot;\n \n df = taxonomy_df[['Unique ID ','Parent','Name','Tier 1','Tier 2','Tier 3']] \n df_str = df.astype({&quot;Unique ID &quot;: 'str', &quot;Parent&quot;: 'str', &quot;Tier 1&quot;: 'str', &quot;Tier 2&quot;: 'str', &quot;Tier 3&quot;: 'str'})\n df_clean = df_str.replace('nan','')\n \n #create a column that concatenates all tiers for each taxonomy keyword\n df_clean['combined']=df_clean[df_clean.columns[2:6]].apply(lambda x: ' '.join(x.dropna().astype(str)),axis=1)\n \n #turn taxonomy keyords to list of strings a prep for encoding with BERT sentence transformer\n keyword_list=df_clean['combined'].to_list()\n \n #get list of taxonomy ids\n ids = df_clean['Unique ID '].to_list() \n \n return df_clean, keyword_list, ids\n\ntaxonomy_df, taxonomy_terms, taxonomy_ids = prepare_taxonomy(read_taxonomy)\n\n</code></pre>\n<p>The following diagram illustrates the IAB context taxonomy with combined tiers.</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/ad78d2402dab45a092add935541a7400_image.png\" alt=\"image.png\" /></p>\n<h3><a id=\"Amazon_Comprehend_performs_topic_modeling_to_extract_common_themes_from_the_collection_of_articles_91\"></a><strong>Amazon Comprehend performs topic modeling to extract common themes from the collection of articles</strong></h3>\n<p>With the Amazon Comprehend topic modeling API, you analyze all the article texts using the Latent Dirichlet Allocation (LDA) model. The model examines each article in the corpus and groups keywords into the same topic based on the context and frequency in which they appear across the entire collection of articles. To ensure the LDA model detects highly coherent topics, you perform a preprocessing step prior to calling the Amazon Comprehend API. You can use the <a href=\"https://radimrehurek.com/gensim/\" target=\"_blank\">gensim library’s</a> CoherenceModel to determine the optimal number of topics to detect from the collection of articles or text files. See the following code:</p>\n<pre><code class=\"lang-\">def compute_coherence_scores(dictionary, corpus, texts, limit, start=2, step=3):\n &quot;&quot;&quot;\n Compute coherence scores for various number of topics for your topic model. \n Adjust the parameters below based on your data\n\n Parameters (input):\n ----------\n dictionary : Gensim dictionary created earlier from input texts\n corpus : Gensim corpus created earlier from input texts\n texts : List of input texts\n limit : The maximum number of topics to test. Amazon Comprehend can detect up to 100 topics in a collection\n\n Returns (output):\n -------\n models : List of LDA topic models\n coherence_scores : Coherence values corresponding to the LDA model with respective number of topics\n &quot;&quot;&quot;\n coherence_scores = []\n models = []\n for num_topics in range(start, limit, step):\n model = gensim.models.LdaMulticore(corpus=corpus, num_topics=num_topics, id2word=id2word)\n models.append(model)\n coherencemodel = CoherenceModel(model=model, texts=corpus_words, dictionary=id2word, coherence='c_v')\n coherence_scores.append(coherencemodel.get_coherence())\n\n return models, coherence_scores\n\nmodels, coherence_scores = compute_coherence_scores(dictionary=id2word, corpus=corpus_tdf, texts=corpus_words, start=2, limit=100, step=3)\n</code></pre>\n<p>After you get the optimal number of topics, you use that value for the Amazon Comprehend topic modeling job. Providing different values for the NumberOfTopics parameter in the Amazon Comprehend <a href=\"https://docs.aws.amazon.com/comprehend/latest/dg/API_StartTopicsDetectionJob.html#API_StartTopicsDetectionJob_RequestSyntax\" target=\"_blank\">StartTopicsDetectionJob operation</a> results in a variation in the distribution of keywords placed in each topic group. An optimized value for the NumberOfTopics parameter represents the number of topics that provide the most coherent grouping of keywords with higher contextual relevance. You can store the topic modeling output from Amazon Comprehend in its raw format in Amazon S3.</p>\n<h3><a id=\"The_Amazon_Rekognition_object_label_API_detects_labels_in_images_128\"></a><strong>The Amazon Rekognition object label API detects labels in images</strong></h3>\n<p>You analyze each image extracted from all webpages using the <a href=\"https://docs.aws.amazon.com/rekognition/latest/dg/labels-detect-labels-image.html\" target=\"_blank\">Amazon Rekognition DetectLabels operation</a>. For each image, the operation provides a JSON response with all labels detected within the image, coupled with a confidence score for each. For our use case, we arbitrarily select a confidence score of 60% or higher as the threshold for object labels to use in the next step. You store object labels in their raw format in Amazon S3. See the following code:</p>\n<pre><code class=\"lang-\">&quot;&quot;&quot;\nCreate a function to extract object labels from a given image using Amazon Rekognition\n&quot;&quot;&quot;\n\ndef get_image_labels(image_loc):\n labels = []\n with fs.open(image_loc, &quot;rb&quot;) as im:\n response = rekognition_client.detect_labels(Image={&quot;Bytes&quot;: im.read()})\n \n for label in response[&quot;Labels&quot;]:\n if label[&quot;Confidence&quot;] &gt;= 60: #change to desired confidence score threshold, value between [0,100]:\n object_label = label[&quot;Name&quot;]\n labels.append(object_label)\n return labels\n\n</code></pre>\n<h3><a id=\"CITM_maps_content_to_a_standard_taxonomy_149\"></a><strong>CITM maps content to a standard taxonomy</strong></h3>\n<p>CITM compares extracted content metadata (topics from text and labels from images) with keywords on the IAB taxonomy, and then maps the content metadata to keywords from the taxonomy that are semantically related. For this task, CITM completes the following three steps:</p>\n<ol>\n<li>Generate neural embeddings for the content taxonomy, topic keywords, and image labels using Hugging Face’s BERT sentence transformer. We access the sentence transformer model from <a href=\"https://aws.amazon.com/sagemaker/\" target=\"_blank\">Amazon SageMaker</a>. In this post, we use the <a href=\"https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2\" target=\"_blank\">paraphrase-MiniLM-L6-v2</a> model, which maps keywords and labels to a 384 dimensional dense vector space.</li>\n<li>Compute the cosine similarity score between taxonomy keywords and topic keywords using their embeddings. It also computes cosine similarity between the taxonomy keywords and the image object labels. We use cosine similarity as a scoring mechanism to find semantically similar matches between the content metadata and the taxonomy. See the following code:</li>\n</ol>\n<pre><code class=\"lang-\">def compute_similarity(entity_embeddings, entity_terms, taxonomy_embeddings, taxonomy_terms):\n &quot;&quot;&quot;\n Compute cosine scores between entity embeddings and taxonomy embeddings\n \n Parameters (input):\n ----------\n entity_embeddings : Embeddings for either topic keywords from Amazon Comprehend or image labels from Amazon Rekognition\n entity_terms : Terms for topic keywords or image labels\n taxonomy_embeddings : Embeddings for the content taxonomy\n taxonomy_terms : Terms for the taxonomy keywords\n\n Returns (output):\n -------\n mapping_df : Dataframe that matches each entity keyword to each taxonomy keyword and their cosine similarity score\n &quot;&quot;&quot;\n \n #calculate cosine score, pairing each entity embedding with each taxonomy keyword embedding\n cosine_scores = util.pytorch_cos_sim(entity_embeddings, taxonomy_embeddings)\n pairs = []\n for i in range(len(cosine_scores)-1):\n for j in range(0, cosine_scores.shape[1]):\n pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})\n \n #Sort cosine similarity scores in decreasing order\n pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)\n rows = []\n for pair in pairs:\n i, j = pair['index']\n rows.append([entity_terms[i], taxonomy_terms[j], pair['score']])\n \n #move sorted values to a dataframe\n mapping_df= pd.DataFrame(rows, columns=[&quot;term&quot;, &quot;taxonomy_keyword&quot;,&quot;cosine_similarity&quot;])\n mapping_df['cosine_similarity'] = mapping_df['cosine_similarity'].astype('float')\n mapping_df= mapping_df.sort_values(by=['term','cosine_similarity'], ascending=False)\n drop_dups= mapping_df.drop_duplicates(subset=['term'], keep='first')\n mapping_df = drop_dups.sort_values(by=['cosine_similarity'], ascending=False).reset_index(drop=True)\n return mapping_df\n \n#compute cosine_similairty score between topic keywords and content taxonomy keywords using BERT embeddings \ntext_taxonomy_mapping=compute_similarity(keyword_embeddings, topic_keywords, taxonomy_embeddings, taxonomy_terms)\n\n</code></pre>\n<ol start=\"3\">\n<li>Identify pairings with similarity scores that are above a user-defined threshold and use them to map the content to semantically related keywords on the content taxonomy. In our test, we select all keywords from pairings that have a cosine similarity score of 0.5 or higher. See the following code:</li>\n</ol>\n<pre><code class=\"lang-\">#merge text and image keywords mapped to content taxonomy\nrtb_keywords=pd.concat([text_taxonomy_mapping[[&quot;term&quot;,&quot;taxonomy_keyword&quot;,&quot;cosine_similarity&quot;]],image_taxonomy_mapping]).sort_values(by='cosine_similarity',ascending=False).reset_index(drop=True)\n\n#select keywords with a cosine_similarity score greater than your desired threshold ( the value should be from 0 to 1)\nrtb_keywords[rtb_keywords[&quot;cosine_similarity&quot;]&gt; 50] # change to desired threshold for cosine score, value between [0,100]:\n\n</code></pre>\n<p>A common challenge when working with internet-scale language representation (such as in this use case) is that you need a model that can fit most of the content—in this case, words in the English language. Hugging Face’s BERT transformer has been pre-trained using a large corpus of Wikipedia posts in the English language to represent the semantic meaning of words in relation to one another. You fine-tune the pre-trained model using your specific dataset of topic keywords, image labels, and taxonomy keywords. When you place all embeddings in the same feature space and visualize them, you see that BERT logically represents semantic similarity between terms.</p>\n<p>The following example visualizes IAB content taxonomy keywords for the class Automotive represented as vectors using BERT. BERT places Automotive keywords from the taxonomy close to semantically similar terms.</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/e894c0b81b734d5abd526f4998052c3e_image.png\" alt=\"image.png\" /></p>\n<p>The feature vectors allow CITM to compare the metadata labels and taxonomy keywords in the same feature space. In this feature space, CITM calculates cosine similarity between each feature vector for taxonomy keywords and each feature vector for topic keywords. In a separate step, CITM compares taxonomy feature vectors and feature vectors for image labels. Pairings with cosine scores closest to 1 are identified as semantically similar. Note that a pairing can either be a topic keyword and a taxonomy keyword, or an object label and a taxonomy keyword.</p>\n<p>The following screenshot shows example pairings of topic keywords and taxonomy keywords using cosine similarity calculated with BERT embeddings.</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/d22f26888e8544c0ac69cf7a0041a4bc_image.png\" alt=\"image.png\" /></p>\n<p>To map content to taxonomy keywords, CITM selects keywords from pairings with cosine scores that meet a user-defined threshold. These are the keywords that will be used on real-time bidding platforms to select ads for the webpage’s inventory. The result is a rich mapping of online content to the taxonomy.</p>\n<h3><a id=\"Optionally_store_content_to_taxonomy_mapping_in_a_metadata_store_228\"></a><strong>Optionally store content to taxonomy mapping in a metadata store</strong></h3>\n<p>After you identify contextually similar taxonomy terms from CITM, you need a way for low-latency APIs to access this information. In programmatic bidding for advertisements, low response time and high concurrency play an important role in monetizing the content. The schema for the data store needs to be flexible to accommodate additional metadata when needed to enrich bid requests. <a href=\"https://aws.amazon.com/dynamodb/\" target=\"_blank\">Amazon DynamoDB</a> can match the data access patterns and operational requirements for such a service.</p>\n<h3><a id=\"Conclusion_232\"></a><strong>Conclusion</strong></h3>\n<p>In this post, you learned how to build a taxonomy-based contextual targeting solution using Contextual Intelligence Taxonomy Mapper (CITM). You learned how to use Amazon Comprehend and Amazon Rekognition to extract granular metadata from your media assets. Then, using CITM you mapped the assets to an industry standard taxonomy to facilitate programmatic ad bidding for contextually related ads. You can apply this framework to other use cases that require use of a standard taxonomy to enhance the value of existing media assets.</p>\n<p>To experiment with CITM, you can access its <a href=\"https://github.com/aws-samples/contextual-ad-intelligence-on-aws/tree/main/contextual-intelligence-taxonomy-mapper\" target=\"_blank\">code repository</a> and use it with a text and image dataset of your choice.</p>\n<p>We recommend learning more about the solution components introduced in this post. Discover more about <a href=\"https://aws.amazon.com/machine-learning/ml-use-cases/media-intelligence/\" target=\"_blank\">AWS Media Intelligence</a> to extract metadata from media content. Also, learn more about how to use <a href=\"https://docs.aws.amazon.com/sagemaker/latest/dg/hugging-face.html\" target=\"_blank\">Hugging Face models for NLP using Amazon SageMaker</a>.</p>\n<h4><a id=\"About_the_Authors_240\"></a><strong>About the Authors</strong></h4>\n<p><img src=\"https://dev-media.amazoncloud.cn/05b301010d2c4279ac90c4119f11c648_image.png\" alt=\"image.png\" /></p>\n<p><strong>Aramide Kehinde</strong> is a Sr. Partner Solution Architect at AWS in Machine Learning and AI. Her career journey has spanned the areas of Business Intelligence and Advanced Analytics across multiple industries. She works to enable partners to build solutions with AWS AI/ML services that serve customers needs for innovation. She also enjoys building the intersection of AI and creative arenas and spending time with her family.</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/38fd2f4f873245d2804a12bea0dceb45_image.png\" alt=\"image.png\" /></p>\n<p><strong>Anuj Gupta</strong> is a Principal Solutions Architect working with hyper-growth companies on their cloud native journey. He is passionate about using technology to solve challenging problems and has worked with customers to build highly distributed and low latency applications. He contributes to open-source Serverless and Machine Learning solutions. Outside of work, he loves traveling with his family and writing poems and philosophical blogs.</p>\n"}
目录
亚马逊云科技解决方案 基于行业客户应用场景及技术领域的解决方案
联系亚马逊云科技专家
亚马逊云科技解决方案
基于行业客户应用场景及技术领域的解决方案
联系专家
0
目录
关闭