{"value":"Organizations today have continuous incoming data, and analyzing this data in a timely fashion is becoming a common requirement for data analytics and machine learning (ML) use cases. Clean, well-prepared data is a prerequisite for insights that enable enterprises to get the most out of their data for business growth and profitability. You can now use [AWS Glue DataBrew](https://aws.amazon.com/glue/features/databrew/), a visual data preparation tool that makes it easy to transform and prepare datasets for analytics and ML workloads.\n\nAs we build these data analytics pipelines, we can decouple the jobs by making the workflow event-driven. In this post, we walk through how to trigger a DataBrew job automatically on an event generated by another DataBrew job, using [Amazon EventBridge](https://aws.amazon.com/eventbridge/) and [AWS Step Functions](https://aws.amazon.com/step-functions).\n\n#### **Overview of solution**\nThe following diagram illustrates the architecture of the solution. We use [AWS CloudFormation](http://aws.amazon.com/cloudformation) to deploy an EventBridge rule, an [Amazon Simple Queue Service](https://aws.amazon.com/sqs/) (Amazon SQS) queue, and Step Functions resources to trigger the second DataBrew job.\n\n![image.png](https://dev-media.amazoncloud.cn/17e10474ef8f434fabe498bab7b5f70d_image.png)\n\nThe steps in this solution are as follows:\n\n1. Import your dataset to [Amazon Simple Storage Service](http://aws.amazon.com/s3) (Amazon S3).\n2. DataBrew reads the data from Amazon S3 and applies the transformations defined in a recipe.\n3. The first DataBrew recipe job writes the output to an S3 bucket.\n4. When the first recipe job completes, it emits an event to EventBridge.\n5. A Step Functions state machine is invoked based on the event, which in turn invokes the second DataBrew recipe job for further processing.\n6. 
The event is delivered to the [dead-letter queue](https://docs.aws.amazon.com/eventbridge/latest/userguide/rule-dlq.html) if the rule in EventBridge can’t invoke the state machine successfully.\n7. The second DataBrew job reads the first job’s output from the S3 bucket and applies the transformations defined in its recipe.\n8. The second DataBrew recipe job writes the output to the same S3 bucket.\n\n#### **Prerequisites**\n\nTo use this solution, you need the following prerequisites:\n\n- An [AWS account](https://signin.aws.amazon.com/signin?redirect_uri=https%3A%2F%2Fportal.aws.amazon.com%2Fbilling%2Fsignup%2Fresume&client_id=signup)\n- [AWS Identity and Access Management](http://aws.amazon.com/iam) (IAM) permissions for DataBrew (for more information, see [Setting up IAM policies for DataBrew](https://docs.aws.amazon.com/databrew/latest/dg/setting-up-iam-policies-for-databrew.html))\n- An S3 bucket (to store data)\n\n#### **Load the dataset into Amazon S3**\nFor this post, we use the [Credit Card customers](https://www.kaggle.com/sakshigoyal7/credit-card-customers) sample dataset from Kaggle. This data consists of 10,000 customers, including their age, salary, marital status, credit card limit, credit card category, and more. Download the sample dataset and upload it to your S3 bucket. We recommend creating all your resources in the same account and Region.\n\n#### **Create a DataBrew project**\nTo create a DataBrew project, complete the following steps:\n\n1. On the DataBrew console, choose **Projects** and choose **Create project**.\n2. For **Project name**, enter ```marketing-campaign-project-1```.\n3. For **Select a dataset**, select **New dataset**.\n\n![image.png](https://dev-media.amazoncloud.cn/9db6f0251dcb46f0a6d7526e81163573_image.png)\n\n4. Under **Data lake/data store**, choose **Amazon S3**.\n5. For **Enter your source from S3**, enter the S3 path of the sample dataset.\n6. 
Select the dataset CSV file.\n\n![image.png](https://dev-media.amazoncloud.cn/d489a661d390405e99b864ec7691761d_image.png)\n\n7. Under **Permissions**, for **Role name**, choose an existing IAM role created during the prerequisites or create a new role.\n8. For **New IAM role suffix**, enter a suffix.\n\n![image.png](https://dev-media.amazoncloud.cn/5725a00f24794409827a4ac71d0801b8_image.png)\n\n9. Choose **Create project**.\n\nAfter the project is opened, a DataBrew interactive session is created. DataBrew retrieves sample data based on your sampling configuration selection.\n\n#### **Create the DataBrew jobs**\nNow we can create the recipe jobs.\n1. On the DataBrew console, in the navigation pane, choose **Projects**.\n2. On the **Projects** page, select the project ```marketing-campaign-project-1```.\n3. Choose **Open project** and choose **Add step**.\n\n![image.png](https://dev-media.amazoncloud.cn/1d507fa357724b139c8bbaad32679024_image.png)\n\n4. In this step, we choose **Delete** to drop the columns that aren’t required for this exercise.\n\nYou can choose from over 250 built-in functions to merge, pivot, and transpose the data without writing code.\n\n![image.png](https://dev-media.amazoncloud.cn/cac927c969d34e68a2f441cefbc259b1_image.png)\n\n5. Select the columns to delete and choose **Apply**.\n\n![image.png](https://dev-media.amazoncloud.cn/a984803075b743518cc3f9b22d2d51e1_image.png)\n\n6. Choose **Create job**.\n\n![image.png](https://dev-media.amazoncloud.cn/c764ef9faaa247f09083a9eb983d39b3_image.png)\n\n7. For **Job name**, enter ```marketing-campaign-job1```.\n\n![image.png](https://dev-media.amazoncloud.cn/29343f2e925b4a94896dcb9502f84da2_image.png)\n\n8. Under **Job output settings**, for **File type**, choose your final storage format (for this post, we choose **CSV**).\n9. 
For **S3 location**, enter your final S3 output bucket path.\n\n![image.png](https://dev-media.amazoncloud.cn/27498f68aa3a47d8a799cba43721f1d5_image.png)\n\n10. Under **Settings**, for **File output storage**, select **Replace output files for each job run**.\n11. Choose **Save**.\n\n![image.png](https://dev-media.amazoncloud.cn/033264a70b174ba4bcd79ff15432379c_image.png)\n\n12. Under **Permissions**, for **Role name**, choose an existing role created during the prerequisites or create a new role.\n13. Choose **Create job**.\n\n![image.png](https://dev-media.amazoncloud.cn/b099f161570a48208d853387ae21a624_image.png)\n\nNow we repeat the same steps to create another DataBrew project and DataBrew job.\n\n14. For this post, we name the second project ```marketing-campaign-project2``` and the job ```marketing-campaign-job2```.\n15. When you create the new project, this time use the output file location of ```marketing-campaign-job1``` as the new dataset.\n16. For this job, we deselect **Unknown** and **Uneducated** in the **Education_Level** column.\n\n![image.png](https://dev-media.amazoncloud.cn/2f92e53ec55843a2ad43065c2f810815_image.png)\n\n#### **Deploy your resources using CloudFormation**\n\nFor a quick start of this solution, we deploy the resources with a CloudFormation stack. The stack creates the EventBridge rule, SQS queue, and Step Functions state machine in your account to trigger the second DataBrew job when the first job runs successfully.\n\n1. Choose **Launch Stack**:\n[![image.png](https://dev-media.amazoncloud.cn/a889e5ed79a449098a6ee00e6a67f7ae_image.png)](https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/BDB-1795/DataBrew+Jobs+event+based+trigger.yaml)\n2. For **DataBrew source job name**, enter ```marketing-campaign-job1```.\n3. For **DataBrew target job name**, enter ```marketing-campaign-job2```.\n4. 
For both IAM role configurations, make the following choice:\n\t- If you choose **Create a new Role**, the stack automatically creates a role for you.\n\t- If you choose **Attach an existing IAM role**, you must populate the IAM role ARN manually in the following field or else the stack creation fails.\n5. Choose **Next**.\n\n![image.png](https://dev-media.amazoncloud.cn/cf0146f806d44e08a8dcfd85abd5ef18_image.png)\n\n6. Select the two acknowledgement check boxes.\n7. Choose **Create stack**.\n\n![image.png](https://dev-media.amazoncloud.cn/de6c67d2ac5b4da2982d3a5596f6c9c8_image.png)\n\n#### **Test the solution**\nTo test the solution, complete the following steps:\n\n1. On the DataBrew console, choose **Jobs**.\n2. Select the job ```marketing-campaign-job1``` and choose **Run job**.\n\nThis action automatically triggers the second job, ```marketing-campaign-job2```, via EventBridge and Step Functions.\n\n![image.png](https://dev-media.amazoncloud.cn/55cbe882e0d3464d98fed156fda6c2da_image.png)\n\n3. When both jobs are complete, open the output link for ```marketing-campaign-job2```.\n\n![image.png](https://dev-media.amazoncloud.cn/16cf4152d47d41f2892c986717b1d1d0_image.png)\n\nYou’re redirected to the Amazon S3 console to access the output file.\n\n![image.png](https://dev-media.amazoncloud.cn/513a56002239403c9a1b0bdf8a7e8806_image.png)\n\nIn this solution, we created a workflow that required minimal code. The first job triggers the second job, and both jobs deliver the transformed data files to Amazon S3.\n\n#### **Clean up**\nTo avoid incurring future charges, delete all the resources created during this walkthrough:\n\n- IAM roles\n- DataBrew projects and their associated recipe jobs\n- S3 bucket\n- CloudFormation stack\n\n#### **Conclusion**\nIn this post, we walked through how to use DataBrew along with EventBridge and Step Functions to run a DataBrew job that automatically triggers another DataBrew job. 
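The trigger works because DataBrew emits a job state change event that the deployed EventBridge rule filters on. As a rough sketch (the field names are taken from the DataBrew job state change event and the job name from this post; verify them against the rule the CloudFormation stack actually creates), the rule’s event pattern looks something like the following, shown here with a toy matcher that only illustrates the filtering semantics:

```python
# Sketch only: an event pattern an EventBridge rule for this solution could
# use. Field names follow the DataBrew job state change event; verify them
# against the rule the CloudFormation stack creates.
event_pattern = {
    'source': ['aws.databrew'],
    'detail-type': ['DataBrew Job State Change'],
    'detail': {
        'jobName': ['marketing-campaign-job1'],
        'state': ['SUCCEEDED'],
    },
}

def matches(pattern, event):
    # Toy matcher: each pattern key must list the event's value; nested
    # dicts are matched recursively (a simplification of EventBridge rules).
    for key, allowed in pattern.items():
        value = event.get(key)
        if isinstance(allowed, dict):
            if not (isinstance(value, dict) and matches(allowed, value)):
                return False
        elif value not in allowed:
            return False
    return True

# A successful run of the first job matches the pattern; a failed run does not.
succeeded = {
    'source': 'aws.databrew',
    'detail-type': 'DataBrew Job State Change',
    'detail': {'jobName': 'marketing-campaign-job1', 'state': 'SUCCEEDED'},
}
failed = dict(succeeded, detail={'jobName': 'marketing-campaign-job1', 'state': 'FAILED'})

print(matches(event_pattern, succeeded))  # True
print(matches(event_pattern, failed))     # False
```

In the deployed solution, EventBridge itself evaluates the pattern and invokes the Step Functions state machine as the rule target; the matcher above is only for illustration.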
We encourage you to use this pattern to build event-driven pipelines that chain multiple DataBrew jobs to run in sequence.\n\n##### **About the Authors**\n\n![image.png](https://dev-media.amazoncloud.cn/dabff0980cce412eb043dd696ef63f33_image.png)\n\n**Nipun Chagari** is a Senior Solutions Architect at AWS, where he helps customers build highly available, scalable, and resilient applications on the AWS Cloud. He is passionate about helping customers adopt serverless technology to meet their business objectives.\n\n![image.png](https://dev-media.amazoncloud.cn/d08f88de1e2648ceaabd778bc0eb52bc_image.png)\n\n**Prarthana Angadi** is a Software Development Engineer II at AWS, where she has been expanding what is possible with code in order to make life more efficient for AWS customers."}