{"value":"Organizations today have continuous incoming data, and analyzing this data in a timely fashion is becoming a common requirement for data analytics and machine learning (ML) use cases. Clean, well-prepared data is a prerequisite for insights that enable enterprises to get the most out of their data for business growth and profitability. You can now use [AWS Glue DataBrew](https://aws.amazon.com/glue/features/databrew/), a visual data preparation tool that makes it easy to transform and prepare datasets for analytics and ML workloads.\n\nAs we build these data analytics pipelines, we can decouple the jobs by making the workflow event-driven. In this post, we walk through how to trigger a DataBrew job automatically on an event generated by another DataBrew job, using [Amazon EventBridge](https://aws.amazon.com/eventbridge/) and [AWS Step Functions](https://aws.amazon.com/step-functions).\n\n#### **Overview of solution**\nThe following diagram illustrates the architecture of the solution. We use [AWS CloudFormation](http://aws.amazon.com/cloudformation) to deploy an EventBridge rule, an [Amazon Simple Queue Service](https://aws.amazon.com/sqs/) (Amazon SQS) queue, and Step Functions resources to trigger the second DataBrew job.\n\n![image.png](https://dev-media.amazoncloud.cn/17e10474ef8f434fabe498bab7b5f70d_image.png)\n\nThe steps in this solution are as follows:\n\n1. Import your dataset to [Amazon Simple Storage Service](http://aws.amazon.com/s3) (Amazon S3).\n2. DataBrew reads the data from Amazon S3 and applies the transformations defined in a recipe.\n3. The first DataBrew recipe job writes the output to an S3 bucket.\n4. When the first recipe job completes, it emits an event to EventBridge.\n5. A Step Functions state machine is invoked based on the event, which in turn invokes the second DataBrew recipe job for further processing.\n6. 
The event is delivered to the [dead-letter queue](https://docs.aws.amazon.com/eventbridge/latest/userguide/rule-dlq.html) if the rule in EventBridge can’t invoke the state machine successfully.\n7. The second DataBrew job reads the first job’s output from the S3 bucket and applies the transformations defined in its recipe.\n8. The second DataBrew recipe job writes the output to the same S3 bucket.\n\n#### **Prerequisites**\n\nTo use this solution, you need the following prerequisites:\n\n- An [AWS account](https://signin.aws.amazon.com/signin?redirect_uri=https%3A%2F%2Fportal.aws.amazon.com%2Fbilling%2Fsignup%2Fresume&client_id=signup)\n- [AWS Identity and Access Management](http://aws.amazon.com/iam) (IAM) permissions for DataBrew (for more information, see [Setting up IAM policies for DataBrew](https://docs.aws.amazon.com/databrew/latest/dg/setting-up-iam-policies-for-databrew.html))\n- An S3 bucket (to store data)\n\n#### **Load the dataset into Amazon S3**\nFor this post, we use the [Credit Card customers](https://www.kaggle.com/sakshigoyal7/credit-card-customers) sample dataset from Kaggle. This data consists of 10,000 customers, including their age, salary, marital status, credit card limit, credit card category, and more. Download the sample dataset and upload it to your S3 bucket. We recommend creating all your resources in the same account and Region.\n\n#### **Create a DataBrew project**\nTo create a DataBrew project, complete the following steps:\n\n1. On the DataBrew console, choose **Projects** and choose **Create project**.\n2. For **Project name**, enter ```marketing-campaign-project-1```.\n3. For **Select a dataset**, select **New dataset**.\n\n![image.png](https://dev-media.amazoncloud.cn/9db6f0251dcb46f0a6d7526e81163573_image.png)\n\n4. Under **Data lake/data store**, choose **Amazon S3**.\n5. For **Enter your source from S3**, enter the S3 path of the sample dataset.\n6. 
Select the dataset CSV file.\n\n![image.png](https://dev-media.amazoncloud.cn/d489a661d390405e99b864ec7691761d_image.png)\n\n7. Under **Permissions**, for **Role name**, choose an existing IAM role created during the prerequisites or create a new role.\n8. For **New IAM role suffix**, enter a suffix.\n\n![image.png](https://dev-media.amazoncloud.cn/5725a00f24794409827a4ac71d0801b8_image.png)\n\n9. Choose **Create project**.\n\nAfter the project is opened, a DataBrew interactive session is created. DataBrew retrieves sample data based on your sampling configuration selection.\n\n#### **Create the DataBrew jobs**\nNow we can create the recipe jobs.\n1. On the DataBrew console, in the navigation pane, choose **Projects**.\n2. On the **Projects** page, select the project ```marketing-campaign-project-1```.\n3. Choose **Open project** and choose **Add step**.\n\n![image.png](https://dev-media.amazoncloud.cn/1d507fa357724b139c8bbaad32679024_image.png)\n\n4. In this step, we choose **Delete** to drop the columns that aren’t required for this exercise.\n\nYou can choose from over 250 built-in functions to merge, pivot, and transpose the data without writing code.\n\n![image.png](https://dev-media.amazoncloud.cn/cac927c969d34e68a2f441cefbc259b1_image.png)\n\n5. Select the columns to delete and choose **Apply**.\n\n![image.png](https://dev-media.amazoncloud.cn/a984803075b743518cc3f9b22d2d51e1_image.png)\n\n6. Choose **Create job**.\n\n![image.png](https://dev-media.amazoncloud.cn/c764ef9faaa247f09083a9eb983d39b3_image.png)\n\n7. For **Job name**, enter ```marketing-campaign-job1```.\n\n![image.png](https://dev-media.amazoncloud.cn/29343f2e925b4a94896dcb9502f84da2_image.png)\n\n8. Under **Job output settings**, for **File type**, choose your final storage format (for this post, we choose **CSV**).\n9. 
For **S3 location**, enter your final S3 output bucket path.\n\n![image.png](https://dev-media.amazoncloud.cn/27498f68aa3a47d8a799cba43721f1d5_image.png)\n\n10. Under **Settings**, for **File output storage**, select **Replace output files for each job run**.\n11. Choose **Save**.\n\n![image.png](https://dev-media.amazoncloud.cn/033264a70b174ba4bcd79ff15432379c_image.png)\n\n12. Under **Permissions**, for **Role name**, choose an existing role created during the prerequisites or create a new role.\n13. Choose **Create job**.\n\n![image.png](https://dev-media.amazoncloud.cn/b099f161570a48208d853387ae21a624_image.png)\n\nNow we repeat the same steps to create another DataBrew project and DataBrew job.\n\n14. For this post, we name the second project ```marketing-campaign-project2``` and the job ```marketing-campaign-job2```.\n15. When you create the new project, this time use the output file location of ```marketing-campaign-job1``` as the new dataset.\n16. For this job, we deselect **Unknown** and **Uneducated** in the **Education_Level** column.\n\n![image.png](https://dev-media.amazoncloud.cn/2f92e53ec55843a2ad43065c2f810815_image.png)\n\n#### **Deploy your resources using CloudFormation**\n\nFor a quick start of this solution, we deploy the resources with a CloudFormation stack. The stack creates the EventBridge rule, SQS queue, and Step Functions state machine in your account to trigger the second DataBrew job when the first job runs successfully.\n\n1. Choose **Launch Stack**:\n[![image.png](https://dev-media.amazoncloud.cn/a889e5ed79a449098a6ee00e6a67f7ae_image.png)](https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/BDB-1795/DataBrew+Jobs+event+based+trigger.yaml)\n2. For **DataBrew source job name**, enter ```marketing-campaign-job1```.\n3. For **DataBrew target job name**, enter ```marketing-campaign-job2```.\n4. 
For both IAM role configurations, make the following choice:\n\t- If you choose **Create a new Role**, the stack automatically creates a role for you.\n\t- If you choose **Attach an existing IAM role**, you must populate the IAM role ARN manually in the following field or else the stack creation fails.\n5. Choose **Next**.\n\n![image.png](https://dev-media.amazoncloud.cn/cf0146f806d44e08a8dcfd85abd5ef18_image.png)\n\n6. Select the two acknowledgement check boxes.\n7. Choose **Create stack**.\n\n![image.png](https://dev-media.amazoncloud.cn/de6c67d2ac5b4da2982d3a5596f6c9c8_image.png)\n\n#### **Test the solution**\nTo test the solution, complete the following steps:\n\n1. On the DataBrew console, choose **Jobs**.\n2. Select the job ```marketing-campaign-job1``` and choose **Run job**.\n\nThis action automatically triggers the second job, ```marketing-campaign-job2```, via EventBridge and Step Functions.\n\n![image.png](https://dev-media.amazoncloud.cn/55cbe882e0d3464d98fed156fda6c2da_image.png)\n\n3. When both jobs are complete, open the output link for ```marketing-campaign-job2```.\n\n![image.png](https://dev-media.amazoncloud.cn/16cf4152d47d41f2892c986717b1d1d0_image.png)\n\nYou’re redirected to the Amazon S3 console to access the output file.\n\n![image.png](https://dev-media.amazoncloud.cn/513a56002239403c9a1b0bdf8a7e8806_image.png)\n\nIn this solution, we created a workflow that required minimal code. The first job triggers the second job, and both jobs deliver the transformed data files to Amazon S3.\n\n#### **Clean up**\nTo avoid incurring future charges, delete all the resources created during this walkthrough:\n\n- IAM roles\n- DataBrew projects and their associated recipe jobs\n- S3 bucket\n- CloudFormation stack\n\n#### **Conclusion**\nIn this post, we walked through how to use DataBrew along with EventBridge and Step Functions to run a DataBrew job that automatically triggers another DataBrew job. 
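The trigger works because DataBrew emits a job state change event that the deployed EventBridge rule filters on. As a rough sketch (the field names are taken from the DataBrew job state change event and the job name from this post; verify them against the rule the CloudFormation stack actually creates), the rule’s event pattern looks something like the following, shown here with a toy matcher that only illustrates the filtering semantics:

```python
# Sketch only: an event pattern an EventBridge rule for this solution could
# use. Field names follow the DataBrew job state change event; verify them
# against the rule the CloudFormation stack creates.
event_pattern = {
    'source': ['aws.databrew'],
    'detail-type': ['DataBrew Job State Change'],
    'detail': {
        'jobName': ['marketing-campaign-job1'],
        'state': ['SUCCEEDED'],
    },
}

def matches(pattern, event):
    # Toy matcher: each pattern key must list the event's value; nested
    # dicts are matched recursively (a simplification of EventBridge rules).
    for key, allowed in pattern.items():
        value = event.get(key)
        if isinstance(allowed, dict):
            if not (isinstance(value, dict) and matches(allowed, value)):
                return False
        elif value not in allowed:
            return False
    return True

# A successful run of the first job matches the pattern; a failed run does not.
succeeded = {
    'source': 'aws.databrew',
    'detail-type': 'DataBrew Job State Change',
    'detail': {'jobName': 'marketing-campaign-job1', 'state': 'SUCCEEDED'},
}
failed = dict(succeeded, detail={'jobName': 'marketing-campaign-job1', 'state': 'FAILED'})

print(matches(event_pattern, succeeded))  # True
print(matches(event_pattern, failed))     # False
```

In the deployed solution, EventBridge itself evaluates the pattern and invokes the Step Functions state machine as the rule target; the matcher above is only for illustration.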
We encourage you to use this pattern to build event-driven pipelines that chain multiple DataBrew jobs to run in sequence.\n\n##### **About the Authors**\n\n![image.png](https://dev-media.amazoncloud.cn/dabff0980cce412eb043dd696ef63f33_image.png)\n\n**Nipun Chagari** is a Senior Solutions Architect at AWS, where he helps customers build highly available, scalable, and resilient applications on the AWS Cloud. He is passionate about helping customers adopt serverless technology to meet their business objectives.\n\n![image.png](https://dev-media.amazoncloud.cn/d08f88de1e2648ceaabd778bc0eb52bc_image.png)\n\n**Prarthana Angadi** is a Software Development Engineer II at AWS, where she has been expanding what is possible with code in order to make life more efficient for AWS customers."}