Synchronize your Amazon Glue Studio Visual Jobs to different environments

[AWS Glue](https://aws.amazon.com/glue/) has become a popular option for integrating data from disparate data sources due to its ability to integrate large volumes of data using distributed data processing frameworks. Many customers use AWS Glue to build data lakes and data warehouses. Data engineers who prefer to develop data processing pipelines visually can use [AWS Glue Studio](https://docs.aws.amazon.com/glue/latest/ug/what-is-glue-studio.html) to create data integration jobs. This post introduces the **Glue Job Visual API**, which lets you author Glue Studio Visual Jobs programmatically, and the **Glue Job Sync Utility**, which uses the API to synchronize Glue jobs to different environments without losing the visual representation.

### **Glue Job Visual API**

AWS Glue Studio has a graphical interface called the **Visual Editor** that makes it easy to author extract, transform, and load (ETL) jobs in AWS Glue. Glue jobs created in the Visual Editor contain a visual representation that describes the data transformation. In this post, we call these jobs **Glue Studio Visual Jobs**.

It's common to develop and test AWS Glue jobs in a dev account, and then promote the jobs to a prod account. Previously, when you copied AWS Glue Studio Visual Jobs to a different environment, there was no mechanism to copy the visual representation along with them. The visual representation of the job was lost, and you could only copy the code produced by Glue Studio. Copying the code or recreating the job by hand can be time consuming and tedious.

The [AWS Glue Job Visual API](https://docs.aws.amazon.com/glue/latest/ug/visual-job-api-chapter.html) lets you programmatically create and update Glue Studio Visual Jobs by providing a JSON object that describes the visual representation, and also retrieve the visual representation from existing Glue Studio Visual Jobs. A Glue Studio Visual Job consists of data source nodes for reading the data, transform nodes for modifying the data, and data target nodes for writing the data.

![image.png](https://dev-media.amazoncloud.cn/07109824b9184588b4015c8fc72451e3_image.png)

Typical use cases for the Glue Job Visual API include:

- Automate the creation of Glue Studio Visual Jobs.
- Migrate your ETL jobs from third-party or on-premises ETL tools to AWS Glue. Many AWS partners, such as [Bitwise](https://www.bitwiseglobal.com/risk-free-etl-migration-to-aws-glue/), [Bladebridge](https://wavicledata.com/migrate-your-data-to-aws-glue-in-days/), and others have built converters from third-party ETL tools to AWS Glue.
- Synchronize AWS Glue Studio Visual Jobs from one environment to another without losing the visual representation.

In this post, we focus on a utility that uses the Glue Job Visual API to synchronize your Glue Studio Visual Jobs in bulk without losing the visual representation.

### **Glue Job Sync Utility**

There are several common reasons to synchronize Glue Studio Visual Jobs between environments:

- Promote Glue Studio Visual Jobs from a dev account to a prod account.
- Transfer ownership of Glue Studio Visual Jobs between different AWS accounts.
- Replicate Glue Studio Visual Job configurations from one Region to another for disaster recovery purposes.

The **Glue Job Sync Utility** is a Python application built on top of the **Glue Job Visual API**. It lets you synchronize your AWS Glue Studio Visual Jobs to different accounts and environments without losing the visual representation. The utility requires that you provide source and target AWS environment profiles. Optionally, you can provide a list of jobs that you want to synchronize, and specify how the utility should replace your environment-specific objects using a mapping file. For example, the [Amazon Simple Storage Service (Amazon S3)](https://aws.amazon.com/s3/) locations and the IAM role in your development environment can differ from those in your production environment. The mapping config file is used to replace these environment-specific objects.

### **How to use Glue Job Sync Utility**

In this example, we synchronize two AWS Glue Studio Visual Jobs, ```test1``` and ```test2```, from the development environment to the production environment in a different account.

- Source environment (dev environment)
  - AWS Account ID: ```123456789012```
  - AWS Region: eu-west-3 (Paris)
  - AWS Glue Studio Visual Jobs: ```test1```, ```test2```
  - AWS Identity and Access Management (IAM) Role ARN for the Glue job execution role: ```arn:aws:iam::123456789012:role/GlueServiceRole```
  - Amazon S3 bucket for the Glue job script and other assets: ```s3://aws-glue-assets-123456789012-eu-west-3/```
  - Amazon S3 bucket for data: ```s3://dev-environment/```
- Destination environment (prod environment)
  - AWS Account ID: ```234567890123```
  - AWS Region: eu-west-3 (Paris)
  - IAM Role ARN for the Glue job execution role: ```arn:aws:iam::234567890123:role/GlueServiceRole```
  - Amazon S3 bucket for the Glue job script and other assets: ```s3://aws-glue-assets-234567890123-eu-west-3/```
  - Amazon S3 bucket for data: ```s3://prod-environment/```

### **Set up the utility in your local environment**

You need the following prerequisites for this utility:

- Python 3.6 or later.
- [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html) 1.24.39 or later.
- Two AWS named profiles, ```dev``` and ```prod```, with the corresponding credentials in your environment. To create them, follow [these instructions](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-profiles.html).

### **Download the Glue Job Sync Utility**

Download the sync utility from the [GitHub repository](https://github.com/aws-samples/aws-glue-samples/tree/master/utilities/job_sync) to your local machine.

### **Create AWS Glue Studio Visual Jobs**

1. Create two AWS Glue Studio Visual Jobs, ```test1``` and ```test2```, in the source account.

- If you don't have any AWS Glue Studio Visual Jobs yet, create them in AWS Glue Studio before continuing.

![image.png](https://dev-media.amazoncloud.cn/53769be4d2ff4d63aaaf22d5d19eb9ec_image.png)

2. Open AWS Glue Studio in the destination account and verify that the ```test1``` and ```test2``` jobs aren't present.

![image.png](https://dev-media.amazoncloud.cn/a554a4d92c314bec86b0a71ccfab531e_image.png)

### **Run the Job Sync Utility**

1. Create a new file named ```mapping.json```, and enter the following JSON code. With the first entry, the sync utility replaces every reference to the source Amazon S3 location within the job (in this case ```s3://aws-glue-assets-123456789012-eu-west-3```) with the mapped location (in this case ```s3://aws-glue-assets-234567890123-eu-west-3```). Likewise, the second and third entries trigger the corresponding substitutions for the IAM role ARN and the data bucket. The utility then creates the job in the destination environment. **Note that these are example values and you'll need to substitute the right values that match your environment**.

```
{
  "s3://aws-glue-assets-123456789012-eu-west-3": "s3://aws-glue-assets-234567890123-eu-west-3",
  "arn:aws:iam::123456789012:role/GlueServiceRole": "arn:aws:iam::234567890123:role/GlueServiceRole",
  "s3://dev-environment": "s3://prod-environment"
}
```

2. Execute the utility by running the following command:

```
$ python3 sync.py --src-profile dev --src-region eu-west-3 --dst-profile prod --dst-region eu-west-3 --src-job-names test1,test2 --config-path mapping.json
```

3. Verify successful synchronization by opening AWS Glue Studio in the destination account:

![image.png](https://dev-media.amazoncloud.cn/84f2730740fb4a5eb81e96b7d02f0e47_image.png)

4. Open the Glue Studio Visual Jobs ```test1``` and ```test2```, and verify the visual representation of the DAG.

![image.png](https://dev-media.amazoncloud.cn/ba7c4b0e47fb4994a67b71cb8e7cbc5e_image.png)

The screenshot above shows that the jobs ```test1``` and ```test2``` were copied into the destination account with their DAG preserved.

### **Conclusion**

The AWS Glue Job Visual API and the AWS Glue Job Sync Utility simplify how you synchronize your jobs to different environments. They are designed to integrate easily into your continuous integration pipelines while retaining the visual representation that improves the readability of the ETL pipeline.

#### **About the Authors**

![image.png](https://dev-media.amazoncloud.cn/8378d37db1eb46a595b90b716bfb93f5_image.png)

**Noritaka Sekiyama** is a Principal Big Data Architect on the AWS Glue team. He is responsible for designing AWS features, implementing software artifacts, and helping customers with their architectures. In his spare time, he enjoys watching anime on Prime Video.

![image.png](https://dev-media.amazoncloud.cn/aaa438bcb10d42ed8d5e5f83599ffadb_image.png)

**Aaron Meltzer** is a Software Engineer on the AWS Glue Studio team. He leads the design and implementation of features to simplify the management of AWS Glue jobs. Outside of work, Aaron likes to read and learn new recipes.

![image.png](https://dev-media.amazoncloud.cn/e00f0aeb5e5649e8925285cb68c18a8d_image.png)

**Mohamed Kiswani** is the Software Development Manager on the AWS Glue team.

![image.png](https://dev-media.amazoncloud.cn/34da63ec6fbc408a96060188b602ac4a_image.png)

**Shiv Narayanan** is a Senior Technical Product Manager on the AWS Glue team.
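
As a supplement to the "Run the Job Sync Utility" section, the following is a minimal sketch of how a mapping file like ```mapping.json``` could be applied to a job definition before the job is created in the destination environment. It is an illustration under stated assumptions, not the utility's actual implementation: the ```apply_mapping``` helper is hypothetical, and the ```job``` dictionary is a heavily simplified stand-in for a real Glue job definition (the node ID ```node-1``` and the script path are made up for the example).

```python
import json
from typing import Any, Dict


def apply_mapping(value: Any, mapping: Dict[str, str]) -> Any:
    """Return a copy of `value` with every mapped substring replaced in all strings.

    Hypothetical helper for illustration; the real sync utility may work differently.
    """
    if isinstance(value, str):
        # Replace longer keys first so a short key cannot clobber part of a longer one.
        for source in sorted(mapping, key=len, reverse=True):
            value = value.replace(source, mapping[source])
        return value
    if isinstance(value, list):
        return [apply_mapping(item, mapping) for item in value]
    if isinstance(value, dict):
        # Only values are rewritten; dictionary keys are left untouched.
        return {key: apply_mapping(item, mapping) for key, item in value.items()}
    return value  # numbers, booleans, and None pass through unchanged


# Same entries as the example mapping.json in this post.
mapping = {
    "s3://aws-glue-assets-123456789012-eu-west-3": "s3://aws-glue-assets-234567890123-eu-west-3",
    "arn:aws:iam::123456789012:role/GlueServiceRole": "arn:aws:iam::234567890123:role/GlueServiceRole",
    "s3://dev-environment": "s3://prod-environment",
}

# Simplified, made-up job definition (assumption: a real one has many more fields).
job = {
    "Role": "arn:aws:iam::123456789012:role/GlueServiceRole",
    "Command": {
        "ScriptLocation": "s3://aws-glue-assets-123456789012-eu-west-3/scripts/test1.py"
    },
    "CodeGenConfigurationNodes": {
        "node-1": {"S3CsvSource": {"Paths": ["s3://dev-environment/input/"]}}
    },
}

promoted = apply_mapping(job, mapping)
print(json.dumps(promoted, indent=2))
```

Walking the structure and rewriting only string values leaves the source definition unmodified and avoids depending on how the JSON is serialized. Note that the replacements are applied one after another, so this sketch assumes that no mapped target value itself contains another mapping key.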