Develop an Amazon Redshift ETL serverless framework using RSQL, Amazon Batch, and Amazon Step Functions

海外精选
海外精选的内容汇集了全球优质的亚马逊云科技相关技术内容。同时,内容中提到的“AWS” 是 “Amazon Web Services” 的缩写,在此网站不作为商标展示。
0
0
{"value":"[Amazon Redshift RSQL](https://docs.aws.amazon.com/redshift/latest/mgmt/rsql-query-tool.html) is a command-line client for interacting with [Amazon Redshift](https://aws.amazon.com/cn/redshift/) clusters and databases. You can connect to an Amazon Redshift cluster, describe database objects, query data, and view query results in various output formats. You can use enhanced control flow commands to replace existing extract, transform, load (ETL) and automation scripts.\n\nThis post explains how you can create a fully serverless and cost-effective Amazon Redshift ETL orchestration framework. To achieve this, you can use Amazon Redshift RSQL and Amazon Web Services services such as [Amazon Web Services Batch](https://aws.amazon.com/cn/batch/) and [Amazon Web Services Step Functions](https://aws.amazon.com/cn/step-functions/?step-functions.sort-by=item.additionalFields.postDateTime&step-functions.sort-order=desc).\n\n\n#### **Overview of solution**\n\n\nWhen you’re migrating from existing data warehouses to Amazon Redshift, your existing ETL processes are implemented as proprietary scripts. These scripts contain SQL statements and complex business logic such as if-then-else control flow logic, error reporting, and error handling. You can convert all these features to Amazon Redshift RSQL, which you can use to replace existing ETL and other automation scripts. To learn more about Amazon Redshift RSQL features, examples, and use cases, see [Accelerate your data warehouse migration to Amazon Redshift – Part 4](https://aws.amazon.com/cn/blogs/big-data/part-4-accelerate-your-data-warehouse-migration-to-amazon-redshift/).\n\n[Amazon Web Services Schema Conversion Tool](https://aws.amazon.com/cn/dms/schema-conversion-tool/) (Amazon Web Services SCT) can convert proprietary scripts to Amazon Redshift RSQL. Amazon Web Services SCT can automatically convert Teradata BTEQ scripts to Amazon Redshift RSQL. To learn more how to use Amazon Web Services SCT, see [Converting Teradata BTEQ scripts to Amazon Redshift RSQL with Amazon Web Services SCT](https://docs.aws.amazon.com/SchemaConversionTool/latest/userguide/CHAP-converting-bteq-rsql.html).\n\nThe goal of the solution presented in this post is to run complex ETL jobs implemented in Amazon Redshift RSQL scripts in the Amazon Web Services Cloud without having to manage any infrastructure. In addition to meeting functional requirements, this solution also provides full auditing and traceability of all ETL processes that you run.\n\nThe following diagram shows the final architecture.\n\n![image.png](https://dev-media.amazoncloud.cn/6149467d1af244419c3da8906fbcdb95_image.png)\n\nThe deployment is fully automated using[ Amazon Web Services Cloud Development Kit](https://aws.amazon.com/cn/cdk/) (Amazon Web Services CDK) and comprises of the following stacks:\n\n- **EcrRepositoryStack** – Creates a private [Amazon Elastic Container Registry](https://aws.amazon.com/cn/ecr/) (Amazon ECR) repository that hosts our Docker image with Amazon Redshift RSQL\n- **RsqlDockerImageStack** – Builds our Docker image asset and uploads it to the ECR repository\n- **VpcStack** – Creates a VPC with isolated subnets, creates an [Amazon Simple Storage Service](https://aws.amazon.com/cn/s3/) (Amazon S3) [VPC endpoint](https://docs.aws.amazon.com/vpc/latest/privatelink/what-is-privatelink.html) gateway, as well as Amazon ECR, Amazon Redshift, and [Amazon CloudWatch](https://aws.amazon.com/cn/cloudwatch/) VPC endpoint interfaces\n- **RedshiftStack** – Creates an Amazon Redshift cluster, enables encryption, enforces encryption in-transit, enables auditing, and deploys the Amazon Redshift cluster in isolated subnets\n- **BatchStack** – Creates a compute environment (using [Amazon Web Services Fargate](https://aws.amazon.com/cn/fargate/)), job queue, and job definition (using our Docker image with RSQL)\n- **S3Stack** – Creates data, scripts, and logging buckets; enables encryption at-rest; enforces secure transfer; enables object versioning; and disables public access\n- **SnsStack** – Creates an [Amazon Simple Notification Service](https://aws.amazon.com/cn/sns/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc) (Amazon SNS) topic and email subscription (email is passed as a parameter)\n- **StepFunctionsStack** – Creates a state machine to orchestrate serverless RSQL ETL jobs\n- **SampleDataDeploymentStack** – Deploys sample RSQL ETL scripts and sample TPC benchmark datasets\n\n\n#### **Prerequisites**\n\n\nYou should have the following prerequisites:\n\n- An [Amazon Web Services account](https://signin.aws.amazon.com/signin?redirect_uri=https%3A%2F%2Fconsole.aws.amazon.com%2Fconsole%2Fhome%3FhashArgs%3D%2523%26isauthcode%3Dtrue%26state%3DhashArgsFromTB_us-west-1_e15c5cafb69fd66d&client_id=arn%3Aaws%3Asignin%3A%3A%3Aconsole%2Fcanvas&forceMobileApp=0&code_challenge=OpjO0L4OdVuhD856sEcUC6bQLHaIn3s-aiTkQlpLxw4&code_challenge_method=SHA-256)\n- Amazon Linux 2 with [Amazon Web Services CDK](https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_prerequisites) and [Docker Engine](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/create-container-image.html#create-container-image-install-docker) installed\n\n\n#### **Deploy Amazon Web Services CDK stacks**\n\n\nTo deploy the serverless RSQL ETL framework solution, use the following code. Replace ```123456789012```with your Amazon Web Services account number, ```eu-west-1```with the Amazon Web Services Region to which you want deploy the solution, and ```your.email@example.com```with your email address to which ETL success and failure notifications are sent.\n```\ngit clone https://github.com/aws-samples/amazon-redshift-serverless-rsql-etl-framework\ncd amazon-redshift-serverless-rsql-etl-framework\nnpm install\n./cdk.sh 123456789012 eu-west-1 bootstrap\n./cdk.sh 123456789012 eu-west-1 deploy --all --parameters SnsStack:EmailAddressSubscription=your.email@example.com\n```\n\nThe whole process takes a few minutes. While Amazon Web Services CDK creates all the stacks, you can continue reading this post.\n\n\n#### **Create the RSQL container image**\n\n\nAmazon Web Services CDK creates an RSQL Docker image. This Docker image is the basic building block of our solution. All ETL processes run inside it. Amazon Web Services CDK creates the Docker image locally using Docker Engine and then uploads it to the Amazon ECR repository.\n\nThe Docker image is based on an Amazon Linux 2 Docker image. It has the following tools installed: the [Amazon Web Services Command Line Interface](https://aws.amazon.com/cn/cli/) (Amazon Web Services CLI), unixODBC, [Amazon Redshift ODBC driver](https://docs.aws.amazon.com/redshift/latest/mgmt/configure-odbc-connection.html), and Amazon Redshift RSQL. It also contains ```.odbc.ini```file, which defines the ```etl```profile, which is used to connect to the Amazon Redshift cluster. See the following code:\n\n```\nFROM amazonlinux:2\n\nENV AMAZON_REDSHIFT_ODBC_VERSION=1.4.52.1000\nENV AMAZON_REDSHIFT_RSQL_VERSION=1.0.4\n\nRUN yum install -y openssl gettext unixODBC awscli && \\\nyum clean all\n\nRUN rpm -i \\\nhttps://s3.amazonaws.com/redshift-downloads/drivers/odbc/${AMAZON_REDSHIFT_ODBC_VERSION}/AmazonRedshiftODBC-64-bit-${AMAZON_REDSHIFT_ODBC_VERSION}-1.x86_64.rpm \\\nhttps://s3.amazonaws.com/redshift-downloads/amazon-redshift-rsql/${AMAZON_REDSHIFT_RSQL_VERSION}/AmazonRedshiftRsql-${AMAZON_REDSHIFT_RSQL_VERSION}-1.x86_64.rpm\n\nCOPY .odbc.ini .odbc.ini\nCOPY fetch_and_run.sh /usr/local/bin/fetch_and_run.sh\n\nENV ODBCINI=.odbc.ini\nENV ODBCSYSINI=/opt/amazon/redshiftodbc/Setup\nENV AMAZONREDSHIFTODBCINI=/opt/amazon/redshiftodbc/lib/64/amazon.redshiftodbc.ini\n\nENTRYPOINT [\"/usr/local/bin/fetch_and_run.sh\"]\n```\n\nThe following code example shows the ```.odbc.ini```file. It defines an ```etl```profile, which uses an [Amazon Web Services Identity and Access Management](https://aws.amazon.com/cn/iam/) (IAM) role to get temporary cluster credentials to connect to Amazon Redshift. Amazon Web Services CDK creates this role for us. Because of this, we don’t need to hard-code credentials in a Docker image. The ```Database```, ```DbUser```, and ```ClusterID``` parameters are set in Amazon Web Services CDK. Also, Amazon Web Services CDK replaces the Region parameter at runtime with the Region to which you deploy the stacks.\n\n\n```\n[ODBC]\nTrace=no\n\n[etl]\nDriver=/opt/amazon/redshiftodbc/lib/64/libamazonredshiftodbc64.so\nDatabase=demo\nDbUser=etl\nClusterID=redshiftblogdemo\nRegion=eu-west-1\nIAM=1\n```\n\nFor more information about connecting to Amazon Redshift clusters with RSQL, see[ Connect to a cluster with Amazon Redshift RSQL](https://docs.aws.amazon.com/redshift/latest/mgmt/rsql-query-tool-starting-tool-connection.html).\n\nOur Docker image implements a well-known fetch and run integration pattern. To learn more about this pattern, see [Creating a Simple “Fetch & Run” Amazon Web Services Batch Job](https://aws.amazon.com/cn/blogs/compute/creating-a-simple-fetch-and-run-aws-batch-job/). The Docker image fetches the ETL script from an external repository, and then runs it. Amazon Web Services CDK passes the information about the ETL script to run to the Docker container at runtime as an Amazon Web Services Batch job parameter. The job parameter is exposed to the container as an environment variable called ```BATCH_SCRIPT_LOCATION```. Our job also expects two other environment variables: ```DATA_BUCKET_NAME```, which is the name of the S3 data bucket, and ```COPY_IAM_ROLE_ARN```, which is the Amazon Redshift IAM role used for the COPY command to load the data into Amazon Redshift. All environment variables are set automatically by Amazon Web Services CDK. The ```fetch_and_run.sh``` script is the entry point of the Docker container. See the following code:\n\n```\n#!/bin/bash\n\n# This script expects the following env variables to be set:\n# BATCH_SCRIPT_LOCATION - full S3 path to RSQL script to run\n# DATA_BUCKET_NAME - S3 bucket name with the data\n# COPY_IAM_ROLE_ARN - IAM role ARN that will be used to copy the data from S3 to Redshift\n\nPATH=\"/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:/usr/local/sbin\"\n\nif [ -z \"${BATCH_SCRIPT_LOCATION}\" ] || [ -z \"${DATA_BUCKET_NAME}\" ] || [ -z \"${COPY_IAM_ROLE_ARN}\" ]; then\n echo \"BATCH_SCRIPT_LOCATION/DATA_BUCKET_NAME/COPY_IAM_ROLE_ARN not set. No script to run.\"\n exit 1\nfi\n\n# download script to a temp file\nTEMP_SCRIPT_FILE=$(mktemp)\naws s3 cp ${BATCH_SCRIPT_LOCATION} ${TEMP_SCRIPT_FILE}\n\n# execute script\n# envsubst will replace ${COPY_IAM_ROLE_ARN} and ${COPY_IAM_ROLE_ARN} placeholders with actual values\nenvsubst < ${TEMP_SCRIPT_FILE} | rsql -D etl\n\nexit $?\n\n```\n\n\n#### **Create Amazon Web Services Batch resources**\n\n\nNext, Amazon Web Services CDK creates the Amazon Web Services Batch compute environment, job queue, and job definition. As a fully managed service, Amazon Web Services Batch helps you run batch computing workloads of any scale. Amazon Web Services CDK creates a Fargate serverless compute environment for us. The compute environment deploys inside the same VPC as the Amazon Redshift cluster, inside the isolated subnets. The job definition uses our Docker image with Amazon Redshift RSQL.\n\nThis step turns Amazon Redshift RSQL into a serverless service. You can build complex ETL workflows based on this generic job.\n\n\n#### **Create a Step Functions state machine**\n\n\nAmazon Web Services CDK then moves to the deployment of the Step Functions state machine. Step Functions enables you to build complex workflows in a visual way directly in your browser. This service supports over 9,000 API actions from over 200 Amazon Web Services services.\n\nYou can use [Amazon States Language](https://states-language.net/) to create a state machine on the Step Functions console. The Amazon States Language is a JSON-based, structured language used to define your state machine. You can also build them programmatically using Amazon Web Services CDK, as I have done for this post.\n\nAfter Amazon Web Services CDK finishes, a new state machine is created in your account called ```ServerlessRSQLETLFramework```. To run it, complete the following steps:\n\n1. Navigate to the Step Functions console.\n2. Choose the function to open the details page.\n3. Choose **Edit**, and then choose **Workflow Studio New**.\nThe following screenshot shows our state machine.\n\n![image.png](https://dev-media.amazoncloud.cn/1db278ba8190452f827f462759a01cb6_image.png)\n\n4. Choose **Cancel** to leave Workflow Studio, then choose **Cancel** again to leave the edit mode.\nYou will be brought back to the details page.\n5. Choose **Start execution**.\nA dialog box appears. By default, the **Name** parameter is set to a random identifier, and the **Input** parameter is set to a sample JSON document.\n6. Delete the **Input** parameter and choose **Start execution** to start the state machine.\n\nThe Graph view on the details page updates in real time. The state machine starts with a parallel state with two branches. In the left branch, the first job loads customer data into staging table, then the second job merges new and existing customer records. In the right branch, two smaller tables for regions and nations are loaded and then merged one after another. The parallel state waits until all branches are complete before moving to the vacuum-analyze state, which runs VACUUM and ANALYZE commands on Amazon Redshift. The sample state machine also implements the Amazon SNS Publish API actions to send notifications about success or failure.\n\nFrom the Graph view, you can check the status of each state by choosing it. Every state that uses an external resource has a link to it on the Details tab. In our example, next to every Amazon Web Services Batch Job state, you can see a link to the Amazon Web Services Batch Job details page. Here, you can view the status, runtime, parameters, IAM roles, link to [Amazon CloudWatch Logs](https://aws.amazon.com/cn/cloudwatch/) with the logs produced by ETL scripts, and more.\n\n![image.png](https://dev-media.amazoncloud.cn/fdc414e770c24fedb14a9dcdf339eca8_image.png)\n\n\n#### **Clean up**\n\n\nTo avoid ongoing charges for the resources that you created, delete them. Amazon Web Services CDK deletes all resources except data resources such as S3 buckets and Amazon ECR repositories.\n\n1. First, delete all Amazon Web Services CDK stacks. In the following code, provide your own Amazon Web Services account and Amazon Web Services Region:\n ```\n ./cdk.sh 123456789012 eu-west-1 destroy --all\n ```\n\n2. On the Amazon S3 console, empty and delete buckets with names starting with:\n a. ```s3stack-rsqletldemodata```\n b. ```s3stack-rsqletldemoscripts```\n c. ```s3stack-rsqletldemologging```\n3. Finally, on the Amazon ECR console, delete repositories with names starting with:\n a. ```ecrrepositorystack-amazonlinuxrsql```\n b. ```cdk-container-assets```\n\n\n#### **Next steps**\n\n\nHere are some ideas of additional enhancements that you can add to the described solution.\n\nYou can break large complex state machines into smaller building blocks by creating self-contained state machines. In our example, you could create state machines for every pair of copy and merge jobs. You could create three such state machines: Copy and Merge Customer, Copy and Merge Region, and Copy and Merge Nation, and then call them from the main state machine. For complex workflows, a different team can work on each sub-state machine in parallel. Also, this pattern promotes reuse of existing components, best practices, and security mechanisms.\n\nYou can use Amazon S3 Object Functions or Amazon S3 EventBridge notifications to start a state machine automatically after you upload a file to an S3 bucket. To learn more about Amazon S3 integration with [Amazon EventBridge](https://aws.amazon.com/cn/eventbridge/), see [Use Amazon S3 Event Notifications with Amazon EventBridge](https://aws.amazon.com/cn/blogs/aws/new-use-amazon-s3-event-notifications-with-amazon-eventbridge/). This way you can achieve a fully event-driven serverless ETL orchestration framework.\n\n\n#### **Summary**\n\n\nYou can use Amazon Redshift RSQL, Amazon Web Services Batch, and Step Functions to create modern, serverless, and cost-effective ETL workflows. There is no infrastructure to manage, and Amazon Redshift RSQL works as a serverless RSQL service. In this post, we demonstrated how to use this serverless RSQL service to build more complex ETL workflows with Step Functions.\n\nStep Functions integrates natively with over 200 Amazon Web Services services. This opens a new world of possibilities to Amazon Web Services customers and partners, who can integrate their processes with other data, analytics, machine learning, and compute services such as Amazon S3, [Amazon DynamoDB](https://aws.amazon.com/cn/dynamodb/), [Amazon Web Services Glue](https://aws.amazon.com/cn/glue/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc), [Amazon OpenSearch Service](https://aws.amazon.com/cn/opensearch-service/) (successor to Amazon Elasticsearch Service), [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/), [Amazon Web Services Lambda](https://aws.amazon.com/cn/lambda/), and more. The additional advantage of Step Functions and Amazon Web Services Batch is that you have full traceability and auditing out of the box. Step Functions shows Graph or Event views together with a complete history for all state machine runs.\n\nIn this post, I used RSQL automation scripts as the building blocks of ETL workflows. Using RSQL is a common integration pattern that we see for customers migrating from Teradata BTEQ scripts. However, if you have simple ETL or ELT processes that can be written as plain SQL, you can invoke the [Amazon Redshift Data API](https://docs.aws.amazon.com/redshift/latest/mgmt/data-api.html) directly from Step Functions. To learn more about this integration pattern, see[ ETL orchestration using the Amazon Redshift Data API and Amazon Web Services Step Functions with Amazon Web Services SDK integration](https://aws.amazon.com/cn/blogs/big-data/etl-orchestration-using-the-amazon-redshift-data-api-and-aws-step-functions-with-aws-sdk-integration/).\n\n\n#### **About the author**\n\n\n![image.png](https://dev-media.amazoncloud.cn/ce9e6e8e9c614408bfa9f7b530520916_image.png)\n\n**Lukasz** is a Principal Software Dev Engineer working in the Amazon Web Services DMA team. Lukasz helps customers move their workloads to Amazon Web Services and specializes in migrating data warehouses and data lakes to Amazon Web Services . In his free time, Lukasz enjoys learning new human languages.\n\n\n\n\n\n\n\n\n","render":"<p><a href=\"https://docs.aws.amazon.com/redshift/latest/mgmt/rsql-query-tool.html\" target=\"_blank\">Amazon Redshift RSQL</a> is a command-line client for interacting with <a href=\"https://aws.amazon.com/cn/redshift/\" target=\"_blank\">Amazon Redshift</a> clusters and databases. You can connect to an Amazon Redshift cluster, describe database objects, query data, and view query results in various output formats. You can use enhanced control flow commands to replace existing extract, transform, load (ETL) and automation scripts.</p>\n<p>This post explains how you can create a fully serverless and cost-effective Amazon Redshift ETL orchestration framework. To achieve this, you can use Amazon Redshift RSQL and Amazon Web Services services such as <a href=\"https://aws.amazon.com/cn/batch/\" target=\"_blank\">Amazon Web Services Batch</a> and <a href=\"https://aws.amazon.com/cn/step-functions/?step-functions.sort-by=item.additionalFields.postDateTime&amp;step-functions.sort-order=desc\" target=\"_blank\">Amazon Web Services Step Functions</a>.</p>\n<h4><a id=\"Overview_of_solution_5\"></a><strong>Overview of solution</strong></h4>\n<p>When you’re migrating from existing data warehouses to Amazon Redshift, your existing ETL processes are implemented as proprietary scripts. These scripts contain SQL statements and complex business logic such as if-then-else control flow logic, error reporting, and error handling. You can convert all these features to Amazon Redshift RSQL, which you can use to replace existing ETL and other automation scripts. To learn more about Amazon Redshift RSQL features, examples, and use cases, see <a href=\"https://aws.amazon.com/cn/blogs/big-data/part-4-accelerate-your-data-warehouse-migration-to-amazon-redshift/\" target=\"_blank\">Accelerate your data warehouse migration to Amazon Redshift – Part 4</a>.</p>\n<p><a href=\"https://aws.amazon.com/cn/dms/schema-conversion-tool/\" target=\"_blank\">Amazon Web Services Schema Conversion Tool</a> (Amazon Web Services SCT) can convert proprietary scripts to Amazon Redshift RSQL. Amazon Web Services SCT can automatically convert Teradata BTEQ scripts to Amazon Redshift RSQL. To learn more how to use Amazon Web Services SCT, see <a href=\"https://docs.aws.amazon.com/SchemaConversionTool/latest/userguide/CHAP-converting-bteq-rsql.html\" target=\"_blank\">Converting Teradata BTEQ scripts to Amazon Redshift RSQL with Amazon Web Services SCT</a>.</p>\n<p>The goal of the solution presented in this post is to run complex ETL jobs implemented in Amazon Redshift RSQL scripts in the Amazon Web Services Cloud without having to manage any infrastructure. In addition to meeting functional requirements, this solution also provides full auditing and traceability of all ETL processes that you run.</p>\n<p>The following diagram shows the final architecture.</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/6149467d1af244419c3da8906fbcdb95_image.png\" alt=\"image.png\" /></p>\n<p>The deployment is fully automated using<a href=\"https://aws.amazon.com/cn/cdk/\" target=\"_blank\"> Amazon Web Services Cloud Development Kit</a> (Amazon Web Services CDK) and comprises of the following stacks:</p>\n<ul>\n<li><strong>EcrRepositoryStack</strong> – Creates a private <a href=\"https://aws.amazon.com/cn/ecr/\" target=\"_blank\">Amazon Elastic Container Registry</a> (Amazon ECR) repository that hosts our Docker image with Amazon Redshift RSQL</li>\n<li><strong>RsqlDockerImageStack</strong> – Builds our Docker image asset and uploads it to the ECR repository</li>\n<li><strong>VpcStack</strong> – Creates a VPC with isolated subnets, creates an <a href=\"https://aws.amazon.com/cn/s3/\" target=\"_blank\">Amazon Simple Storage Service</a> (Amazon S3) <a href=\"https://docs.aws.amazon.com/vpc/latest/privatelink/what-is-privatelink.html\" target=\"_blank\">VPC endpoint</a> gateway, as well as Amazon ECR, Amazon Redshift, and <a href=\"https://aws.amazon.com/cn/cloudwatch/\" target=\"_blank\">Amazon CloudWatch</a> VPC endpoint interfaces</li>\n<li><strong>RedshiftStack</strong> – Creates an Amazon Redshift cluster, enables encryption, enforces encryption in-transit, enables auditing, and deploys the Amazon Redshift cluster in isolated subnets</li>\n<li><strong>BatchStack</strong> – Creates a compute environment (using <a href=\"https://aws.amazon.com/cn/fargate/\" target=\"_blank\">Amazon Web Services Fargate</a>), job queue, and job definition (using our Docker image with RSQL)</li>\n<li><strong>S3Stack</strong> – Creates data, scripts, and logging buckets; enables encryption at-rest; enforces secure transfer; enables object versioning; and disables public access</li>\n<li><strong>SnsStack</strong> – Creates an <a href=\"https://aws.amazon.com/cn/sns/?whats-new-cards.sort-by=item.additionalFields.postDateTime&amp;whats-new-cards.sort-order=desc\" target=\"_blank\">Amazon Simple Notification Service</a> (Amazon SNS) topic and email subscription (email is passed as a parameter)</li>\n<li><strong>StepFunctionsStack</strong> – Creates a state machine to orchestrate serverless RSQL ETL jobs</li>\n<li><strong>SampleDataDeploymentStack</strong> – Deploys sample RSQL ETL scripts and sample TPC benchmark datasets</li>\n</ul>\n<h4><a id=\"Prerequisites_31\"></a><strong>Prerequisites</strong></h4>\n<p>You should have the following prerequisites:</p>\n<ul>\n<li>An <a href=\"https://signin.aws.amazon.com/signin?redirect_uri=https%3A%2F%2Fconsole.aws.amazon.com%2Fconsole%2Fhome%3FhashArgs%3D%2523%26isauthcode%3Dtrue%26state%3DhashArgsFromTB_us-west-1_e15c5cafb69fd66d&amp;client_id=arn%3Aaws%3Asignin%3A%3A%3Aconsole%2Fcanvas&amp;forceMobileApp=0&amp;code_challenge=OpjO0L4OdVuhD856sEcUC6bQLHaIn3s-aiTkQlpLxw4&amp;code_challenge_method=SHA-256\" target=\"_blank\">Amazon Web Services account</a></li>\n<li>Amazon Linux 2 with <a href=\"https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_prerequisites\" target=\"_blank\">Amazon Web Services CDK</a> and <a href=\"https://docs.aws.amazon.com/AmazonECS/latest/developerguide/create-container-image.html#create-container-image-install-docker\" target=\"_blank\">Docker Engine</a> installed</li>\n</ul>\n<h4><a id=\"Deploy_Amazon_Web_Services_CDK_stacks_40\"></a><strong>Deploy Amazon Web Services CDK stacks</strong></h4>\n<p>To deploy the serverless RSQL ETL framework solution, use the following code. Replace <code>123456789012</code>with your Amazon Web Services account number, <code>eu-west-1</code>with the Amazon Web Services Region to which you want deploy the solution, and <code>your.email@example.com</code>with your email address to which ETL success and failure notifications are sent.</p>\n<pre><code class=\"lang-\">git clone https://github.com/aws-samples/amazon-redshift-serverless-rsql-etl-framework\ncd amazon-redshift-serverless-rsql-etl-framework\nnpm install\n./cdk.sh 123456789012 eu-west-1 bootstrap\n./cdk.sh 123456789012 eu-west-1 deploy --all --parameters SnsStack:EmailAddressSubscription=your.email@example.com\n</code></pre>\n<p>The whole process takes a few minutes. While Amazon Web Services CDK creates all the stacks, you can continue reading this post.</p>\n<h4><a id=\"Create_the_RSQL_container_image_55\"></a><strong>Create the RSQL container image</strong></h4>\n<p>Amazon Web Services CDK creates an RSQL Docker image. This Docker image is the basic building block of our solution. All ETL processes run inside it. Amazon Web Services CDK creates the Docker image locally using Docker Engine and then uploads it to the Amazon ECR repository.</p>\n<p>The Docker image is based on an Amazon Linux 2 Docker image. It has the following tools installed: the <a href=\"https://aws.amazon.com/cn/cli/\" target=\"_blank\">Amazon Web Services Command Line Interface</a> (Amazon Web Services CLI), unixODBC, <a href=\"https://docs.aws.amazon.com/redshift/latest/mgmt/configure-odbc-connection.html\" target=\"_blank\">Amazon Redshift ODBC driver</a>, and Amazon Redshift RSQL. It also contains <code>.odbc.ini</code>file, which defines the <code>etl</code>profile, which is used to connect to the Amazon Redshift cluster. See the following code:</p>\n<pre><code class=\"lang-\">FROM amazonlinux:2\n\nENV AMAZON_REDSHIFT_ODBC_VERSION=1.4.52.1000\nENV AMAZON_REDSHIFT_RSQL_VERSION=1.0.4\n\nRUN yum install -y openssl gettext unixODBC awscli &amp;&amp; \\\nyum clean all\n\nRUN rpm -i \\\nhttps://s3.amazonaws.com/redshift-downloads/drivers/odbc/${AMAZON_REDSHIFT_ODBC_VERSION}/AmazonRedshiftODBC-64-bit-${AMAZON_REDSHIFT_ODBC_VERSION}-1.x86_64.rpm \\\nhttps://s3.amazonaws.com/redshift-downloads/amazon-redshift-rsql/${AMAZON_REDSHIFT_RSQL_VERSION}/AmazonRedshiftRsql-${AMAZON_REDSHIFT_RSQL_VERSION}-1.x86_64.rpm\n\nCOPY .odbc.ini .odbc.ini\nCOPY fetch_and_run.sh /usr/local/bin/fetch_and_run.sh\n\nENV ODBCINI=.odbc.ini\nENV ODBCSYSINI=/opt/amazon/redshiftodbc/Setup\nENV AMAZONREDSHIFTODBCINI=/opt/amazon/redshiftodbc/lib/64/amazon.redshiftodbc.ini\n\nENTRYPOINT [&quot;/usr/local/bin/fetch_and_run.sh&quot;]\n</code></pre>\n<p>The following code example shows the <code>.odbc.ini</code>file. It defines an <code>etl</code>profile, which uses an <a href=\"https://aws.amazon.com/cn/iam/\" target=\"_blank\">Amazon Web Services Identity and Access Management</a> (IAM) role to get temporary cluster credentials to connect to Amazon Redshift. Amazon Web Services CDK creates this role for us. Because of this, we don’t need to hard-code credentials in a Docker image. The <code>Database</code>, <code>DbUser</code>, and <code>ClusterID</code> parameters are set in Amazon Web Services CDK. Also, Amazon Web Services CDK replaces the Region parameter at runtime with the Region to which you deploy the stacks.</p>\n<pre><code class=\"lang-\">[ODBC]\nTrace=no\n\n[etl]\nDriver=/opt/amazon/redshiftodbc/lib/64/libamazonredshiftodbc64.so\nDatabase=demo\nDbUser=etl\nClusterID=redshiftblogdemo\nRegion=eu-west-1\nIAM=1\n</code></pre>\n<p>For more information about connecting to Amazon Redshift clusters with RSQL, see<a href=\"https://docs.aws.amazon.com/redshift/latest/mgmt/rsql-query-tool-starting-tool-connection.html\" target=\"_blank\"> Connect to a cluster with Amazon Redshift RSQL</a>.</p>\n<p>Our Docker image implements a well-known fetch and run integration pattern. To learn more about this pattern, see <a href=\"https://aws.amazon.com/cn/blogs/compute/creating-a-simple-fetch-and-run-aws-batch-job/\" target=\"_blank\">Creating a Simple “Fetch &amp; Run” Amazon Web Services Batch Job</a>. The Docker image fetches the ETL script from an external repository, and then runs it. Amazon Web Services CDK passes the information about the ETL script to run to the Docker container at runtime as an Amazon Web Services Batch job parameter. The job parameter is exposed to the container as an environment variable called <code>BATCH_SCRIPT_LOCATION</code>. Our job also expects two other environment variables: <code>DATA_BUCKET_NAME</code>, which is the name of the S3 data bucket, and <code>COPY_IAM_ROLE_ARN</code>, which is the Amazon Redshift IAM role used for the COPY command to load the data into Amazon Redshift. All environment variables are set automatically by Amazon Web Services CDK. The <code>fetch_and_run.sh</code> script is the entry point of the Docker container. See the following code:</p>\n<pre><code class=\"lang-\">#!/bin/bash\n\n# This script expects the following env variables to be set:\n# BATCH_SCRIPT_LOCATION - full S3 path to RSQL script to run\n# DATA_BUCKET_NAME - S3 bucket name with the data\n# COPY_IAM_ROLE_ARN - IAM role ARN that will be used to copy the data from S3 to Redshift\n\nPATH=&quot;/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:/usr/local/sbin&quot;\n\nif [ -z &quot;${BATCH_SCRIPT_LOCATION}&quot; ] || [ -z &quot;${DATA_BUCKET_NAME}&quot; ] || [ -z &quot;${COPY_IAM_ROLE_ARN}&quot; ]; then\n echo &quot;BATCH_SCRIPT_LOCATION/DATA_BUCKET_NAME/COPY_IAM_ROLE_ARN not set. No script to run.&quot;\n exit 1\nfi\n\n# download script to a temp file\nTEMP_SCRIPT_FILE=$(mktemp)\naws s3 cp ${BATCH_SCRIPT_LOCATION} ${TEMP_SCRIPT_FILE}\n\n# execute script\n# envsubst will replace ${COPY_IAM_ROLE_ARN} and ${COPY_IAM_ROLE_ARN} placeholders with actual values\nenvsubst &lt; ${TEMP_SCRIPT_FILE} | rsql -D etl\n\nexit $?\n\n</code></pre>\n<h4><a id=\"Create_Amazon_Web_Services_Batch_resources_133\"></a><strong>Create Amazon Web Services Batch resources</strong></h4>\n<p>Next, Amazon Web Services CDK creates the Amazon Web Services Batch compute environment, job queue, and job definition. As a fully managed service, Amazon Web Services Batch helps you run batch computing workloads of any scale. Amazon Web Services CDK creates a Fargate serverless compute environment for us. The compute environment deploys inside the same VPC as the Amazon Redshift cluster, inside the isolated subnets. The job definition uses our Docker image with Amazon Redshift RSQL.</p>\n<p>This step turns Amazon Redshift RSQL into a serverless service. You can build complex ETL workflows based on this generic job.</p>\n<h4><a id=\"Create_a_Step_Functions_state_machine_141\"></a><strong>Create a Step Functions state machine</strong></h4>\n<p>Amazon Web Services CDK then moves to the deployment of the Step Functions state machine. Step Functions enables you to build complex workflows in a visual way directly in your browser. This service supports over 9,000 API actions from over 200 Amazon Web Services services.</p>\n<p>You can use <a href=\"https://states-language.net/\" target=\"_blank\">Amazon States Language</a> to create a state machine on the Step Functions console. The Amazon States Language is a JSON-based, structured language used to define your state machine. You can also build them programmatically using Amazon Web Services CDK, as I have done for this post.</p>\n<p>After Amazon Web Services CDK finishes, a new state machine is created in your account called <code>ServerlessRSQLETLFramework</code>. To run it, complete the following steps:</p>\n<ol>\n<li>Navigate to the Step Functions console.</li>\n<li>Choose the function to open the details page.</li>\n<li>Choose <strong>Edit</strong>, and then choose <strong>Workflow Studio New</strong>.<br />\nThe following screenshot shows our state machine.</li>\n</ol>\n<p><img src=\"https://dev-media.amazoncloud.cn/1db278ba8190452f827f462759a01cb6_image.png\" alt=\"image.png\" /></p>\n<ol start=\"4\">\n<li>Choose <strong>Cancel</strong> to leave Workflow Studio, then choose <strong>Cancel</strong> again to leave the edit mode.<br />\nYou will be brought back to the details page.</li>\n<li>Choose <strong>Start execution</strong>.<br />\nA dialog box appears. By default, the <strong>Name</strong> parameter is set to a random identifier, and the <strong>Input</strong> parameter is set to a sample JSON document.</li>\n<li>Delete the <strong>Input</strong> parameter and choose <strong>Start execution</strong> to start the state machine.</li>\n</ol>\n<p>The Graph view on the details page updates in real time. The state machine starts with a parallel state with two branches. In the left branch, the first job loads customer data into staging table, then the second job merges new and existing customer records. In the right branch, two smaller tables for regions and nations are loaded and then merged one after another. The parallel state waits until all branches are complete before moving to the vacuum-analyze state, which runs VACUUM and ANALYZE commands on Amazon Redshift. The sample state machine also implements the Amazon SNS Publish API actions to send notifications about success or failure.</p>\n<p>From the Graph view, you can check the status of each state by choosing it. Every state that uses an external resource has a link to it on the Details tab. In our example, next to every Amazon Web Services Batch Job state, you can see a link to the Amazon Web Services Batch Job details page. Here, you can view the status, runtime, parameters, IAM roles, link to <a href=\"https://aws.amazon.com/cn/cloudwatch/\" target=\"_blank\">Amazon CloudWatch Logs</a> with the logs produced by ETL scripts, and more.</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/fdc414e770c24fedb14a9dcdf339eca8_image.png\" alt=\"image.png\" /></p>\n<h4><a id=\"Clean_up_170\"></a><strong>Clean up</strong></h4>\n<p>To avoid ongoing charges for the resources that you created, delete them. Amazon Web Services CDK deletes all resources except data resources such as S3 buckets and Amazon ECR repositories.</p>\n<ol>\n<li>First, delete all Amazon Web Services CDK stacks. In the following code, provide your own Amazon Web Services account and Amazon Web Services Region:</li>\n</ol>\n<pre><code class=\"lang-\">./cdk.sh 123456789012 eu-west-1 destroy --all\n</code></pre>\n<ol start=\"2\">\n<li>On the Amazon S3 console, empty and delete buckets with names starting with:<br />\na. <code>s3stack-rsqletldemodata</code><br />\nb. <code>s3stack-rsqletldemoscripts</code><br />\nc. <code>s3stack-rsqletldemologging</code></li>\n<li>Finally, on the Amazon ECR console, delete repositories with names starting with:<br />\na. <code>ecrrepositorystack-amazonlinuxrsql</code><br />\nb. <code>cdk-container-assets</code></li>\n</ol>\n<h4><a id=\"Next_steps_189\"></a><strong>Next steps</strong></h4>\n<p>Here are some ideas of additional enhancements that you can add to the described solution.</p>\n<p>You can break large complex state machines into smaller building blocks by creating self-contained state machines. In our example, you could create state machines for every pair of copy and merge jobs. You could create three such state machines: Copy and Merge Customer, Copy and Merge Region, and Copy and Merge Nation, and then call them from the main state machine. For complex workflows, a different team can work on each sub-state machine in parallel. Also, this pattern promotes reuse of existing components, best practices, and security mechanisms.</p>\n<p>You can use Amazon S3 Object Functions or Amazon S3 EventBridge notifications to start a state machine automatically after you upload a file to an S3 bucket. To learn more about Amazon S3 integration with <a href=\"https://aws.amazon.com/cn/eventbridge/\" target=\"_blank\">Amazon EventBridge</a>, see <a href=\"https://aws.amazon.com/cn/blogs/aws/new-use-amazon-s3-event-notifications-with-amazon-eventbridge/\" target=\"_blank\">Use Amazon S3 Event Notifications with Amazon EventBridge</a>. This way you can achieve a fully event-driven serverless ETL orchestration framework.</p>\n<h4><a id=\"Summary_199\"></a><strong>Summary</strong></h4>\n<p>You can use Amazon Redshift RSQL, Amazon Web Services Batch, and Step Functions to create modern, serverless, and cost-effective ETL workflows. There is no infrastructure to manage, and Amazon Redshift RSQL works as a serverless RSQL service. In this post, we demonstrated how to use this serverless RSQL service to build more complex ETL workflows with Step Functions.</p>\n<p>Step Functions integrates natively with over 200 Amazon Web Services services. This opens a new world of possibilities to Amazon Web Services customers and partners, who can integrate their processes with other data, analytics, machine learning, and compute services such as Amazon S3, <a href=\"https://aws.amazon.com/cn/dynamodb/\" target=\"_blank\">Amazon DynamoDB</a>, <a href=\"https://aws.amazon.com/cn/glue/?whats-new-cards.sort-by=item.additionalFields.postDateTime&amp;whats-new-cards.sort-order=desc\" target=\"_blank\">Amazon Web Services Glue</a>, <a href=\"https://aws.amazon.com/cn/opensearch-service/\" target=\"_blank\">Amazon OpenSearch Service</a> (successor to Amazon Elasticsearch Service), <a href=\"https://aws.amazon.com/cn/sagemaker/\" target=\"_blank\">Amazon SageMaker</a>, <a href=\"https://aws.amazon.com/cn/lambda/\" target=\"_blank\">Amazon Web Services Lambda</a>, and more. The additional advantage of Step Functions and Amazon Web Services Batch is that you have full traceability and auditing out of the box. Step Functions shows Graph or Event views together with a complete history for all state machine runs.</p>\n<p>In this post, I used RSQL automation scripts as the building blocks of ETL workflows. Using RSQL is a common integration pattern that we see for customers migrating from Teradata BTEQ scripts. However, if you have simple ETL or ELT processes that can be written as plain SQL, you can invoke the <a href=\"https://docs.aws.amazon.com/redshift/latest/mgmt/data-api.html\" target=\"_blank\">Amazon Redshift Data API</a> directly from Step Functions. To learn more about this integration pattern, see<a href=\"https://aws.amazon.com/cn/blogs/big-data/etl-orchestration-using-the-amazon-redshift-data-api-and-aws-step-functions-with-aws-sdk-integration/\" target=\"_blank\"> ETL orchestration using the Amazon Redshift Data API and Amazon Web Services Step Functions with Amazon Web Services SDK integration</a>.</p>\n<h4><a id=\"About_the_author_209\"></a><strong>About the author</strong></h4>\n<p><img src=\"https://dev-media.amazoncloud.cn/ce9e6e8e9c614408bfa9f7b530520916_image.png\" alt=\"image.png\" /></p>\n<p><strong>Lukasz</strong> is a Principal Software Dev Engineer working in the Amazon Web Services DMA team. Lukasz helps customers move their workloads to Amazon Web Services and specializes in migrating data warehouses and data lakes to Amazon Web Services . In his free time, Lukasz enjoys learning new human languages.</p>\n"}
目录
亚马逊云科技解决方案 基于行业客户应用场景及技术领域的解决方案
联系亚马逊云科技专家
亚马逊云科技解决方案
基于行业客户应用场景及技术领域的解决方案
联系专家
0
目录
关闭
contact-us