Migrate from Snowflake to Amazon Redshift using AWS Glue Python shell

{"value":"As the most widely used cloud data warehouse, [Amazon Redshift](http://aws.amazon.com/redshift) makes it simple and cost-effective to analyze your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. [Tens of thousands of customers](https://aws.amazon.com/redshift/customer-success/) use Amazon Redshift to analyze exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics without having to manage the data warehouse infrastructure. It natively integrates with other AWS services, facilitating the process of building enterprise-grade analytics applications in a manner that is not only cost-effective, but also avoids point solutions.\n\nWe are continuously innovating and releasing new features of Amazon Redshift, enabling the implementation of a wide range of data use cases and meeting requirements with performance and scale. For example, [Amazon Redshift Serverless](https://aws.amazon.com/redshift/redshift-serverless/) allows you to run and scale analytics workloads without having to provision and manage data warehouse clusters. Other features that help power analytics at scale with Amazon Redshift include automatic concurrency scaling for read and write queries, automatic workload management (WLM) for concurrency scaling, automatic table optimization, the new RA3 instances with managed storage to [scale cloud data warehouses and reduce costs](https://aws.amazon.com/blogs/big-data/scale-your-cloud-data-warehouse-and-reduce-costs-with-the-new-amazon-redshift-ra3-nodes-with-managed-storage/), cross-Region data sharing, data exchange, and the SUPER data type to store semi-structured data or documents as values. For the latest feature releases for Amazon Redshift, see [Amazon Redshift What’s New](https://aws.amazon.com/redshift/whats-new/?amazon-redshift-whats-new.sort-by=item.additionalFields.postDateTime&amazon-redshift-whats-new.sort-order=desc). In addition to improving performance and scale, you can also gain up to [three times better price performance](https://aws.amazon.com/blogs/big-data/get-up-to-3x-better-price-performance-with-amazon-redshift-than-other-cloud-data-warehouses/) with Amazon Redshift than other cloud data warehouses.\n\nTo take advantage of the performance, security, and scale of Amazon Redshift, customers are looking to migrate their data from their existing cloud warehouse in a way that is both cost optimized and performant. This post describes how to migrate a large volume of data from Snowflake to Amazon Redshift using AWS Glue Python shell in a manner that meets both these goals.\n\n[AWS Glue](https://aws.amazon.com/glue) is serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning (ML), and application development. AWS Glue provides all the capabilities needed for data integration, allowing you to analyze your data in minutes instead of weeks or months. AWS Glue supports the ability to use a Python shell job to run Python scripts as a shell, enabling you to author ETL processes in a familiar language. 
In addition, AWS Glue allows you to manage ETL jobs using [AWS Glue workflows](https://docs.aws.amazon.com/glue/latest/dg/workflows_overview.html), [Amazon Managed Workflows for Apache Airflow](https://docs.aws.amazon.com/mwaa/latest/userguide/what-is-mwaa.html) (Amazon MWAA), and [AWS Step Functions](https://aws.amazon.com/step-functions/), automating and facilitating the orchestration of ETL steps.\n\n\n#### **Solution overview**\n\n\nThe following architecture shows how an AWS Glue Python shell job migrates the data from Snowflake to Amazon Redshift in this solution.\n\n![image.png](1)\n\nThe solution is comprised of two stages:\n\n- **Extract** – The first part of the solution extracts data from Snowflake into an [Amazon Simple Storage Service](https://aws.amazon.com/s3/) (Amazon S3) data lake\n- **Load** – The second part of the solution reads the data from the same S3 bucket and loads it into Amazon Redshift\n\nFor both stages, we connect the AWS Glue Python shell jobs to Snowflake and Amazon Redshift using database connectors for Python. The first AWS Glue Python shell job reads a SQL file from an S3 bucket to run the relevant COPY commands on the Snowflake database using Snowflake compute capacity and parallelism to migrate the data to Amazon S3. When this is complete, the second AWS Glue Python shell job reads another SQL file, and runs the corresponding COPY commands on the Amazon Redshift database using Redshift compute capacity and parallelism to load the data from the same S3 bucket.\n\nBoth jobs are orchestrated using AWS Glue workflows, as shown in the following screenshot. The workflow pushes data processing logic down to the respective data warehouses by running COPY commands on the databases themselves, minimizing the processing capacity required by AWS Glue to just the resources needed to run the Python scripts. The COPY commands load data in parallel both to and from Amazon S3, providing one of the fastest and most scalable mechanisms to transfer data from Snowflake to Amazon Redshift.\n\nBecause all heavy lifting around data processing is pushed down to the data warehouses, this solution is designed to provide a cost-optimized and highly performant mechanism to migrate a large volume of data from Snowflake to Amazon Redshift with ease.\n\n![image.png](2)\n\nThe entire solution is packaged in an [AWS CloudFormation](https://aws.amazon.com/cloudformation/) template for simplicity of deployment and automatic provisioning of most of the required resources and permissions.\n\nThe high-level steps to implement the solution are as follows:\n\n1. Generate the Snowflake SQL file.\n2. Deploy the CloudFormation template to provision the required resources and permissions.\n3. Provide Snowflake access to newly created S3 bucket.\n4. Run the AWS Glue workflow to migrate the data.\n\n\n#### **Prerequisites**\n\n\nBefore you get started, you can optionally build the latest version of the Snowflake Connector for Python package locally and generate the wheel (.whl) package. 
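To make the mechanics of these jobs concrete, the following is a minimal sketch of what the extract (unload) job can look like. It is not the exact script deployed by the CloudFormation template (that script is linked in the appendix); the job argument names (```SQLSCRIPT```, ```SECRET```) mirror the job parameters described in the appendix, and the secret's JSON keys are assumptions you would adapt to your own secret.

```
import sys
import json

import boto3
import snowflake.connector
from awsglue.utils import getResolvedOptions

# Job arguments: the S3 URI of the SQL script and the Secrets Manager secret ARN
args = getResolvedOptions(sys.argv, ["SQLSCRIPT", "SECRET"])

# Fetch Snowflake credentials from AWS Secrets Manager
secret = json.loads(
    boto3.client("secretsmanager").get_secret_value(SecretId=args["SECRET"])["SecretString"]
)

# Read the unload SQL script (the COPY INTO @stage statements) from Amazon S3
bucket, key = args["SQLSCRIPT"].replace("s3://", "").split("/", 1)
sql_text = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

# Run each COPY command on Snowflake; the Snowflake warehouse does the heavy lifting
conn = snowflake.connector.connect(
    account=secret["account"],      # assumed secret keys; adjust to your secret's JSON
    user=secret["user"],
    password=secret["password"],
    warehouse=secret["warehouse"],
)
cur = conn.cursor()
try:
    for statement in (s.strip() for s in sql_text.split(";")):
        if statement:
            cur.execute(statement)
finally:
    cur.close()
    conn.close()
```

The load job follows the same pattern against Amazon Redshift; a sketch of its parameter handling appears later in this post.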
Both jobs are orchestrated using AWS Glue workflows, as shown in the following screenshot. The workflow pushes data processing logic down to the respective data warehouses by running COPY commands on the databases themselves, minimizing the processing capacity required by AWS Glue to just the resources needed to run the Python scripts. The COPY commands load data in parallel both to and from Amazon S3, providing one of the fastest and most scalable mechanisms to transfer data from Snowflake to Amazon Redshift.

Because all the heavy lifting around data processing is pushed down to the data warehouses, this solution is designed to provide a cost-optimized and highly performant mechanism to migrate a large volume of data from Snowflake to Amazon Redshift with ease.

![image.png](2)

The entire solution is packaged in an [AWS CloudFormation](https://aws.amazon.com/cloudformation/) template for simplicity of deployment and automatic provisioning of most of the required resources and permissions.

The high-level steps to implement the solution are as follows:

1. Generate the Snowflake SQL file.
2. Deploy the CloudFormation template to provision the required resources and permissions.
3. Provide Snowflake access to the newly created S3 bucket.
4. Run the AWS Glue workflow to migrate the data.


#### **Prerequisites**


Before you get started, you can optionally build the latest version of the Snowflake Connector for Python package locally and generate the wheel (.whl) package. For instructions, refer to [How to build](https://github.com/snowflakedb/snowflake-connector-python#how-to-build).

If you don’t provide the latest version of the package, the CloudFormation template uses a pre-built .whl file that may not be on the most current version of the Snowflake Connector for Python.

By default, the CloudFormation template migrates data from all tables in the ```TPCH_SF1``` schema of the ```SNOWFLAKE_SAMPLE_DATA``` database, which is a sample dataset provided by Snowflake when an account is created. The following stored procedure is used to dynamically generate the Snowflake COPY commands required to migrate the dataset to Amazon S3. It accepts the database name, schema name, and stage name as parameters.

```
CREATE OR REPLACE PROCEDURE generate_copy(db_name VARCHAR, schema_name VARCHAR, stage_name VARCHAR)
 returns varchar not null
 language javascript
 as
 $$
var return_value = "";
var sql_query = "select table_catalog, table_schema, lower(table_name) as table_name from " + DB_NAME + ".information_schema.tables where table_schema = '" + SCHEMA_NAME + "'" ;
 var sql_statement = snowflake.createStatement(
 {
 sqlText: sql_query
 }
 );
/* Creates result set */
var result_scan = sql_statement.execute();
while (result_scan.next()) {
 return_value += "\n";
 return_value += "COPY INTO @"
 return_value += STAGE_NAME
 return_value += "/"
 return_value += result_scan.getColumnValue(3);
 return_value += "/"
 return_value += "\n";
 return_value += "FROM ";
 return_value += result_scan.getColumnValue(1);
 return_value += "." + result_scan.getColumnValue(2);
 return_value += "." + result_scan.getColumnValue(3);
 return_value += "\n";
 return_value += "FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' COMPRESSION = GZIP)";
 return_value += "\n";
 return_value += "OVERWRITE = TRUE;"
 return_value += "\n";
 }
return return_value;
$$
;
```
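If you want to generate an unload script for a different database or schema, you can call the stored procedure and stage its output yourself. The following is a hedged sketch; the connection values, bucket, and key are placeholders, and it assumes the procedure above has been created in your current Snowflake database and schema.

```
import boto3
import snowflake.connector

# Placeholder connection values; replace with your own account details
conn = snowflake.connector.connect(
    account="<account>",
    user="<user>",
    password="<password>",
    warehouse="<warehouse>",
)
cur = conn.cursor()
try:
    # The procedure returns all generated COPY commands as a single string
    cur.execute("CALL generate_copy('SNOWFLAKE_SAMPLE_DATA', 'TPCH_SF1', 'unload_to_s3')")
    unload_sql = cur.fetchone()[0]
finally:
    cur.close()
    conn.close()

# Stage the generated script in S3 so it can be passed to the Glue job as SQLSCRIPT
boto3.client("s3").put_object(
    Bucket="<your-script-bucket>",
    Key="scripts/snowflake-unload.sql",
    Body=unload_sql.encode("utf-8"),
)
```

The resulting file is what you would reference in the **Snowflake Unload SQL Script** parameter of the CloudFormation template in the next section.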
#### **Deploy the required resources and permissions using AWS CloudFormation**


You can use the provided CloudFormation template to deploy this solution. This template automatically provisions an Amazon Redshift cluster with your desired configuration in a private subnet, maintaining a high standard of security.

1. Sign in to the [AWS Management Console](http://aws.amazon.com/console), preferably as an admin user.
2. Select your desired Region, preferably the same Region where your Snowflake instance is provisioned.
3. Choose **Launch Stack**:

 [![image.png](3)](https://console.aws.amazon.com/cloudformation/home?#/stacks/new?stackName=blog-resources&templateURL=https://aws-big-data-blog.s3.amazonaws.com/artifacts/BDB-2083-Migrate-from-Snowflake-to-Amazon-Redshift/scripts/SFtoRSMigrate.yaml)

4. Choose **Next**.
5. For **Stack name**, enter a meaningful name for the stack, for example, ```blog-resources```.

The **Parameters** section is divided into two subsections: **Source Snowflake Infrastructure** and **Target Redshift Configuration**.

6. For **Snowflake Unload SQL Script**, keep the default, which is the S3 location (URI) of a SQL file that migrates the sample data in the ```TPCH_SF1``` schema of the ```SNOWFLAKE_SAMPLE_DATA``` database, or enter the S3 location of your own script.
7. For **Data S3 Bucket**, enter a prefix for the name of the S3 bucket that is automatically provisioned to stage the Snowflake data, for example, ```sf-migrated-data```.
8. For **Snowflake Driver**, if applicable, enter the S3 location (URI) of the .whl package built earlier as a prerequisite. By default, it uses a pre-built .whl file.
9. For **Snowflake Account Name**, enter your Snowflake account name.

You can use the following query in Snowflake to return your Snowflake account name:

```
SELECT CURRENT_ACCOUNT();
```

10. For **Snowflake Username**, enter your user name to connect to the Snowflake account.
11. For **Snowflake Password**, enter the password for the preceding user.
12. For **Snowflake Warehouse Name**, enter the warehouse name for running the SQL queries.

Make sure the aforementioned user has access to the warehouse.

13. For **Snowflake Database Name**, enter the database name. The default is ```SNOWFLAKE_SAMPLE_DATA```.
14. For **Snowflake Schema Name**, enter the schema name. The default is ```TPCH_SF1```.

![image.png](4)

15. For **VPC CIDR Block**, enter the desired CIDR block for the VPC that hosts the Redshift cluster. The default is ```10.0.0.0/16```.
16. For **Subnet 1 CIDR Block**, enter the CIDR block of the first subnet. The default is ```10.0.0.0/24```.
17. For **Subnet 2 CIDR Block**, enter the CIDR block of the second subnet. The default is ```10.0.1.0/24```.
18. For **Redshift Load SQL Script**, keep the default, which is the S3 location (URI) of a SQL file that loads the sample data from Amazon S3 into Amazon Redshift, or enter the S3 location of your own script.

The following database view in Amazon Redshift is used to dynamically generate the Redshift COPY commands required to migrate the dataset from Amazon S3. It accepts the schema name as the filter criteria.

```
CREATE OR REPLACE VIEW v_generate_copy
AS
SELECT
 schemaname ,
 tablename ,
 seq ,
 ddl
FROM
 (
 SELECT
 table_id ,
 schemaname ,
 tablename ,
 seq ,
 ddl
 FROM
 (
 --COPY TABLE
 SELECT
 c.oid::bigint as table_id ,
 n.nspname AS schemaname ,
 c.relname AS tablename ,
 0 AS seq ,
 'COPY ' + n.nspname + '.' + c.relname + ' FROM ' AS ddl
 FROM
 pg_namespace AS n
 INNER JOIN
 pg_class AS c
 ON
 n.oid = c.relnamespace
 WHERE
 c.relkind = 'r'
 --COPY TABLE continued
 UNION
 SELECT
 c.oid::bigint as table_id ,
 n.nspname AS schemaname ,
 c.relname AS tablename ,
 2 AS seq ,
 '''${' + '2}' + c.relname + '/'' iam_role ''${' + '1}'' gzip delimiter ''|'' EMPTYASNULL REGION ''us-east-1''' AS ddl
 FROM
 pg_namespace AS n
 INNER JOIN
 pg_class AS c
 ON
 n.oid = c.relnamespace
 WHERE
 c.relkind = 'r'
 --END SEMICOLON
 UNION
 SELECT
 c.oid::bigint as table_id ,
 n.nspname AS schemaname,
 c.relname AS tablename ,
 600000005 AS seq ,
 ';' AS ddl
 FROM
 pg_namespace AS n
 INNER JOIN
 pg_class AS c
 ON
 n.oid = c.relnamespace
 WHERE
 c.relkind = 'r'
 )
 ORDER BY
 table_id ,
 schemaname,
 tablename ,
 seq
 );

SELECT ddl
FROM v_generate_copy
WHERE schemaname = 'tpch_sf1';
```

19. For **Redshift Database Name**, enter your desired database name, for example, ```dev```.
20. For **Number of Redshift Nodes**, enter the desired number of compute nodes, for example, ```2```.
21. For **Redshift Node Type**, choose the desired node type, for example, **ra3.4xlarge**.
22. For **Redshift Password**, enter your desired password with the following [constraints](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_USER.html): it must be 8–64 characters in length, and contain at least one uppercase letter, one lowercase letter, and one number.
23. For **Redshift Port**, enter the Amazon Redshift port number to connect to. The default port is ```5439```.

![image.png](5)

![image.png](6)

24. Choose **Next**.
25. Review and choose **Create stack**.

It takes around 5 minutes for the template to finish creating all resources and permissions. Most of the resources have the prefix of the stack name you specified for easy identification. For more details on the deployed resources, see the appendix at the end of this post.
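The ```${1}``` and ```${2}``` placeholders in the COPY commands generated by this view are filled in at run time by the second Glue job from its ```PARAMS``` argument (the Amazon Redshift IAM role and the S3 staging path; the job parameters are described in the appendix). The following is a minimal sketch of that substitution and execution logic, assuming the ```redshift_connector``` driver (which you may need to supply to the job) and illustrative secret keys; it is not the exact script deployed by the template.

```
import sys
import json

import boto3
import redshift_connector
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["SQLSCRIPT", "SECRET", "PARAMS"])

# Fetch Amazon Redshift credentials from AWS Secrets Manager
secret = json.loads(
    boto3.client("secretsmanager").get_secret_value(SecretId=args["SECRET"])["SecretString"]
)

# Read the load SQL script (the generated COPY commands) from Amazon S3
bucket, key = args["SQLSCRIPT"].replace("s3://", "").split("/", 1)
sql_text = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

# Substitute ${1}, ${2}, ... with the comma-separated PARAMS values,
# for example "arn:aws:iam::<account-id>:role/<redshift-role>,s3://<data-bucket>/"
for i, value in enumerate(args["PARAMS"].split(","), start=1):
    sql_text = sql_text.replace("${%d}" % i, value.strip())

conn = redshift_connector.connect(
    host=secret["host"],            # assumed secret keys; adjust to your secret's JSON
    database=secret["database"],
    user=secret["username"],
    password=secret["password"],
    port=int(secret.get("port", 5439)),
)
conn.autocommit = True
cur = conn.cursor()
try:
    for statement in (s.strip() for s in sql_text.split(";")):
        if statement:
            cur.execute(statement)  # each COPY runs on Redshift compute, loading from S3 in parallel
finally:
    cur.close()
    conn.close()
```

After substitution, a generated statement looks roughly like ```COPY tpch_sf1.customer FROM 's3://<data-bucket>/customer/' iam_role '<role-arn>' gzip delimiter '|' EMPTYASNULL REGION 'us-east-1';```.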
#### **Create an IAM role and external Amazon S3 stage for Snowflake access to the data S3 bucket**


In order for Snowflake to access the ```TargetDataS3Bucket``` created earlier by the CloudFormation template, you must create an [AWS Identity and Access Management](http://aws.amazon.com/iam) (IAM) role and an external Amazon S3 stage that grant Snowflake access to the S3 bucket. For instructions, refer to [Configuring Secure Access to Amazon S3](https://docs.snowflake.com/en/user-guide/data-load-s3-config.html).

When you create the external stage in Snowflake, use the value for ```TargetDataS3Bucket``` on the **Outputs** tab of your deployed CloudFormation stack as the Amazon S3 URL of your stage.

![image.png](7)

Make sure to name the external stage ```unload_to_s3``` if you’re migrating the sample data using the default scripts provided in the CloudFormation template.


#### **Convert Snowflake tables to Amazon Redshift**


You can run the following DDL statements to create the TPCH_SF1 schema objects in Amazon Redshift. You can also use the [AWS Schema Conversion Tool](https://aws.amazon.com/dms/schema-conversion-tool/) (AWS SCT) to convert Snowflake custom objects to Amazon Redshift. For instructions on converting your schema, refer to [Accelerate Snowflake to Amazon Redshift migration using AWS Schema Conversion Tool](https://aws.amazon.com/blogs/big-data/accelerate-snowflake-to-amazon-redshift-migration-using-aws-schema-conversion-tool/).

```
CREATE SCHEMA TPCH_SF1;
SET SEARCH_PATH to TPCH_SF1;
CREATE TABLE customer (
 c_custkey int8 not null ,
 c_name varchar(25) not null,
 c_address varchar(40) not null,
 c_nationkey int4 not null,
 c_phone char(15) not null,
 c_acctbal numeric(12,2) not null,
 c_mktsegment char(10) not null,
 c_comment varchar(117) not null,
 Primary Key(C_CUSTKEY)
) ;

CREATE TABLE lineitem (
 l_orderkey int8 not null ,
 l_partkey int8 not null,
 l_suppkey int4 not null,
 l_linenumber int4 not null,
 l_quantity numeric(12,2) not null,
 l_extendedprice numeric(12,2) not null,
 l_discount numeric(12,2) not null,
 l_tax numeric(12,2) not null,
 l_returnflag char(1) not null,
 l_linestatus char(1) not null,
 l_shipdate date not null ,
 l_commitdate date not null,
 l_receiptdate date not null,
 l_shipinstruct char(25) not null,
 l_shipmode char(10) not null,
 l_comment varchar(44) not null,
 Primary Key(L_ORDERKEY, L_LINENUMBER)
) ;

CREATE TABLE nation (
 n_nationkey int4 not null,
 n_name char(25) not null ,
 n_regionkey int4 not null,
 n_comment varchar(152) not null,
 Primary Key(N_NATIONKEY)
) ;

CREATE TABLE orders (
 o_orderkey int8 not null,
 o_custkey int8 not null,
 o_orderstatus char(1) not null,
 o_totalprice numeric(12,2) not null,
 o_orderdate date not null,
 o_orderpriority char(15) not null,
 o_clerk char(15) not null,
 o_shippriority int4 not null,
 o_comment varchar(79) not null,
 Primary Key(O_ORDERKEY)
) ;

CREATE TABLE part (
 p_partkey int8 not null ,
 p_name varchar(55) not null,
 p_mfgr char(25) not null,
 p_brand char(10) not null,
 p_type varchar(25) not null,
 p_size int4 not null,
 p_container char(10) not null,
 p_retailprice numeric(12,2) not null,
 p_comment varchar(23) not null,
 PRIMARY KEY (P_PARTKEY)
) ;

CREATE TABLE partsupp (
 ps_partkey int8 not null,
 ps_suppkey int4 not null,
 ps_availqty int4 not null,
 ps_supplycost numeric(12,2) not null,
 ps_comment varchar(199) not null,
 Primary Key(PS_PARTKEY, PS_SUPPKEY)
) ;

CREATE TABLE region (
 r_regionkey int4 not null,
 r_name char(25) not null ,
 r_comment varchar(152) not null,
 Primary Key(R_REGIONKEY)
) ;

CREATE TABLE supplier (
 s_suppkey int4 not null,
 s_name char(25) not null,
 s_address varchar(40) not null,
 s_nationkey int4 not null,
 s_phone char(15) not null,
 s_acctbal numeric(12,2) not null,
 s_comment varchar(101) not null,
 Primary Key(S_SUPPKEY)
);
```


#### **Run an AWS Glue workflow for data migration**


When you’re ready to start the data migration, complete the following steps:

1. On the AWS Glue console, choose **Workflows** in the navigation pane.
2. Select the workflow to run (*<stack name>*-```snowflake-to-redshift-migration```).
3. On the **Actions** menu, choose **Run**.

![image.png](8)

4. To check the status of the workflow, choose the workflow, and on the **History** tab, select the **Run ID** and choose **View run details**.

![image.png](9)

5. When the workflow is complete, navigate to the Amazon Redshift console and launch the Amazon Redshift query editor v2 to verify the successful migration of data.
6. Run the following query in Amazon Redshift to get row counts of all tables migrated from Snowflake to Amazon Redshift. Make sure to adjust the ```table_schema``` value accordingly if you’re not migrating the sample data.

```
SELECT tab.table_schema,
 tab.table_name,
 nvl(tinf.tbl_rows,0) tbl_rows,
 nvl(tinf.size,0) size
FROM svv_tables tab
LEFT JOIN svv_table_info tinf
 on tab.table_schema = tinf.schema
 and tab.table_name = tinf."table"
WHERE tab.table_type = 'BASE TABLE'
 and tab.table_schema in ('tpch_sf1')
ORDER BY tbl_rows;
```

![image.png](10)

7. Run the following query in Snowflake to compare and validate the data:

```
USE DATABASE snowflake_sample_data;
SELECT TABLE_CATALOG,
 TABLE_SCHEMA,
 TABLE_NAME,
 ROW_COUNT,
 BYTES AS SIZE,
 COMMENT
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'TPCH_SF1'
ORDER BY ROW_COUNT;
```

![image.png](11)
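You can also start and monitor the workflow programmatically instead of from the console. The following is a small sketch using boto3; the workflow name shown assumes the example stack name used in this post (the pattern is *<stack name>*-```snowflake-to-redshift-migration```).

```
import time

import boto3

glue = boto3.client("glue")
workflow_name = "blog-resources-snowflake-to-redshift-migration"  # <stack name>-snowflake-to-redshift-migration

# Start the workflow run
run_id = glue.start_workflow_run(Name=workflow_name)["RunId"]

# Poll until the workflow run reaches a terminal state
while True:
    status = glue.get_workflow_run(Name=workflow_name, RunId=run_id)["Run"]["Status"]
    print(f"Workflow run {run_id}: {status}")
    if status in ("COMPLETED", "STOPPED", "ERROR"):
        break
    time.sleep(30)
```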
#### **Clean up**


To avoid incurring future charges, delete the resources you created as part of the CloudFormation stack by navigating to the AWS CloudFormation console, selecting the stack ```blog-resources```, and choosing **Delete**.
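If you prefer to clean up programmatically, a short boto3 snippet does the same thing. Note that if the data S3 bucket still contains migrated objects, stack deletion may fail until the bucket is emptied, depending on how the template handles that bucket.

```
import boto3

# Deletes the CloudFormation stack and the resources it created
boto3.client("cloudformation").delete_stack(StackName="blog-resources")
```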
#### **Conclusion**


In this post, we discussed how to perform an efficient, fast, and cost-effective migration from Snowflake to Amazon Redshift. Migrations from one data warehouse environment to another can typically be very time-consuming and resource-intensive; this solution uses the power of cloud-based compute by pushing down the processing to the respective warehouses. Orchestrating this migration with the AWS Glue Python shell provides additional cost optimization.

With this solution, you can facilitate your migration from Snowflake to Amazon Redshift. If you’re interested in further exploring the potential of using Amazon Redshift, please reach out to your [AWS Account Team](https://aws.amazon.com/contact-us/) for a proof of concept.


#### **Appendix: Resources deployed by AWS CloudFormation**


The CloudFormation stack deploys the following resources in your AWS account:

- **Networking resources** – [Amazon Virtual Private Cloud](http://aws.amazon.com/vpc) (Amazon VPC), subnets, ACL, and security group.
- **Amazon S3 bucket** – This is referenced as ```TargetDataS3Bucket``` on the **Outputs** tab of the CloudFormation stack. This bucket holds the data being migrated from Snowflake to Amazon Redshift.
- **AWS Secrets Manager secrets** – Two secrets in [AWS Secrets Manager](https://aws.amazon.com/secrets-manager/) store credentials for Snowflake and Amazon Redshift.
- **VPC endpoints** – Two VPC endpoints are deployed to establish a private connection from VPC resources like AWS Glue to services that run outside of the VPC, such as Secrets Manager and Amazon S3.
- **IAM roles** – IAM roles for AWS Glue, Lambda, and Amazon Redshift. If the CloudFormation template is to be deployed in a production environment, you need to adjust the IAM policies so they’re not as permissive as presented in this post (which were set for simplicity and demonstration). In particular, AWS Glue and Amazon Redshift don’t require all the actions granted in the ```*FullAccess``` policies, which would be considered overly permissive.
- **Amazon Redshift cluster** – An Amazon Redshift cluster is created in a private subnet, so it isn’t publicly accessible.
- **AWS Glue connection** – The connection for Amazon Redshift makes sure that the AWS Glue job runs within the same VPC as Amazon Redshift. This also ensures that AWS Glue can access the Amazon Redshift cluster in a private subnet.
- **AWS Glue jobs** – Two AWS Glue Python shell jobs are created:
  - **<stack name>-glue-snowflake-unload** – The first job runs the SQL script in Snowflake to copy data from the source database to Amazon S3. The Python script is available in [S3](https://aws-big-data-blog.s3.amazonaws.com/artifacts/BDB-2083-Migrate-from-Snowflake-to-Amazon-Redshift/scripts/snowflake-python-job.py). The Snowflake job accepts two parameters:
    - **SQLSCRIPT** – The Amazon S3 location of the SQL script to run in Snowflake to migrate data to Amazon S3. This is referenced as the **Snowflake Unload SQL Script** parameter in the input section of the CloudFormation template.
    - **SECRET** – The Secrets Manager ARN that stores the Snowflake connection details.
  - **<stack name>-glue-redshift-load** – The second job runs another SQL script in Amazon Redshift to copy data from Amazon S3 to the target Amazon Redshift database. The Python script is available in [S3](https://aws-big-data-blog.s3.amazonaws.com/artifacts/BDB-2083-Migrate-from-Snowflake-to-Amazon-Redshift/scripts/redshift-python-job.py). The Amazon Redshift job accepts three parameters:
    - **SQLSCRIPT** – The Amazon S3 location of the SQL script to run in Amazon Redshift to migrate data from Amazon S3. If you provide a custom SQL script to migrate the Snowflake data to Amazon S3 (as mentioned in the prerequisites), the file location is referenced as LoadFileLocation on the **Outputs** tab of the CloudFormation stack.
    - **SECRET** – The Secrets Manager ARN that stores the Amazon Redshift connection details.
    - **PARAMS** – Any additional parameters required for the SQL script, including the Amazon Redshift IAM role used in the COPY commands and the S3 bucket staging the Snowflake data. Multiple parameter values can be provided, separated by a comma.
- **AWS Glue workflow** – The orchestration of the Snowflake and Amazon Redshift AWS Glue Python shell jobs is managed via an AWS Glue workflow. The workflow *<stack name>*-```snowflake-to-redshift-migration``` performs the actual migration of the data when you run it.
About the Authors

![image.png](12)

**Raks Khare** is an Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers architect data analytics solutions at scale on the AWS platform.

![image.png](13)

**Julia Beck** is an Analytics Specialist Solutions Architect at AWS. She supports customers in validating analytics solutions by architecting proof of concept workloads designed to meet their specific needs.