Accelerate your data warehouse migration to Amazon Redshift – Part 6

{"value":"This is the sixth in a series of posts. We’re excited to share dozens of new features to automate your schema conversion; preserve your investment in existing scripts, reports, and applications; accelerate query performance; and potentially simplify your migrations from legacy data warehouses to Amazon Redshift.\n\nCheck out all the previous posts in this series:\n\n- [Accelerate your data warehouse migration to Amazon Redshift – Part 1](https://aws.amazon.com/cn/blogs/database/part-1-accelerate-your-data-warehouse-migration-to-amazon-redshift/) to learn more about macro conversion, case-insensitive string comparison, and other new features\n- [Accelerate your data warehouse migration to Amazon Redshift – Part 2](https://aws.amazon.com/cn/blogs/big-data/part-2-accelerate-your-data-warehouse-migration-to-amazon-redshift/) to learn about automation for proprietary data types\n- [Accelerate your data warehouse migration to Amazon Redshift – Part 3](https://aws.amazon.com/cn/blogs/big-data/part-3-accelerate-your-data-warehouse-migration-to-amazon-redshift/) to learn about automation for proprietary SQL statements\n- Accelerate your data warehouse migration to Amazon Redshift – Part 4 to learn about new options for database scripting\n- [Accelerate your data warehouse migration to Amazon Redshift – Part 5](https://aws.amazon.com/cn/blogs/big-data/part-5-accelerate-your-data-warehouse-migration-to-amazon-redshift/) to learn about automation for SET tables\n- Accelerate your data warehouse migration to Amazon Redshift – Part 6 to learn how to manage uniqueness constraints like primary keys\n\nAmazon Redshift is the cloud data warehouse of choice for [tens of thousands of customers](https://aws.amazon.com/cn/redshift/customer-success/?awsf.customer-references-location=*all&awsf.customer-references-segment=*all&awsf.customer-references-industry=*all) who use it to analyze exabytes of data to gain business insights. With Amazon Redshift, you can query data across your data warehouse, operational data stores, and data lake using standard SQL. You can also integrate other Amazon Web Services services such as [Amazon EMR](https://aws.amazon.com/cn/emr/), Amazon Athena, Amazon SageMaker, Amazon Web Services Glue, [Amazon Web Services Lake Formation](https://aws.amazon.com/cn/lake-formation/), and [Amazon Kinesis](https://aws.amazon.com/cn/kinesis/) to use all the analytic capabilities in the Amazon Web Services Cloud.\n\nMigrating a data warehouse can be a complex undertaking. Your legacy workload might rely on proprietary features that aren’t directly supported by a modern data warehouse like Amazon Redshift. For example, some data warehouses enforce primary key constraints, making a tradeoff with DML performance. Amazon Redshift lets you define a primary key but uses the constraint for query optimization purposes only. If you use Amazon Redshift, or are migrating to Amazon Redshift, you may need a mechanism to check that primary key constraints are not being violated by extract, transform, and load (ETL) processes.\n\nIn this post, we describe two design patterns that you can use to accomplish this efficiently. We also show you how to use the Amazon Web Services Schema Conversion Tool (Amazon Web Services SCT) to automatically apply the design patterns to your SQL code.\n\nWe start by defining the semantics to address. Then we describe the design patterns and analyze their performance. 
#### **Join**

The first design pattern simply joins the staging and target tables. If any rows are returned, then the staging and target tables share a primary key value.

Suppose we have staging and target tables defined as follows:

```
CREATE TABLE stg ( 
  pk_col INTEGER 
, payload VARCHAR(100) 
, PRIMARY KEY (pk_col)
); 

CREATE TABLE tgt ( 
  pk_col INTEGER 
, payload VARCHAR(100) 
, PRIMARY KEY (pk_col)
);
```

We can use the following query to detect any duplicate primary key values:

```
SELECT count(1) 
FROM stg, tgt 
WHERE tgt.pk_col = stg.pk_col;
```

If the primary key has multiple columns, then the WHERE condition can be extended:

```
SELECT count(1)
FROM stg, tgt
WHERE
    tgt.pk_col1 = stg.pk_col1
AND tgt.pk_col2 = stg.pk_col2
AND …
;
```

There is one complication with this design pattern. If you allow NULL values in the primary key column, then you need to add special code to handle NULL-to-NULL matching:

```
SELECT count(1)
FROM stg, tgt
WHERE
   (tgt.pk_col = stg.pk_col) 
OR (tgt.pk_col IS NULL AND stg.pk_col IS NULL)
;
```

This is the primary disadvantage of this design pattern: the code can be ugly and unintuitive. Furthermore, if you have a multicolumn primary key, the code becomes even more complicated, as the following sketch shows.
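To illustrate, here is a sketch of the NULL-safe check for a hypothetical two-column key (`pk_col1`, `pk_col2`); these column names are for illustration only. Each key column needs its own "equal, or both NULL" condition, and the conditions are combined with AND:

```
-- Sketch: NULL-safe duplicate check for a hypothetical two-column primary key.
SELECT count(1)
FROM stg, tgt
WHERE
    ((tgt.pk_col1 = stg.pk_col1) OR (tgt.pk_col1 IS NULL AND stg.pk_col1 IS NULL))
AND ((tgt.pk_col2 = stg.pk_col2) OR (tgt.pk_col2 IS NULL AND stg.pk_col2 IS NULL))
;
```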
#### **INTERSECT**

The second design pattern that we describe uses the Amazon Redshift INTERSECT operation. INTERSECT is a set-based operation that determines whether two queries have any rows in common. You can check out [UNION, INTERSECT, and EXCEPT](https://docs.aws.amazon.com/redshift/latest/dg/r_UNION.html) in the Amazon Redshift documentation for more information.

We can determine whether the staging and target tables have duplicate PK values using the following query:

```
SELECT COUNT(1)
FROM (
  SELECT pk_col FROM stg
  INTERSECT
  SELECT pk_col FROM tgt
) a
;
```

If the primary key is composed of more than one column, you can simply modify the subqueries to include the additional columns:

```
SELECT COUNT(1)
FROM (
  SELECT pk_col1, pk_col2, …, pk_coln FROM stg
  INTERSECT
  SELECT pk_col1, pk_col2, …, pk_coln FROM tgt
) a
;
```

This pattern's main advantage is its simplicity. The code is easier to understand and validate than the join design pattern, and INTERSECT handles NULL-to-NULL matching implicitly, so you don't have to write any special code for NULL values.
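Whichever pattern you use, the check typically runs as a gate immediately before the INSERT-SELECT. The following stored procedure is a minimal sketch of that idea, not the code that Amazon Web Services SCT generates: the procedure name is hypothetical, it reuses the `stg` and `tgt` example tables, and it aborts the load when the INTERSECT check finds an overlapping key (you could instead choose to skip or merge the offending rows).

```
-- Sketch: gate an INSERT-SELECT load behind a duplicate-key check.
CREATE OR REPLACE PROCEDURE load_tgt_from_stg()
AS $$
DECLARE
  dup_count BIGINT;
BEGIN
  -- Count primary key values that appear in both staging and target.
  SELECT COUNT(1) INTO dup_count
  FROM (
    SELECT pk_col FROM stg
    INTERSECT
    SELECT pk_col FROM tgt
  ) a;

  IF dup_count > 0 THEN
    RAISE EXCEPTION 'Load aborted: % primary key value(s) already exist in tgt', dup_count;
  END IF;

  INSERT INTO tgt SELECT * FROM stg;
END;
$$ LANGUAGE plpgsql;
```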
#### **Performance**

We tested both design patterns using an Amazon Redshift cluster consisting of 12 ra3.4xlarge nodes. Each node has 12 vCPUs and 96 GB of memory.

We created the staging and target tables with the same distribution and sort keys to minimize data redistribution at query time.
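For example, with the single-column key from the earlier examples, co-located staging and target tables might be declared as follows. This is a sketch only; the specific keys used in our tests aren't shown here. The point is simply that `stg` and `tgt` declare matching DISTKEY and SORTKEY choices so the duplicate check can be resolved locally on each node.

```
-- Sketch: staging and target tables with matching distribution and sort keys.
CREATE TABLE stg ( 
  pk_col INTEGER 
, payload VARCHAR(100) 
, PRIMARY KEY (pk_col)
)
DISTKEY (pk_col)
SORTKEY (pk_col);

CREATE TABLE tgt ( 
  pk_col INTEGER 
, payload VARCHAR(100) 
, PRIMARY KEY (pk_col)
)
DISTKEY (pk_col)
SORTKEY (pk_col);
```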
We generated the test data artificially using a custom program. The target dataset contained 1 billion rows of data. We ran 10 trials of both algorithms using staging datasets that ranged from 20 million to 200 million rows, in 20-million-row increments.

In the following graph, the join design pattern is shown as a blue line. The intersect design pattern is shown as an orange line.

![image.png](https://dev-media.amazoncloud.cn/4f01f0d3b2354ac89b16b8568f510603_image.png)

You can observe that the performance of both algorithms is excellent. Each is able to detect duplicates in less than 1 second for all trials. The join algorithm outperforms the intersect algorithm, but both have excellent performance.

So, which algorithm should you choose? If you're developing a new application on Amazon Redshift, the intersect algorithm is probably the best choice. The built-in NULL matching and simple, intuitive code make it the best choice for new applications.

Conversely, if you need to squeeze every bit of performance from your application, then the join algorithm is your best option. In this case, you'll have to trade complexity, and perhaps extra effort in code review, for the extra performance.

#### **Automation**

If you're migrating an existing application to Amazon Redshift, you can use Amazon Web Services SCT to automatically convert your SQL code.

Let's see how this works. Suppose you have the following Teradata table. We use it as the target table in an INSERT-SELECT operation.

```
CREATE MULTISET TABLE testschema.test_pk_tgt (
  pk_col INTEGER NOT NULL
, payload VARCHAR(100) NOT NULL
, PRIMARY KEY (pk_col)
);
```

The staging table is identical to the target table, with the same columns and data types.

Next, we create a procedure to load the target table from the staging table. The procedure contains a single INSERT-SELECT statement:

```
REPLACE PROCEDURE testschema.insert_select()
BEGIN
INSERT INTO testschema.test_pk_tgt (pk_col, payload)
SELECT pk_col, payload FROM testschema.test_pk_stg;
END;
```

Now we use Amazon Web Services SCT to convert the Teradata stored procedure to Amazon Redshift. First, open **Settings**, **Conversion settings**, and ensure that you've selected the option **Automate Primary key / Unique constraint**. If you don't select this option, Amazon Web Services SCT won't add the PK check to the converted code.

![image.png](https://dev-media.amazoncloud.cn/d85e8d29d39f4a64bf33d35bb20d4e3b_image.png)

Next, choose the stored procedure in the source database tree, right-click, and choose **Convert schema**.

![image.png](https://dev-media.amazoncloud.cn/f39d511ff9754f37a7e37540e0d9fb79_image.png)

Amazon Web Services SCT converts the stored procedure (and the embedded INSERT-SELECT) using the join rewrite pattern. Because Amazon Web Services SCT performs the conversion for you, it opts for the join pattern to take advantage of its better performance.

![image.png](https://dev-media.amazoncloud.cn/813355aab8164130b3dca8bde63f998f_image.png)

And that's it. It's that simple. If you're migrating from Oracle or Teradata, you can use Amazon Web Services SCT to convert your INSERT-SELECT statements now. We'll be adding support for additional data warehouse engines soon.

In this post, we focused on INSERT-SELECT statements, but we're also happy to report that Amazon Web Services SCT can enforce primary key constraints for INSERT-VALUE and UPDATE statements. Amazon Web Services SCT injects the appropriate SELECT statement into your code to determine whether the INSERT-VALUE or UPDATE would create duplicate primary key values.
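To give a flavor of what such a guard can look like, here is a sketch for a single-row insert. This is an illustration, not the code that Amazon Web Services SCT generates; the procedure name and parameters are hypothetical, and it reuses the `tgt` example table from earlier.

```
-- Sketch: guard a single-row INSERT ... VALUES with a duplicate-key lookup.
CREATE OR REPLACE PROCEDURE insert_one_row(p_pk_col INTEGER, p_payload VARCHAR(100))
AS $$
DECLARE
  dup_count BIGINT;
BEGIN
  -- Does the incoming key already exist in the target table?
  SELECT COUNT(1) INTO dup_count
  FROM tgt
  WHERE tgt.pk_col = p_pk_col;

  IF dup_count > 0 THEN
    RAISE EXCEPTION 'Insert aborted: primary key value % already exists', p_pk_col;
  END IF;

  INSERT INTO tgt (pk_col, payload) VALUES (p_pk_col, p_payload);
END;
$$ LANGUAGE plpgsql;
```

A similar pre-check works for an UPDATE that modifies a primary key column: look up the new key values in the target before applying the change.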
Download the [latest version of Amazon Web Services SCT](https://docs.aws.amazon.com/SchemaConversionTool/latest/userguide/CHAP_Installing.html) and give it a try!

#### **Conclusion**

In this post, we showed you how to enforce primary keys in Amazon Redshift. If you're implementing a new application in Amazon Redshift, you can use the design patterns in this post to enforce the constraints as part of your ETL stream.

Also, if you're migrating from an Oracle or Teradata database, you can use Amazon Web Services SCT to automatically convert your SQL to Amazon Redshift. Amazon Web Services SCT will inject additional code into your SQL stream to enforce your unique key constraints, and thereby insulate your application code from any related changes.

We're happy to share these updates to help you in your data warehouse migration projects. In the meantime, you can learn more about [Amazon Redshift](https://aws.amazon.com/cn/redshift/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc) and Amazon Web Services SCT. Happy migrating!

#### **About the authors**

![image.png](https://dev-media.amazoncloud.cn/25ee44b2c62b4d0e8d533e5f6b449c63_image.png)

**Michael Soo** is a Principal Database Engineer with the Amazon Web Services Database Migration Service team. He builds products and services that help customers migrate their database workloads to the Amazon Web Services Cloud.

![image.png](https://dev-media.amazoncloud.cn/02d27ab8511945a0bcb46cfb5d8738bd_image.png)

**Illia Kravtsov** is a Database Developer with the Amazon Web Services Project Delta Migration team. He has over 10 years of experience in data warehouse development with Teradata and other MPP databases.