Perform parallel load for partitioned data into Amazon S3 using Amazon DMS

海外精选

海外精选的内容汇集了全球优质的亚马逊云科技相关技术内容。同时，内容中提到的“AWS” 是 “Amazon Web Services” 的缩写，在此网站不作为商标展示。

{"value":"With [Amazon Web Services Database Migration Service](https://aws.amazon.com/dms/) (Amazon Web Services DMS), you can migrate data between SQL, NoSQL, and text-based targets. With [Amazon Simple Storage Service](https://aws.amazon.com/s3/) (Amazon S3), you can store and protect any amount of data for virtually any use case, such as data lakes, cloud-native applications, and mobile apps. In this post, we demonstrate how to improve load times for migrating data from [Amazon Relational Database Service (Amazon RDS) for MySQL](https://aws.amazon.com/rds/mysql/) into Amazon S3, by using parallel load for partitioned data in Amazon Web Services DMS v3.4.6 and above. The same principals with [parallel load for partitioned data](https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.S3.html#CHAP_Target.S3.ParallelLoad) can apply to other relational database source engines.\n\n### **Partitioned tables**\n\nTable partitioning is the practice of dividing a database table into independent parts, which can be distributed across a file system. We don’t dive deep into table partitioning in this post, because there are many variations within different database engines, but it’s useful to know some of the benefits, as well as when partitioning a table can be effective.\n\nPartitions allow you to define conditions for the data in your table, and determine which partition data will belong to. For example, consider a table of orders for a busy ecommerce platform. If you partition on the month and year of the order date, all orders placed in January 2022 will be in one partition, and all orders for February 2022 will be in another. When you query order data for February 2022 (with appropriate predicates), data from partitions outside of this range can be excluded, which can result in more optimized queries. Similarly, when the January 2022 data is no longer required and has been archived, dropping the partition is a much cheaper and quicker operation than deleting the data, without affecting data for other months in other partitions.\n\n### **Parallel load for partitioned data**\n\nAmazon Web Services DMS allows you to configure a parallel full load of partitioned data within your migration task, when using Amazon S3 as a target and a [supported database engine as a source](https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.html). During the full load, data is migrated to the target using parallel threads and stored in subfolders mapped to the partitions of the source database objects. [Three parallel load rule types](https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TableMapping.SelectionTransformation.Tablesettings.html) are available. If the source object has partitions defined, you can use partitions-auto and partitions-list to automatically identify the partitions to load from the source metadata. For objects without partitions, but where the data can be defined by ranges or boundaries (for example, using date/time values), you can use partitions-range to specify the segments to be loaded. For more information on using partitions-range, see [Amazon Web Services Database Migration Service improves migration speeds by adding support for parallel full load and new LOB migration mechanisms](https://aws.amazon.com/blogs/database/introducing-aws-dms-replication-engine-version-3-1-2/).\n\n### **Testing performance of parallel load for partitioned data**\n\nAlthough you can partition almost any table with appropriate partition types and values, larger tables with high volumes of data typically see the most benefit. In this post, we use the MySQL [Employees sample database](https://dev.mysql.com/doc/employee/en/employees-introduction.html) as a source, which utilizes RANGE partitioning for the ```salaries```table. For more information on MySQL 5.7 partitioning types, refer to [Partitioning Types](https://dev.mysql.com/doc/refman/5.7/en/partitioning-types.html).\n\nAdditional data has been added within the ```salaries```table, bringing the total row count to just over 364 million rows and the table size to just over 25 GB. This database is running on an Amazon RDS t3.large instance. To perform the data migration, a t3.medium replication instance has been provisioned using Amazon Web Services DMS. A source endpoint has been created to connect to the source MySQL database, and a target endpoint has been created to connect to a target bucket in Amazon S3. For more information on setting up Amazon Web Services DMS replication instances and replication tasks, refer to the [Amazon Web Services DMS User Guide](https://docs.aws.amazon.com/dms/latest/userguide/Welcome.html).\n\nThis exercise consists of four separate runs of a full load task in Amazon Web Services DMS. The first two tasks are run on comparatively low-powered hardware. The first is run without parallel load to establish a baseline, and the second enables this feature in order to compare performance. The next two tasks are run on comparatively high-powered hardware, first without parallel load and then with, in order to understand the impact of both the setting itself and the hardware used.\n\nIn our first example, we use a database migration task to load the data from our source ```salaries```table into our target S3 bucket without utilizing parallel load. This task has been configured with all settings as default, with the exception of the [Commit rate during full load](https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TaskSettings.FullLoad.html) setting, which has been changed to 50,000 rows from the default of 10,000 rows.\n\nThe following image shows the full load tuning settings via the Amazon Web Services Console.\n\n![image.png](https://dev-media.amazoncloud.cn/74b9e22f1d8b4c5ca6bbed1306b4b711_image.png)\nThe following image shows the output for the DMS task without parallel load, which completes in 19 minutes and 7 seconds.\n\n![image.png](https://dev-media.amazoncloud.cn/d507e058cb61494997ff168f3dd27c80_image.png)\n\nNext, we repeat the process using the same resources, but with a database migration task configured to utilize parallel load for partitioned data. The following JSON describes the task configuration for Amazon Web Services DMS task mapping to specify data on source:\n\n```\n{\n \"rules\": [\n {\n \"rule-type\": \"selection\",\n \"rule-id\": \"373834480\",\n \"rule-name\": \"373834480\",\n \"object-locator\": {\n \"schema-name\": \"employees_partitioned\",\n \"table-name\": \"salaries\"\n },\n \"rule-action\": \"include\",\n \"filters\": []\n },\n {\n \"rule-type\": \"table-settings\",\n \"rule-id\": \"2\",\n \"rule-name\": \"2\",\n \"object-locator\": {\n \"schema-name\": \"employees_partitioned\",\n \"table-name\": \"salaries\"\n },\n \"parallel-load\": {\n \"type\": \"partitions-auto\"\n }\n }\n ]\n}\n```\n\nThe following image shows the output for the DMS task with parallel load, which completes in 17 minutes and 29 seconds.\n\n![image.png](https://dev-media.amazoncloud.cn/73e2d05c63a24416bc324293a517a973_image.png)\n\nWith parallel load enabled, the task runtime is reduced by 8.5%. The following screenshots are [Amazon CloudWatch](http://aws.amazon.com/cloudwatch) metrics showing an increase in average CPU utilization and network receive throughput on the Amazon Web Services DMS replication instance, as well as a higher average CPU utilization on the RDS DB instance.\n\nThe following CloudWatch metric shows the CPU utilization for the Amazon Web Services DMS instance.\n\n![image.png](https://dev-media.amazoncloud.cn/51f45c860be04516b9fece746f520560_image.png)\n\nThe following CloudWatch metric shows the network receive throughput for the Amazon Web Services DMS instance.\n\n![image.png](https://dev-media.amazoncloud.cn/8c1c5673f63846f99c27f2cbd4fbf1d1_image.png)\n\nThe following CloudWatch metric shows the CPU utilization for the RDS DB instance.\n\n![image.png](https://dev-media.amazoncloud.cn/0aab4b86530645788bed779aa90140e7_image.png)\n\nTo understand the impact of the chosen hardware, and the considerations to make when using the parallel load setting, we now repeat the same test using more powerful instances. The RDS DB instance is modified to an r5.8xlarge, and the Amazon Web Services DMS replication instance is modified to a c5.12xlarge.\n\nWhen we run a repeat of our first example on the new hardware, without parallel load, the task completes in 6 minutes and 37 seconds. The increased hardware capability has led to a reduction in load time against the first test of approximately 65%.\n\nThe following image shows the output for the DMS task without parallel load.\n\n![image.png](https://dev-media.amazoncloud.cn/71eefa4d9d74447fa8499b0f7463887b_image.png)\n\nLastly, we repeat our second example and utilize the parallel load feature. This task completes in 3 minutes and 36 seconds, with a reduction in load time of approximately 45% when compared with the previous task run on equivalent hardware.\n\nThe following image shows the output for the DMS task with parallel load.\n\n![image.png](https://dev-media.amazoncloud.cn/b2f52d2b20f24f87bea2f1bc3d36362a_image.png)\n\nThe following CloudWatch metrics show increases in network transmit throughput and CPU utilization on the Amazon Web Services DMS replication instance, as well as a higher number of connections to the source RDS DB instance, when the parallel load setting is enabled.\n\nThe following CloudWatch metric shows the network transmit throughput on the Amazon Web Services DMS instance.\n\n![image.png](https://dev-media.amazoncloud.cn/7c4f6b91528549c9955cb48ea9680460_image.png)\n\nThe following CloudWatch metric shows CPU utilization on the Amazon Web Services DMS instance.\n\n![image.png](https://dev-media.amazoncloud.cn/c078a7c78a464afe94fe8bc358858751_image.png)\n\n\nThe following CloudWatch metric shows the database connections on the RDS DB instance.\n\n![image.png](https://dev-media.amazoncloud.cn/1984339b80cb4e15bfddbc38bafbab95_image.png)\n\nThe following table summarizes our results.\n\n![image.png](https://dev-media.amazoncloud.cn/25d76bf7a7c34e378a7bc4579a25e8c8_image.png)\n\n### **Conclusion**\n\nIn this post, we demonstrated how to use parallel load for partitioned data in Amazon Web Services DMS version 3.4.6 to improve data migration times into Amazon S3. We also highlighted how the chosen hardware can affect performance, with more improvement realized on comparatively more powerful hardware. We recommend performing sufficient testing or proof of concept evaluations in a separate environment prior to modifying or enabling any settings or features for a production migration.\n\nTo learn more about using parallel load for partitioned data please see the [documentation](https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.S3.html#CHAP_Target.S3.ParallelLoad).\n\nDisclaimer: The testing done in this post is not a real performance test and your own workload may vary.\n\n#### **About the authors**\n\n![image.png](https://dev-media.amazoncloud.cn/55a7fe96032a4d4a8b0f7e9c406a6fd0_image.png)\n\n**Robert Daly** is a Senior Database Specialist Solutions architect at Amazon Web Services, focusing on RDS, Aurora and DMS Services. He has helped multiple enterprise customers move their databases to Amazon Web Services, providing assistance on performance and best practices. Robert enjoys helping customers build technical solutions to solve business problems.\n\n![image.png](https://dev-media.amazoncloud.cn/aa9806db2464407487d8aa060daa0e87_image.png)\n\n**Lawrence Britt** is a Database Specialist Solutions Architect at Amazon Web Services based in London. In his role he helps customers across the UK and Ireland to migrate, modernise, and optimise database solutions on Amazon Web Services.","render":"With <a href=\"https://aws.amazon.com/dms/\" target=\"_blank\">Amazon Web Services Database Migration Service</a> (Amazon Web Services DMS), you can migrate data between SQL, NoSQL, and text-based targets. With <a href=\"https://aws.amazon.com/s3/\" target=\"_blank\">Amazon Simple Storage Service</a> (Amazon S3), you can store and protect any amount of data for virtually any use case, such as data lakes, cloud-native applications, and mobile apps. In this post, we demonstrate how to improve load times for migrating data from <a href=\"https://aws.amazon.com/rds/mysql/\" target=\"_blank\">Amazon Relational Database Service (Amazon RDS) for MySQL</a> into Amazon S3, by using parallel load for partitioned data in Amazon Web Services DMS v3.4.6 and above. The same principals with <a href=\"https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.S3.html#CHAP_Target.S3.ParallelLoad\" target=\"_blank\">parallel load for partitioned data</a> can apply to other relational database source engines.\n<h3><a id=\"Partitioned_tables_2\"></a>Partitioned tables</h3>\nTable partitioning is the practice of dividing a database table into independent parts, which can be distributed across a file system. We don’t dive deep into table partitioning in this post, because there are many variations within different database engines, but it’s useful to know some of the benefits, as well as when partitioning a table can be effective.\nPartitions allow you to define conditions for the data in your table, and determine which partition data will belong to. For example, consider a table of orders for a busy ecommerce platform. If you partition on the month and year of the order date, all orders placed in January 2022 will be in one partition, and all orders for February 2022 will be in another. When you query order data for February 2022 (with appropriate predicates), data from partitions outside of this range can be excluded, which can result in more optimized queries. Similarly, when the January 2022 data is no longer required and has been archived, dropping the partition is a much cheaper and quicker operation than deleting the data, without affecting data for other months in other partitions.\n<h3><a id=\"Parallel_load_for_partitioned_data_8\"></a>Parallel load for partitioned data</h3>\nAmazon Web Services DMS allows you to configure a parallel full load of partitioned data within your migration task, when using Amazon S3 as a target and a <a href=\"https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.html\" target=\"_blank\">supported database engine as a source</a>. During the full load, data is migrated to the target using parallel threads and stored in subfolders mapped to the partitions of the source database objects. <a href=\"https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TableMapping.SelectionTransformation.Tablesettings.html\" target=\"_blank\">Three parallel load rule types</a> are available. If the source object has partitions defined, you can use partitions-auto and partitions-list to automatically identify the partitions to load from the source metadata. For objects without partitions, but where the data can be defined by ranges or boundaries (for example, using date/time values), you can use partitions-range to specify the segments to be loaded. For more information on using partitions-range, see <a href=\"https://aws.amazon.com/blogs/database/introducing-aws-dms-replication-engine-version-3-1-2/\" target=\"_blank\">Amazon Web Services Database Migration Service improves migration speeds by adding support for parallel full load and new LOB migration mechanisms</a>.\n<h3><a id=\"Testing_performance_of_parallel_load_for_partitioned_data_12\"></a>Testing performance of parallel load for partitioned data</h3>\nAlthough you can partition almost any table with appropriate partition types and values, larger tables with high volumes of data typically see the most benefit. In this post, we use the MySQL <a href=\"https://dev.mysql.com/doc/employee/en/employees-introduction.html\" target=\"_blank\">Employees sample database</a> as a source, which utilizes RANGE partitioning for the <code>salaries</code>table. For more information on MySQL 5.7 partitioning types, refer to <a href=\"https://dev.mysql.com/doc/refman/5.7/en/partitioning-types.html\" target=\"_blank\">Partitioning Types</a>.\nAdditional data has been added within the <code>salaries</code>table, bringing the total row count to just over 364 million rows and the table size to just over 25 GB. This database is running on an Amazon RDS t3.large instance. To perform the data migration, a t3.medium replication instance has been provisioned using Amazon Web Services DMS. A source endpoint has been created to connect to the source MySQL database, and a target endpoint has been created to connect to a target bucket in Amazon S3. For more information on setting up Amazon Web Services DMS replication instances and replication tasks, refer to the <a href=\"https://docs.aws.amazon.com/dms/latest/userguide/Welcome.html\" target=\"_blank\">Amazon Web Services DMS User Guide</a>.\nThis exercise consists of four separate runs of a full load task in Amazon Web Services DMS. The first two tasks are run on comparatively low-powered hardware. The first is run without parallel load to establish a baseline, and the second enables this feature in order to compare performance. The next two tasks are run on comparatively high-powered hardware, first without parallel load and then with, in order to understand the impact of both the setting itself and the hardware used.\nIn our first example, we use a database migration task to load the data from our source <code>salaries</code>table into our target S3 bucket without utilizing parallel load. This task has been configured with all settings as default, with the exception of the <a href=\"https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TaskSettings.FullLoad.html\" target=\"_blank\">Commit rate during full load</a> setting, which has been changed to 50,000 rows from the default of 10,000 rows.\nThe following image shows the full load tuning settings via the Amazon Web Services Console.\n<img src=\"https://dev-media.amazoncloud.cn/74b9e22f1d8b4c5ca6bbed1306b4b711_image.png\" alt=\"image.png\" /> \nThe following image shows the output for the DMS task without parallel load, which completes in 19 minutes and 7 seconds.\n<img src=\"https://dev-media.amazoncloud.cn/d507e058cb61494997ff168f3dd27c80_image.png\" alt=\"image.png\" />\nNext, we repeat the process using the same resources, but with a database migration task configured to utilize parallel load for partitioned data. The following JSON describes the task configuration for Amazon Web Services DMS task mapping to specify data on source:\n<pre><code class=\"lang-\">{\n "rules": [\n {\n "rule-type": "selection",\n "rule-id": "373834480",\n "rule-name": "373834480",\n "object-locator": {\n "schema-name": "employees_partitioned",\n "table-name": "salaries"\n },\n "rule-action": "include",\n "filters": []\n },\n {\n "rule-type": "table-settings",\n "rule-id": "2",\n "rule-name": "2",\n "object-locator": {\n "schema-name": "employees_partitioned",\n "table-name": "salaries"\n },\n "parallel-load": {\n "type": "partitions-auto"\n }\n }\n ]\n}\n</code></pre>\nThe following image shows the output for the DMS task with parallel load, which completes in 17 minutes and 29 seconds.\n<img src=\"https://dev-media.amazoncloud.cn/73e2d05c63a24416bc324293a517a973_image.png\" alt=\"image.png\" />\nWith parallel load enabled, the task runtime is reduced by 8.5%. The following screenshots are <a href=\"http://aws.amazon.com/cloudwatch\" target=\"_blank\">Amazon CloudWatch</a> metrics showing an increase in average CPU utilization and network receive throughput on the Amazon Web Services DMS replication instance, as well as a higher average CPU utilization on the RDS DB instance.\nThe following CloudWatch metric shows the CPU utilization for the Amazon Web Services DMS instance.\n<img src=\"https://dev-media.amazoncloud.cn/51f45c860be04516b9fece746f520560_image.png\" alt=\"image.png\" />\nThe following CloudWatch metric shows the network receive throughput for the Amazon Web Services DMS instance.\n<img src=\"https://dev-media.amazoncloud.cn/8c1c5673f63846f99c27f2cbd4fbf1d1_image.png\" alt=\"image.png\" />\nThe following CloudWatch metric shows the CPU utilization for the RDS DB instance.\n<img src=\"https://dev-media.amazoncloud.cn/0aab4b86530645788bed779aa90140e7_image.png\" alt=\"image.png\" />\nTo understand the impact of the chosen hardware, and the considerations to make when using the parallel load setting, we now repeat the same test using more powerful instances. The RDS DB instance is modified to an r5.8xlarge, and the Amazon Web Services DMS replication instance is modified to a c5.12xlarge.\nWhen we run a repeat of our first example on the new hardware, without parallel load, the task completes in 6 minutes and 37 seconds. The increased hardware capability has led to a reduction in load time against the first test of approximately 65%.\nThe following image shows the output for the DMS task without parallel load.\n<img src=\"https://dev-media.amazoncloud.cn/71eefa4d9d74447fa8499b0f7463887b_image.png\" alt=\"image.png\" />\nLastly, we repeat our second example and utilize the parallel load feature. This task completes in 3 minutes and 36 seconds, with a reduction in load time of approximately 45% when compared with the previous task run on equivalent hardware.\nThe following image shows the output for the DMS task with parallel load.\n<img src=\"https://dev-media.amazoncloud.cn/b2f52d2b20f24f87bea2f1bc3d36362a_image.png\" alt=\"image.png\" />\nThe following CloudWatch metrics show increases in network transmit throughput and CPU utilization on the Amazon Web Services DMS replication instance, as well as a higher number of connections to the source RDS DB instance, when the parallel load setting is enabled.\nThe following CloudWatch metric shows the network transmit throughput on the Amazon Web Services DMS instance.\n<img src=\"https://dev-media.amazoncloud.cn/7c4f6b91528549c9955cb48ea9680460_image.png\" alt=\"image.png\" />\nThe following CloudWatch metric shows CPU utilization on the Amazon Web Services DMS instance.\n<img src=\"https://dev-media.amazoncloud.cn/c078a7c78a464afe94fe8bc358858751_image.png\" alt=\"image.png\" />\nThe following CloudWatch metric shows the database connections on the RDS DB instance.\n<img src=\"https://dev-media.amazoncloud.cn/1984339b80cb4e15bfddbc38bafbab95_image.png\" alt=\"image.png\" />\nThe following table summarizes our results.\n<img src=\"https://dev-media.amazoncloud.cn/25d76bf7a7c34e378a7bc4579a25e8c8_image.png\" alt=\"image.png\" />\n<h3><a id=\"Conclusion_112\"></a>Conclusion</h3>\nIn this post, we demonstrated how to use parallel load for partitioned data in Amazon Web Services DMS version 3.4.6 to improve data migration times into Amazon S3. We also highlighted how the chosen hardware can affect performance, with more improvement realized on comparatively more powerful hardware. We recommend performing sufficient testing or proof of concept evaluations in a separate environment prior to modifying or enabling any settings or features for a production migration.\nTo learn more about using parallel load for partitioned data please see the <a href=\"https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.S3.html#CHAP_Target.S3.ParallelLoad\" target=\"_blank\">documentation</a>.\nDisclaimer: The testing done in this post is not a real performance test and your own workload may vary.\n<h4><a id=\"About_the_authors_120\"></a>About the authors</h4>\n<img src=\"https://dev-media.amazoncloud.cn/55a7fe96032a4d4a8b0f7e9c406a6fd0_image.png\" alt=\"image.png\" />\nRobert Daly is a Senior Database Specialist Solutions architect at Amazon Web Services, focusing on RDS, Aurora and DMS Services. He has helped multiple enterprise customers move their databases to Amazon Web Services, providing assistance on performance and best practices. Robert enjoys helping customers build technical solutions to solve business problems.\n<img src=\"https://dev-media.amazoncloud.cn/aa9806db2464407487d8aa060daa0e87_image.png\" alt=\"image.png\" />\nLawrence Britt is a Database Specialist Solutions Architect at Amazon Web Services based in London. In his role he helps customers across the UK and Ireland to migrate, modernise, and optimise database solutions on Amazon Web Services.\n"}

亚马逊云科技解决方案基于行业客户应用场景及技术领域的解决方案

联系亚马逊云科技专家