Improve performance of your bulk data import to Amazon RDS for MySQL

{"value":"We use bulk data loading to move data from a source to a destination database for migration or load testing. Bulk data importing into MySQL databases can take several minutes to hours and sometimes even days depending upon the amount of data that needs to be loaded. Creating a dump file takes a small amount of time, but loading the same into the database takes an extended period of time. This in turn causes a delay in any further activities that might have been planned to be carried out on the specific data or database.\n\nThere are several techniques that you can use to import data into [Amazon Relational Database Service (Amazon RDS) for MySQL](https://aws.amazon.com/rds/mysql/). The best approach depends on the source of the data, the amount of data, whether the import is done one time or is ongoing, and the amount of downtime a business can afford. For a list of techniques available to import data into Amazon RDS for MySQL, see [Importing data into a MySQL DB instance.](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/MySQL.Procedural.Importing.Other.html)\n\nIn this post, I discuss how to improve the speed of loading data into an RDS for MySQL instance. The recommendations are divided into three sections:\n\n- What to consider while creating the backup of the source database and loading the data into an RDS for MySQL instance\n- Database parameters that you can modify to improve load performance\n- Changes you can make on the infrastructure level\n\nAll the recommendations given in this post apply to Amazon RDS for MySQL 5.7 and 8.0.\n\nWhile this post is mostly focusing on importing into a new database instance, some of the recommendations can also be used for batch jobs, bulk writes, or ETL. The source can be a different database server, the same database server, or a set of flat files.\n\n### **Considerations when creating a backup and importing data**\n\nThe following are recommendations to consider while creating a backup of data from the source database and also while loading the data into the target database.\n\n#### **Physical backup vs. logical backup**\n\nYou can choose to create a physical or logical backup of your data depending on your use case and requirements. Here I discuss which backup is suitable in which scenario. For more information regarding backup types, refer to [Backup and Recovery Types.](https://dev.mysql.com/doc/refman/8.0/en/backup-types.html)\n\nPhysical backups consist of raw copies of the directories and files that store database contents. This type of backup is often used for larger datasets as it can allow for faster, less intrusive backup and faster recovery. You can take physical backups of a self-managed MySQL database running on premises or on [Amazon Elastic Compute Cloud](http://aws.amazon.com/ec2) (Amazon EC2) instances using the tool [Percona XtraBackup](https://www.percona.com/doc/percona-xtrabackup/LATEST/index.html). You can’t create physical backups of your RDS for MySQL database instances using Percona XtraBackup.\n\nLogical backups save information represented as a logical database structure (CREATE DATABASE, CREATE TABLE statements) and content (INSERT statements or delimited-text files). This type of backup is suitable for smaller amounts of data where you might edit the data values or table structure, or recreate the data on a different machine architecture. 

If you choose to proceed with a physical backup, you can create a backup of your database using Percona XtraBackup and restore it as an RDS for MySQL instance using the [Restore from Amazon S3](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/MySQL.Procedural.Importing.html#MySQL.Procedural.Importing.PerformingImport) option. However, with this option, you can't restore data to an already existing RDS for MySQL instance. Also, the parameter changes suggested in this post aren't relevant for improving performance when restoring physical backups, because this method restores data files directly on the server.

For detailed information and steps for performing a physical backup and restore to an RDS for MySQL instance, refer to [Restoring a backup into a MySQL DB instance](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/MySQL.Procedural.Importing.html).

If you choose to proceed with logical backups, consider the following recommendations:

- **Use flat files** – Importing data from flat files can be significantly faster than importing SQL dump files. You can load flat files into the database using the [LOAD DATA LOCAL INFILE](https://dev.mysql.com/doc/refman/8.0/en/load-data.html) statement, which reads rows from a text file into a table at a very high speed. Additionally, the flat files are created on a per-table basis, which directly helps in parallel data loading to the database. You can create flat files from on-premises or self-managed databases using `SELECT ... INTO OUTFILE` statements or by using mysqldump with the `--tab` option. However, the `SELECT ... INTO OUTFILE` statement and mysqldump with the `--tab` option don't work for RDS for MySQL databases because, being a managed service, access to the underlying server of the database is restricted. For RDS for MySQL, you can consider alternatives such as the following:

- **[Amazon Web Services Database Migration Service (Amazon Web Services DMS)](https://aws.amazon.com/dms/)** – You can use Amazon Web Services DMS to migrate data from an RDS for MySQL database instance to [Amazon Simple Storage Service (Amazon S3)](https://aws.amazon.com/s3/). When using Amazon S3 as a target in an Amazon Web Services DMS task, both full load and change data capture (CDC) data are written in comma-separated value (.csv) format by default. For more details, refer to [Using Amazon S3 as a target for Amazon Web Services Database Migration Service](https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.S3.html).
- **MySQL command line client** – You can use the MySQL command line client to create flat files. Sample command:

```
mysql -h <endpoint> -u <username> -p --silent --batch --quick --raw -e "select * from <database_name>.<table_name> limit 10" > <filename>.csv
```
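
Note that the `--batch` output from the preceding command is tab-separated, so when you load the file back you can state the delimiters explicitly. The following is a minimal sketch; the file, database, and table names are placeholders:

```
LOAD DATA LOCAL INFILE '/path/<filename>.csv'
INTO TABLE <database_name>.<table_name>
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';
```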

- **Create multiple files that can be loaded in parallel instead of using a single file** – Consider dividing the backup into multiple files (one file per table) instead of creating a single large file with all the databases and tables. Also, if the tables are large, you can split the table data into more than one file. Typically, a high watermark of 500 MB per file is recommended to keep the transaction size manageable. This also allows you to import files in parallel, which significantly reduces the time of the data import, ensures faster rollback, and makes it easier to resume in case of any issues. For example, you can create separate dump files for each table using the following command:

```
mysqldump -h <endpoint> -u <username> -p <database_name> <tablename> > <dumpfilename.sql>
```
You can import the file into the database using the following command:

```
mysql -h <endpoint> -u <username> -p <database_name> < <dumpfilename.sql>
```
As mentioned in the previous section, you can use the `SELECT ... INTO OUTFILE` statement on your on-premises or self-managed databases to create separate flat files for each table. For example:

```
SELECT * INTO OUTFILE '/path/<tablename>.csv' FROM <databasename>.<tablename>;
```
You can import this file into the database using the following command:

```
LOAD DATA LOCAL INFILE '/path/<tablename>.csv' INTO TABLE <databasename>.<tablename>;
```
You can also use third-party tools like mydumper, which has options such as `--rows` and `--chunk-filesize` that you can use to split files when dumping them. For more details, refer to [Mydumper Usage](https://github.com/mydumper/mydumper/blob/master/docs/mydumper_usage.rst).

- **Load data in primary key order** – Loading data in primary key order is a faster approach for inserting rows and is particularly important for tables that don't fit entirely within the buffer pool. To achieve this, take the backup in primary key order. For example:

```
mysqldump -h <endpoint> -u <username> -p --order-by-primary <database_name> <tablename> > <dumpfilename.sql>
```
The `--order-by-primary` option dumps each table's rows sorted by its primary key, or by its first unique index.

- **Load data without secondary and full-text indexes** – For large datasets, it's much faster to load your data into a table that has no secondary or full-text indexes and then create the indexes afterward, rather than load data into a table that already has them (see the sketch after this list). In the case of a full-text index, when a document is inserted, it's tokenized, and the individual words and associated data are inserted into the full-text index. This process can result in numerous small insertions into the auxiliary index tables, making concurrent access to these tables a point of contention and therefore slowing down the loading of data into tables.
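
To make the last recommendation concrete, the following is a minimal sketch of deferring index creation until after the load. The table definition, column names, index names, and file path are hypothetical placeholders:

```
-- Create the table with only its primary key
CREATE TABLE <databasename>.<tablename> (
  id  BIGINT NOT NULL,
  doc TEXT,
  PRIMARY KEY (id)
);

-- Bulk load the data first
LOAD DATA LOCAL INFILE '/path/<tablename>.csv' INTO TABLE <databasename>.<tablename>;

-- Add secondary and full-text indexes after the data is in place
ALTER TABLE <databasename>.<tablename> ADD INDEX idx_doc_prefix (doc(64));
ALTER TABLE <databasename>.<tablename> ADD FULLTEXT INDEX ft_doc (doc);
```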

### **Parameter configurations to improve load performance**

The following is a list of parameters that you can modify to improve performance while loading data into an RDS for MySQL instance. You can modify some of these via [parameter groups](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithParamGroups.html), and several of them can be modified at the session level.

Make sure that after the data import activity is complete, you revert the values of the following parameters to their default or previously set values. The suggested values might not be suitable when no data import is ongoing.

#### **Batch commits**

All user activity in InnoDB happens within a transaction. With `autocommit` enabled, such as `autocommit = 1` (which is the default setting in MySQL), each SQL statement is treated as a separate transaction. MySQL performs a commit after each SQL statement if that statement didn't return an error. At every transaction commit, the redo logs are flushed to disk, which can cause increased runtime overhead during the import operation.

To overcome this performance penalty, consider batching your commits. Batching statements improves throughput and therefore helps the performance of the data load. You can achieve this using bulk commit methods such as the following:

- **Use INSERT statements with multiple values** – This can be considerably faster than using single-row INSERT statements. See the following syntax:

```
INSERT INTO <tablename> VALUES (value1, value2), (value3, value4), ...;
```

- **Batch flat files with LOAD DATA** – You can load flat files into the database using the [LOAD DATA LOCAL INFILE](https://dev.mysql.com/doc/refman/8.0/en/load-data.html) statement. The flat files are created on a per-table basis, which directly helps in parallel data loading to the database.
- **Disable autocommit** – When you change the value of [autocommit](https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_autocommit) in the custom parameter group associated with the RDS instance, the change takes place at a global level. You can instead change the value of this parameter at the session level by wrapping your SQL import statements as follows:

```
SET autocommit=0;
... SQL import statements ...
SET autocommit=1;
```

All these cases lead to a reduced number of commits. Although this can be beneficial, it's also important to avoid making transactions too big. A single transaction should not be larger than a few hundred megabytes. So if you have a large transaction, you should split it into roughly 250 MB chunks and load them sequentially. This keeps transactions small and also reduces rollback time, should there be an issue.
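
As an illustration of batching commits while still keeping each transaction small, a dump file could be structured roughly as follows. The table name and values are placeholders, and the chunk boundaries (around 250 MB of statements each) are only indicative:

```
SET autocommit=0;

-- Chunk 1: multi-row INSERT statements totaling roughly 250 MB
INSERT INTO <tablename> VALUES (value1, value2), (value3, value4), ...;
COMMIT;

-- Chunk 2: next batch of multi-row INSERT statements
INSERT INTO <tablename> VALUES (value5, value6), (value7, value8), ...;
COMMIT;

SET autocommit=1;
```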

#### **Increase the size of innodb_log_file_size**

All data changes are logged in the redo logs, and these logs are flushed to disk periodically. When the logs are flushed depends on adaptive flushing, on the percentage of dirty pages reaching the [innodb_max_dirty_pages_pct_lwm](https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_max_dirty_pages_pct_lwm) value, and on [innodb_lru_scan_depth](https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_lru_scan_depth), which indicates how far to look for dirty pages to flush to disk. For more details regarding buffer pool flushing, refer to [Configuring Buffer Pool Flushing](https://dev.mysql.com/doc/refman/8.0/en/innodb-buffer-pool-flushing.html).

[innodb_log_file_size](https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_log_file_size) defines the checkpointing age. Small redo log files cause many disk writes, so increasing the size of the log file leads to less disk I/O.

The default value of `innodb_log_file_size` for Amazon RDS for MySQL is 128 MiB.

Note that a larger log file can increase crash recovery time. The combined size of the log files (`innodb_log_file_size * innodb_log_files_in_group`) can't exceed a maximum value that is slightly less than 512 GB. Generally, the combined size of the log files should be large enough that the server can smooth out peaks and troughs in workload activity, which often means that there is enough redo log space to handle more than an hour of write activity. For more information, refer to [How to calculate a good InnoDB log file size](https://www.percona.com/blog/2008/11/21/how-to-calculate-a-good-innodb-log-file-size/).

You can modify the value of this parameter in the custom parameter group associated with the RDS instance. This is a static parameter, so a reboot is required for the change to take effect.

Modifying this parameter helps keep write IOPS from becoming a bottleneck and therefore helps the performance of the data load.
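
For reference, a static parameter such as `innodb_log_file_size` can be changed in a custom DB parameter group using the AWS CLI along the following lines. The parameter group name and the 1 GiB value are examples only, not recommendations, and the change takes effect only after the instance is rebooted:

```
# Set innodb_log_file_size (in bytes) in a custom DB parameter group;
# static parameters require ApplyMethod=pending-reboot
aws rds modify-db-parameter-group \
    --db-parameter-group-name my-bulk-import-params \
    --parameters "ParameterName=innodb_log_file_size,ParameterValue=1073741824,ApplyMethod=pending-reboot"
```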

#### **Disable foreign key and unique key checks**

For the duration of importing data into the database, consider disabling [foreign key](https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_foreign_key_checks) and [unique key](https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_unique_checks) checks. Make sure that you don't have any duplicate keys in your dataset and that the data being inserted doesn't violate any referential constraints, because disabling these variables disables the associated checks. Disabling foreign key and unique key checks can save a lot of disk I/O for big tables. You can't change the values of these parameters via parameter groups. To change the value of these system variables, you can wrap your SQL import statements as follows. Note that the statements `SET unique_checks=1` and `SET foreign_key_checks=1` indicate that the checks are reactivated after the data load:

```
SET unique_checks=0;

... SQL import statements ...

SET unique_checks=1;

SET foreign_key_checks=0;

... SQL import statements ...

SET foreign_key_checks=1;
```

Disabling these checks helps keep write IOPS from becoming a bottleneck and therefore aids the performance of the data load.

### **Infrastructure configurations to improve load performance**

In this section, we discuss considerations regarding Amazon RDS infrastructure that can help improve data import performance.

#### **Source configuration**

If you're working with data that exists on an EC2 instance, we highly recommend having the source EC2 instance and target RDS instance in the same Region and the same Availability Zone. This reduces the network overhead, and data transfer within the same Availability Zone is free.

If you're loading data from an external on-premises source, it's recommended that, if possible, you upload the dataset to Amazon Web Services first and then load the data into the RDS instance. This can help reduce latency.

The source should be adequately sized in terms of I/O and network bandwidth. This ensures that resource contention on the source doesn't become a bottleneck and impact the load performance.

Make sure that while dumping and loading data, you're using the latest version of the MySQL client, because newer versions come with improvements and bug fixes.

#### **Scale the instance**

Consider scaling up the RDS database instance for the duration of the import, because importing data can require a higher-than-usual amount of resources such as CPU, EBS bandwidth, and memory. After the import completes successfully, you can scale the instance back down to the required instance type. For more information, refer to [Modifying an Amazon RDS DB instance](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Overview.DBInstance.Modifying.html).

Note that scaling the instance type requires downtime. The downtime for a Single-AZ instance can be higher because the amount of downtime is equal to the time required to replace the underlying host. In the case of a Multi-AZ instance, the downtime is equal to the time required for a Multi-AZ failover to complete.

If modification of the previously suggested parameters doesn't provide the desired results and you observe bottlenecks in resources such as CPU, EBS bandwidth, or memory, consider scaling your instance to a type with more resources to aid the performance of the data load.

#### **Disable automated backups**

Disabling automated backups on an instance removes any existing automated backups for the instance.

Disabling automated backups means disabling binary logging on RDS for MySQL database instances, which disables point-in-time recovery. Also, you can't disable automated backups if you have [read replicas](https://aws.amazon.com/rds/features/read-replicas/) associated with your RDS instance.

MySQL writes to the binary logs immediately after a statement or transaction is complete. During bulk loading of data, disabling binary logging can help reduce the overhead on resources and therefore improve the load performance.

If you disable automated backups, we highly recommend taking a manual snapshot of the RDS instance before starting the import operation. This way, if required, you can restore the database to a state before the data import started. You can disable automated backups on RDS instances by setting the backup retention period to 0 days. For more information, see [Disabling automated backups](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithAutomatedBackups.html#USER_WorkingWithAutomatedBackups.Disabling). Also, ensure that after the import is complete, you re-enable backups on your database instance by increasing the backup retention period to a value greater than 0, as per your compliance requirements.

Binary logging can cause increased writes to the disk. Disabling binary logging can help keep write IOPS from becoming a bottleneck and therefore aid the performance of the data load.

#### **Disable Multi-AZ**

By disabling Multi-AZ, you're no longer running your DB instance with high availability.

In a Multi-AZ setup, all the data written to the primary instance is synchronously replicated to the secondary instance. DB instances using Multi-AZ deployments can have increased write and commit latency compared to a Single-AZ deployment, due to the synchronous data replication that occurs. Disabling Multi-AZ when you're working with very large datasets can improve performance while bulk loading data.

Additionally, converting an instance from Single-AZ to Multi-AZ or vice versa is an online operation and requires no downtime. However, if your system is under a heavy (write-heavy) load, the conversion can take time, and the instance may remain in the modifying state for a long time because the primary and standby need to be in sync before the instance can be set to available.

Disabling Multi-AZ can help reduce write latencies on the database and therefore aid the performance of the data load.
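
The instance-level changes described in the Scale the instance, Disable automated backups, and Disable Multi-AZ sections can be applied together before the import starts. The following AWS CLI sketch uses a hypothetical instance identifier and instance class; you would revert these settings after the import completes:

```
# Temporarily scale up, disable automated backups (and binary logging), and drop Multi-AZ
# for the duration of the bulk import; revert these settings after the import completes
aws rds modify-db-instance \
    --db-instance-identifier my-import-target \
    --db-instance-class db.r5.2xlarge \
    --backup-retention-period 0 \
    --no-multi-az \
    --apply-immediately
```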

#### **Read Replica**

Any updates that are made to the primary DB instance are asynchronously replicated to the read replica. Consider creating read replicas after importing your data into the database. This saves the time that would otherwise be spent replicating the bulk changes after they are written to the source database. For example, if a change made on the primary instance takes an hour to run, the replication lag is one hour; because the change might also take one hour to complete on the replica, by the time the change is complete, the total lag is approximately two hours. For more information regarding replica lag and its causes, refer to [How can I troubleshoot high replica lag with Amazon RDS for MySQL?](https://aws.amazon.com/premiumsupport/knowledge-center/rds-mysql-high-replica-lag/)

### **Conclusion**

Bulk data loading into MySQL can be an expensive operation in terms of the time required to complete the process. In this post, I walked you through different considerations for bulk importing data into an RDS for MySQL instance.

If you have comments about this post, submit them in the comments section.

#### **About the author**

![image.png](https://dev-media.amazoncloud.cn/03100e3aeb774c96bb59f237f5e4a0d5_image.png)

**Asmita Varma** is a Technical Account Manager with Amazon Web Services. She helps customers navigate the challenges and complexities of the Amazon Web Services Cloud to attain operational excellence. Her focus areas have been MySQL and Amazon Aurora MySQL-Compatible Edition.