Accelerate Amazon DynamoDB data access in AWS Glue jobs using the new AWS Glue DynamoDB Export connector

{"value":"[Modern data architectures ](https://aws.amazon.com/big-data/datalakes-and-analytics/modern-data-architecture/)encourage the integration of data lakes, data warehouses, and purpose-built data stores, enabling unified governance and easy data movement. With a modern data architecture on AWS, you can store data in a data lake and use a ring of purpose-built data services around the lake, allowing you to make decisions with speed and agility.\n\nTo achieve a modern data architecture, [AWS Glue](https://aws.amazon.com/glue) is the key service that integrates data over a data lake, data warehouse, and purpose-built data stores. [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) simplifies data movement like inside-out, outside-in, or around the perimeter. A powerful purpose-built data store is [Amazon DynamoDB](https://aws.amazon.com/dynamodb/), which is widely used by hundreds of thousands of companies, including Amazon.com. It’s common to move data from DynamoDB to a data lake built on top of [Amazon Simple Storage Service](http://aws.amazon.com/s3) ([Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail)). Many customers move data from DynamoDB to [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) using [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) extract, transform, and load (ETL) jobs.\n\n![image.png](1)\n\nToday, we’re pleased to announce the general availability of a new [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) DynamoDB export connector. It’s built on top of the [DynamoDB table export feature](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataExport.html). It’s a scalable and cost-efficient way to read large DynamoDB table data in [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) ETL jobs. This post describes the benefit of this new export connector and its use cases.\n\nThe following are typical use cases to read from DynamoDB tables using [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) ETL jobs:\n\n- Move the data from DynamoDB tables to different data stores\n- Integrate the data with other services and applications\n- Retain historical snapshots for auditing\n- Build an S3 data lake from the DynamoDB data and analyze the data from various services, such as [Amazon Athena](http://aws.amazon.com/athena), [Amazon Redshift](http://aws.amazon.com/redshift), and [Amazon SageMaker](https://aws.amazon.com/sagemaker/)\n\n\n#### **The new [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) DynamoDB export connector**\n\n\nThe old version of the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) DynamoDB connector reads DynamoDB tables through the [DynamoDB Scan API](https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Scan.html). Instead, the new [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) DynamoDB export connector reads DynamoDB data from the snapshot, which is exported from DynamoDB tables. 
This approach has the following benefits:

- It doesn't consume read capacity units of the source DynamoDB tables
- The read performance is consistent for large DynamoDB tables

For large DynamoDB tables (more than 100 GB in particular), this new connector is significantly faster than the traditional connector.

To use this new export connector, you need to enable [point-in-time recovery](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/PointInTimeRecovery.html) (PITR) for the source DynamoDB table in advance.
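If PITR isn't enabled on your source table yet, you can turn it on ahead of time through the DynamoDB console, the AWS CLI, or an SDK. The following is a minimal boto3 sketch (not part of the connector itself); the table name `test_source` is a placeholder:

```
import boto3

# Placeholder table name; replace with your own source DynamoDB table
dynamodb = boto3.client("dynamodb")

# Enable point-in-time recovery (PITR), which the export connector requires
dynamodb.update_continuous_backups(
    TableName="test_source",
    PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
)
```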
#### **How to use the new connector on AWS Glue Studio Visual Editor**

AWS Glue Studio Visual Editor is a graphical interface that makes it easy to create, run, and monitor AWS Glue ETL jobs in AWS Glue. The new DynamoDB export connector is available on AWS Glue Studio Visual Editor. You can choose Amazon DynamoDB as the source.

![image.png](2)

After you choose **Create**, you see the visual Directed Acyclic Graph (DAG). Here, you can choose your DynamoDB table that exists in this account or Region. This allows you to select DynamoDB tables (with PITR enabled) directly as a source in AWS Glue Studio. This provides a one-click export from any of your DynamoDB tables to Amazon S3. You can also easily add any data sources and targets or transformations to the DAG. For example, it allows you to join two different DynamoDB tables and export the result to Amazon S3, as shown in the following screenshot.

![image.png](3)

The following two connection options are automatically added. This location is used to store temporary data during the DynamoDB export phase. You can set [S3 bucket lifecycle policies](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) to expire temporary data.

- **dynamodb.s3.bucket** – The S3 bucket to store temporary data during DynamoDB export
- **dynamodb.s3.prefix** – The S3 prefix to store temporary data during DynamoDB export

#### **How to use the new connector in the job script code**

You can use the new export connector when you create an AWS Glue DynamicFrame in the job script code by configuring the following connection options:

- **dynamodb.export** – (Required) Set this to ddb or s3
- **dynamodb.tableArn** – (Required) Your source DynamoDB table ARN
- **dynamodb.unnestDDBJson** – (Optional) If set to true, performs an unnest transformation of the DynamoDB JSON structure that is present in exports. The default value is false.
- **dynamodb.s3.bucket** – (Optional) The S3 bucket to store temporary data during DynamoDB export
- **dynamodb.s3.prefix** – (Optional) The S3 prefix to store temporary data during DynamoDB export

The following is the sample Python code to create a DynamicFrame using the new export connector:

```
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.export": "ddb",
        "dynamodb.tableArn": "test_source",
        "dynamodb.unnestDDBJson": True,
        "dynamodb.s3.bucket": "bucket name",
        "dynamodb.s3.prefix": "bucket prefix"
    }
)
```

The new export connector doesn't require configurations related to AWS Glue job parallelism, unlike the old connector. You no longer need to change the configuration when you scale out the AWS Glue job. It also doesn't require any configuration regarding DynamoDB table read/write capacity and its capacity mode (on-demand or provisioned).

#### **DynamoDB table schema handling**

By default, the new export connector reads data in the DynamoDB JSON structure that is present in exports. The following is an example schema of the frame using the [Amazon Customer Review Dataset](https://s3.amazonaws.com/amazon-reviews-pds/readme.html):

```
root
|-- Item: struct (nullable = true)
|    |-- product_id: struct (nullable = true)
|    |    |-- S: string (nullable = true)
|    |-- review_id: struct (nullable = true)
|    |    |-- S: string (nullable = true)
|    |-- total_votes: struct (nullable = true)
|    |    |-- N: string (nullable = true)
|    |-- product_title: struct (nullable = true)
|    |    |-- S: string (nullable = true)
|    |-- star_rating: struct (nullable = true)
|    |    |-- N: string (nullable = true)
|    |-- customer_id: struct (nullable = true)
|    |    |-- S: string (nullable = true)
|    |-- marketplace: struct (nullable = true)
|    |    |-- S: string (nullable = true)
|    |-- helpful_votes: struct (nullable = true)
|    |    |-- N: string (nullable = true)
|    |-- review_headline: struct (nullable = true)
|    |    |-- S: string (nullable = true)
|    |    |-- NULL: boolean (nullable = true)
|    |-- review_date: struct (nullable = true)
|    |    |-- S: string (nullable = true)
|    |-- vine: struct (nullable = true)
|    |    |-- S: string (nullable = true)
|    |-- review_body: struct (nullable = true)
|    |    |-- S: string (nullable = true)
|    |    |-- NULL: boolean (nullable = true)
|    |-- verified_purchase: struct (nullable = true)
|    |    |-- S: string (nullable = true)
|    |-- product_category: struct (nullable = true)
|    |    |-- S: string (nullable = true)
|    |-- year: struct (nullable = true)
|    |    |-- N: string (nullable = true)
|    |-- product_parent: struct (nullable = true)
|    |    |-- S: string (nullable = true)
```
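If you keep this default nested structure (that is, `dynamodb.unnestDDBJson` is left at its default of false), you can still project individual attributes yourself in the job script. The following is a minimal sketch, not from the original post, that converts the DynamicFrame to a Spark DataFrame and selects a few attributes out of the DynamoDB JSON; the column names follow the schema shown above:

```
from pyspark.sql.functions import col

# Convert the DynamicFrame to a Spark DataFrame and pull selected
# attributes out of the nested DynamoDB JSON structure
df = dyf.toDF()
flattened = df.select(
    col("Item.product_id.S").alias("product_id"),
    col("Item.star_rating.N").cast("int").alias("star_rating"),
    col("Item.review_headline.S").alias("review_headline"),
)
flattened.show(5)
```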
To read DynamoDB item columns without handling nested data, you can set **dynamodb.unnestDDBJson** to `True`. The following is an example of the schema of the same data where **dynamodb.unnestDDBJson** is set to `True`:

```
root
|-- product_id: string (nullable = true)
|-- review_id: string (nullable = true)
|-- total_votes: string (nullable = true)
|-- product_title: string (nullable = true)
|-- star_rating: string (nullable = true)
|-- customer_id: string (nullable = true)
|-- marketplace: string (nullable = true)
|-- helpful_votes: string (nullable = true)
|-- review_headline: string (nullable = true)
|-- review_date: string (nullable = true)
|-- vine: string (nullable = true)
|-- review_body: string (nullable = true)
|-- verified_purchase: string (nullable = true)
|-- product_category: string (nullable = true)
|-- year: string (nullable = true)
|-- product_parent: string (nullable = true)
```

#### **Data freshness**

Data freshness is the measure of how stale the data is compared to the live tables in the original source. In the new export connector, the option `dynamodb.export` impacts data freshness.

When **dynamodb.export** is set to `ddb`, the AWS Glue job invokes a new export and then reads the export placed in an S3 bucket into a DynamicFrame. It reads exports of the live table, so data can be fresh. On the other hand, when **dynamodb.export** is set to `s3`, the AWS Glue job skips invoking a new export and directly reads an export already placed in an S3 bucket. It reads exports of the past table, so data can be stale, but you reduce the overhead of triggering new exports.

The following table explains the data freshness and the pros and cons of each option.

![image.png](4)
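For the `s3` option, a minimal sketch (not from the original post) might look like the following. The table ARN, bucket, and prefix values are placeholders, and the prefix is assumed to point at an export that already exists in Amazon S3:

```
# Read a past export already stored in Amazon S3 instead of invoking a new one;
# the table ARN, bucket, and prefix below are placeholders
dyf_from_export = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.export": "s3",
        "dynamodb.tableArn": "test_source",
        "dynamodb.unnestDDBJson": True,
        "dynamodb.s3.bucket": "bucket name",
        "dynamodb.s3.prefix": "bucket prefix"
    }
)
```

Because no new export is invoked, the job reads whatever snapshot the specified prefix points to.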
#### **Performance**

The following benchmark shows the performance improvements between the old version of the AWS Glue DynamoDB connector and the new export connector. The comparison uses DynamoDB tables storing [the TPC-DS benchmark dataset](http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v2.5.0.pdf) at different scales from 10 MB to 2 TB. The sample Spark job reads from the DynamoDB table and calculates the count of the items. All the Spark jobs are run on AWS Glue 3.0, G.2X, 60 workers.

The following chart compares AWS Glue job duration between the old connector and the new export connector. For small DynamoDB tables, the old connector is faster. For large tables of more than 80 GB, the new export connector is faster. In other words, the DynamoDB export connector is recommended for jobs that take the old connector more than 5–10 minutes to run. Also, the chart shows that the duration of the new export connector increases only slowly as data size increases, whereas the duration of the old connector increases rapidly as data size increases. This means that the new export connector is especially suitable for larger tables.

![image.png](5)

#### **With AWS Glue Auto Scaling**

AWS Glue Auto Scaling is a new feature to automatically resize computing resources for better performance at lower cost. You can take advantage of AWS Glue Auto Scaling with the new DynamoDB export connector.

As the following chart shows, with AWS Glue Auto Scaling, the duration of the new export connector is shorter than the old connector when the size of the source DynamoDB table is 100 GB or more. It shows a similar trend without AWS Glue Auto Scaling.

![image.png](6)

You also get cost benefits because only the Spark driver is active for most of the duration of the DynamoDB export (which is nearly 30% of the total job duration time with the old scan-based connector).

#### **Conclusion**

AWS Glue is a key service to integrate with multiple data stores. At AWS, we keep improving the performance and cost-efficiency of our services. In this post, we announced the availability of the new AWS Glue DynamoDB export connector. With this new connector, you can easily integrate your large data on DynamoDB tables with different data stores. It helps you read the large tables faster from AWS Glue jobs at lower cost.

The new AWS Glue DynamoDB export connector is now generally available in all supported AWS Glue Regions. Let's start using the new AWS Glue DynamoDB export connector today! We look forward to your feedback and stories on how you utilize the connector for your needs.

#### **About the Authors**

![image.png](7)

**Noritaka Sekiyama** is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts that help customers build data lakes on the cloud.

![image.png](8)

**Neil Gupta** is a Software Development Engineer on the AWS Glue team. He enjoys tackling big data problems and learning more about distributed systems.

![image.png](9)

**Andrew Kim** is a Software Development Engineer on the AWS Glue team. His passion is building scalable and effective solutions to challenging problems and working with distributed systems.

![image.png](10)

**Savio Dsouza** is a Software Development Manager on the AWS Glue team. His team works on distributed systems for efficiently managing data lakes on AWS and optimizing Apache Spark for performance and reliability.