Backfilling an Amazon DynamoDB Time to Live attribute using Amazon EMR: Part 2

{"value":"[Amazon DynamoDB](https://aws.amazon.com/dynamodb/) is a fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale. DynamoDB can store and retrieve any amount of data. As the data grows and becomes cold, many use cases such as session management or order management require archival of the older unneeded items. DynamoDB has a feature called [Time to Live (TTL)](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/TTL.html), which expires and deletes items without consuming write capacity units (WCU). Managing item expiry with TTL can lower your storage and WCU (deletes) costs.\n\nWe suggest choosing a TTL attribute before writing data to your DynamoDB table. Sometimes, customers start using TTL with tables that already contain data. To efficiently manage backfilling DynamoDB TTL attributes, we recommended using [Amazon EMR](https://aws.amazon.com/cn/emr/?trk=cndc-detail) because it’s a highly scalable solution with built-in functionality for connecting with DynamoDB. You can run an [Amazon EMR](https://aws.amazon.com/cn/emr/?trk=cndc-detail) job after you modify your application to add a TTL attribute for all new items.\n\n[Part 1](https://aws.amazon.com/blogs/database/backfilling-an-amazon-dynamodb-time-to-live-ttl-attribute-with-amazon-emr/) of this series shows you how to create an [Amazon EMR](http://aws.amazon.com/emr) cluster and run a [Hive query](https://cwiki.apache.org/confluence/display/Hive/LanguageManual) in the cluster to backfill a TTL attribute in the items that are missing it. The method works when there are no complex or collection data types such as maps or lists in the data. In this post, I show you how to backfill TTL attributes in items when data has map, list, Boolean, or null data types, which aren’t natively supported by DynamoDBStorageHandler in Hive. The steps in the post work whether the set of data attributes for each item is uniform or discrete.\n\nThis technique works best for insert-only workloads or during periods of maintenance when updates and deletes aren’t allowed. If data is being updated or deleted during the process, it may be unintentionally deleted.\n\n#### **DynamoDB schema**\n\nAlthough DynamoDB only defines and enforces primary key attributes, data for most applications has some form of schema. In this scenario, the current schema contains the following attributes:\n\n- **order_id** – The partition key, which is a string in universally unique identifier (UUID) form\n- **creation_timestamp** – A string that represents the item’s creation timestamp in ISO 8601 format\n- **delivery_address** – A map that represents the order’s delivery address\n- **is_cod** – A Boolean value that represents if the payment mode is cash on delivery\n- **item_list** – A list of ordered product codes\n\nThis post uses a table called ```Orders``` with 2 million items that are missing the ```expiration_timestamp``` attribute. 
#### **DynamoDB schema**

Although DynamoDB only defines and enforces primary key attributes, data for most applications has some form of schema. In this scenario, the current schema contains the following attributes:

- **order_id** – The partition key, a string in universally unique identifier (UUID) form
- **creation_timestamp** – A string that represents the item's creation timestamp in ISO 8601 format
- **delivery_address** – A map that represents the order's delivery address
- **is_cod** – A Boolean value that indicates whether the payment mode is cash on delivery
- **item_list** – A list of ordered product codes

This post uses a table called `Orders` with 2 million items that are missing the `expiration_timestamp` attribute. Figure 1 shows a sample of the items in the `Orders` table.

![image.png](https://dev-media.amazoncloud.cn/af7b176be96249c4918d7fbba6465b70_image.png)
Figure 1: Sample of the Orders table

Fetching one of the items from this table with the [AWS Command Line Interface (AWS CLI)](http://aws.amazon.com/cli) [get-item](https://docs.aws.amazon.com/cli/latest/reference/dynamodb/get-item.html) call, you can see that there is no `expiration_timestamp` attribute present:

```
aws dynamodb get-item --table-name Orders --key '{"order_id":{"S":"e9bba98e-d579-43bb-a571-93ccdb32c960"}}'
{
    "Item": {
        "order_id": {
            "S": "e9bba98e-d579-43bb-a571-93ccdb32c960"
        },
        "delivery_address": {
            "M": {
                "city": {
                    "S": "Jonathan"
                },
                "state": {
                    "S": "VA"
                },
                "door": {
                    "S": "5482"
                },
                "pin": {
                    "S": "28458"
                }
            }
        },
        "creation_timestamp": {
            "S": "2022-02-01T20:23:25Z"
        },
        "is_cod": {
            "BOOL": true
        },
        "item_list": {
            "L": [
                {
                    "S": "pglyg"
                }
            ]
        }
    }
}
```

You add a TTL attribute called `expiration_timestamp` to each of these items using Hive commands. In this case, you want to delete an order 180 days after it was created, so the `expiration_timestamp` attribute contains a number that represents the item's expiration time in seconds since the epoch, 180 days after the `creation_timestamp`.

After the backfill process, you can use the same call to verify that the TTL attribute has been added to this item.
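As a quick sanity check of the arithmetic: 180 days is `60*60*24*180` = 15,552,000 seconds. You can compute the expected value for this item locally, for example with GNU `date` (a verification sketch only, not part of the backfill):

```
# Creation time plus 180 days, in seconds since the epoch
date -u -d '2022-02-01T20:23:25Z + 180 days' +%s
# 1659299005
```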
#### **Running Hive CLI commands**

You can calculate the new TTL attribute on a per-item basis from another timestamp attribute that already exists in each item. For more information about creating an Amazon EMR cluster, connecting to it through SSH, and sizing considerations, refer to [Part 1](https://aws.amazon.com/blogs/database/backfilling-an-amazon-dynamodb-time-to-live-ttl-attribute-with-amazon-emr/) of this series.

This post uses an Amazon EMR cluster of version 5.34 with five nodes (one master and four core nodes) of type c5.4xlarge.

1. Log in to the master node of the cluster and start the Hive CLI by entering `hive` on the master node terminal. Create a database and switch to it using the following commands:

```
hive> show databases;
OK
default
Time taken: 0.632 seconds, Fetched: 1 row(s)

hive> create database dynamodb_ttl;
OK
Time taken: 0.221 seconds

hive> use dynamodb_ttl;
OK
Time taken: 0.039 seconds
```

2. Now create a wrapper over the DynamoDB table so that Hive queries can be run against it. `DynamoDBStorageHandler` has no column mappings for the map, list, and Boolean data types that this data contains. So, create a single column called `item` that exposes the entire DynamoDB item as a map whose keys and values are both strings; each attribute value is the serialized DynamoDB JSON representation of that attribute, for example `{"s":"2022-02-01T20:23:25Z"}`. Include another column called `expiration_timestamp` of type `bigint`, where the TTL values will actually be backfilled. See the following Hive query:

```
hive> CREATE EXTERNAL TABLE Orders(item map<string,string>, expiration_timestamp bigint COMMENT 'from deserializer')
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ('dynamodb.table.name'='Orders', "dynamodb.column.mapping" = "expiration_timestamp:expiration_timestamp");
```
- The backfill time largely depends on two factors: the DynamoDB capacity available for the process, and the number of nodes in the Amazon EMR cluster. For this test, the table is in provisioned throughput mode with `ProvisionedReadCapacity` and `ProvisionedWriteCapacity` set to `40000` each.

3. With some Amazon EMR versions (5.25.0 to 5.34.0), you must set the `dynamodb.throughput.write` and `dynamodb.throughput.read` parameters in Hive:

```
hive> SET dynamodb.throughput.write=40000;

hive> SET dynamodb.throughput.read=40000;
```

4. Now run the following command to find the number of items in the table without `expiration_timestamp`:

```
hive> select count(*) from Orders where item["expiration_timestamp"] IS NULL;

Query ID = hadoop_20220328145453_32f2eb87-825e-4838-9aeb-845f4e339cc8
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1648478810565_0001)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 ..........      container     SUCCEEDED     62         62        0        0       0       0
Reducer 2 ......      container     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 19.88 s
----------------------------------------------------------------------------------------------
OK
2000000
Time taken: 20.597 seconds, Fetched: 1 row(s)
```

5. As the preceding output shows, there are 2 million items without the TTL attribute. Now run an INSERT OVERWRITE command with the `regexp_replace()` function to carry out the calculation and backfill in place. If data is updated or deleted between the time the SELECT part of the query runs and the time the INSERT OVERWRITE part runs, you could end up with incorrect results. This is why we caution you to use this technique only for insert-only workloads or during a maintenance window when no updates or deletes occur. To calculate the `expiration_timestamp` value, first use `regexp_replace()` to strip the serialized wrapper from `creation_timestamp` and turn the ISO 8601 string into the `yyyy-MM-dd HH:mm:ss` format expected by Hive's `unix_timestamp` function. Then convert the result with `unix_timestamp` into an integer timestamp measured in seconds since the epoch, as required by the [DynamoDB TTL functionality](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/time-to-live-ttl-before-you-start.html#time-to-live-ttl-before-you-start-formatting), and add 180 days (`60*60*24*180` seconds), because in this case items need to expire 180 days after their `creation_timestamp`.
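Because INSERT OVERWRITE rewrites every item in place, it can be worth previewing the computed values with a read-only query first. The following sketch uses the same expression as the backfill query in the next step to print a few creation timestamps alongside their computed expiration timestamps, without writing anything:

```
hive> SELECT item["creation_timestamp"],
             unix_timestamp(regexp_replace(item["creation_timestamp"], '\\{\"s\":\"(.+?)T(.+?)Z\"\\}$', '$1 $2')) + (60*60*24*180)
      FROM Orders LIMIT 5;
```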
When the computed values look correct, run the backfill with the following Hive query:

```
hive> INSERT OVERWRITE TABLE Orders SELECT item,(unix_timestamp(regexp_replace(item["creation_timestamp"], '\\{\"s\":\"(.+?)T(.+?)Z\"\\}$','$1 $2')) + (60*60*24*180)) FROM Orders;
Query ID = hadoop_20220329170732_cf1168d9-4604-49c9-9a14-db0acdd7cb27
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1648571110202_0004)
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 ..........      container     SUCCEEDED     62         62        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 124.17 s
----------------------------------------------------------------------------------------------
OK
Time taken: 125.11 seconds
```

6. If you instead switch the table to on-demand throughput mode (the capacity-mode switch itself is sketched after the output below), the table still accommodates its previous peak of 40,000 reads and 40,000 writes. Running the same test again, you can observe that the elapsed time is similar to the earlier test with provisioned capacity:

```
hive> INSERT OVERWRITE TABLE Orders SELECT item,(unix_timestamp(regexp_replace(item["creation_timestamp"], '\\{\"s\":\"(.+?)T(.+?)Z\"\\}$','$1 $2')) + (60*60*24*180)) FROM Orders;
Query ID = hadoop_20220329170155_047dee68-e499-4383-ba08-2d6d77029d47
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1648571110202_0004)
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 ..........      container     SUCCEEDED     62         62        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 124.34 s
----------------------------------------------------------------------------------------------
OK
Time taken: 125.337 seconds
```
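The capacity-mode switch referenced in step 6 is a single AWS CLI call. A sketch (note that DynamoDB limits how frequently a table can change capacity modes):

```
# Switch the Orders table from provisioned to on-demand capacity
aws dynamodb update-table --table-name Orders --billing-mode PAY_PER_REQUEST
```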
7. To verify that all items now have the `expiration_timestamp` attribute, run the same Hive count query again:

```
hive> select count(*) from Orders where item["expiration_timestamp"] IS NULL;

Query ID = hadoop_20220313115251_ed5d2e68-b855-4cb1-a6d3-dd1efb63b36a
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1647166731942_0011)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 ..........      container     SUCCEEDED     62         62        0        0       0       0
Reducer 2 ......      container     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 21.06 s
----------------------------------------------------------------------------------------------
OK
0
Time taken: 21.849 seconds, Fetched: 1 row(s)
```

As the output shows, there are no longer any items missing the TTL attribute value. To look at the expiration timestamp of the earlier item, run the same `get-item` call:

```
aws dynamodb get-item --table-name Orders --key '{"order_id":{"S":"e9bba98e-d579-43bb-a571-93ccdb32c960"}}'
{
    "Item": {
        "delivery_address": {
            "M": {
                "city": {
                    "S": "Jonathan"
                },
                "state": {
                    "S": "VA"
                },
                "door": {
                    "S": "5482"
                },
                "pin": {
                    "S": "28458"
                }
            }
        },
        "is_cod": {
            "BOOL": true
        },
        "item_list": {
            "L": [
                {
                    "S": "pglyg"
                }
            ]
        },
        "order_id": {
            "S": "e9bba98e-d579-43bb-a571-93ccdb32c960"
        },
        "creation_timestamp": {
            "S": "2022-02-01T20:23:25Z"
        },
        "expiration_timestamp": {
            "N": "1659299005"
        }
    }
}
```

The epoch value `1659299005` corresponds to `2022-07-31T20:23:25Z` according to [Epoch Converter](https://www.epochconverter.com/), which is 180 days after this item's `creation_timestamp` value of `2022-02-01T20:23:25Z`.
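You can also confirm the conversion locally; with GNU `date`, for example, the epoch value maps back to the expected timestamp:

```
date -u -d @1659299005 +%Y-%m-%dT%H:%M:%SZ
# 2022-07-31T20:23:25Z
```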
#### **Clean up**

If you don't need the Amazon EMR cluster anymore, [delete](https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_TerminateJobFlow.html) it after the backfill is complete to avoid unnecessary costs.

#### **Conclusion**

In this post, I showed you how to use Hive queries to backfill the TTL attribute in a DynamoDB table that contains complex data types. Hive offers a flexible way to detect, calculate, and overwrite data based on conditions such as missing attributes, along with rate-limiting the throughput consumed by the backfill.

Additional reading:

- [Getting started with DynamoDB](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GettingStartedDynamoDB.html)
- [Hive tutorial](https://cwiki.apache.org/confluence/display/Hive/Tutorial)
- [Creating an external table in Hive](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.ExternalTableForDDB.html)
- [Date functions in Apache Hive](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions)

#### **About the author**

![image.png](https://dev-media.amazoncloud.cn/21b59e6ea949442fbaf0dd6599b366ae_image.png)

**Juhi Patil** is a London-based DynamoDB Specialist Solutions Architect with a background in big data technologies. In her current role, she helps customers design, evaluate, and optimize their DynamoDB-based solutions.