Analyze Amazon Ion datasets using Amazon Athena

海外精选
海外精选的内容汇集了全球优质的亚马逊云科技相关技术内容。同时,内容中提到的“AWS” 是 “Amazon Web Services” 的缩写,在此网站不作为商标展示。
0
0
{"value":"<p><a href=\"http://aws.amazon.com/athena\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Athena</a> is an interactive query service that makes it easy to analyze data in <a href=\"http://aws.amazon.com/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service</a> (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.</p>\n<p><a href=\"https://amzn.github.io/ion-docs/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Ion</a> is a <a href=\"https://amzn.github.io/ion-docs/guides/why.html#rich-type-system\" target=\"_blank\" rel=\"noopener noreferrer\">richly typed</a>, <a href=\"https://amzn.github.io/ion-docs/guides/why.html#self-describing\" target=\"_blank\" rel=\"noopener noreferrer\">self-describing</a>, hierarchical data serialization format offering <a href=\"https://amzn.github.io/ion-docs/guides/why.html#dual-format-interoperability\" target=\"_blank\" rel=\"noopener noreferrer\">interchangeable binary and text</a> representations. The <a href=\"https://amzn.github.io/ion-docs/docs/spec.html\" target=\"_blank\" rel=\"noopener noreferrer\">text format</a> extends <a href=\"http://json.org/\" target=\"_blank\" rel=\"noopener noreferrer\">JSON</a> (meaning all JSON files are valid Ion files), and is easy to read and author, supporting rapid prototyping. The <a href=\"https://amzn.github.io/ion-docs/docs/binary.html\" target=\"_blank\" rel=\"noopener noreferrer\">binary representation</a> is <a href=\"https://amzn.github.io/ion-docs/guides/why.html#read-optimized-binary-format\" target=\"_blank\" rel=\"noopener noreferrer\">efficient to store, transmit, and skip-scan parse</a>. The rich type system provides unambiguous semantics for long-term preservation of data that can survive multiple generations of software evolution.</p>\n<p>Athena now supports querying and writing data in Ion format. The Ion format is currently used by internal Amazon teams, by external services such as <a href=\"https://aws.amazon.com/qldb/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Quantum Ledger Database</a> (Amazon QLDB) and <a href=\"https://aws.amazon.com/dynamodb/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon DynamoDB</a> (which can be exported into Ion), and in the open-source SQL query language PartiQL.</p>\n<p>In this post, we discuss use cases and the unique features Ion offers, followed by examples of querying Ion with Athena. For demonstration purposes, we use the transformed version of the <a href=\"https://github.com/zemirco/sf-city-lots-json\" target=\"_blank\" rel=\"noopener noreferrer\">City Lots San Francisco</a> dataset.</p>\n\n### **Features of Ion**\n<p>In this section, we discuss some of the unique features that Ion offers:</p>\n<ul> \n <li>Type system</li> \n <li>Dual format</li> \n <li>Efficiency gains</li> \n <li>Skip scanning</li> \n </ul>\n\n#### **Type system**\n<p>Ion extends JSON, adding support for more precise <a href=\"https://amzn.github.io/ion-docs/docs/spec.html\" target=\"_blank\" rel=\"noopener noreferrer\">data types</a> to improve interpretability, simplify processing, and avoid rounding errors. These high precision numeric types are essential for financial services, where fractions of a cent on every transaction add up. Data types that are added are arbitrary-size integers, binary floating-point numbers, infinite-precision decimals, timestamps, CLOBS, and BLOBS.</p>\n\n#### **Dual format**\n<p>Users can be presented with a familiar text-based representation while benefiting from the performance efficiencies of a binary format. The interoperability between the two formats enables you to rapidly discover, digest, and interpret data in a familiar JSON-like representation, while underlying applications benefit from a reduction in storage, memory, network bandwidth, and latency from the binary format. This means you can write plain text queries that run against both text-based and binary-based Ion. You can rewrite parts of your data in text-based Ion when you need human readable data during development and switch to binary in production.</p>\n<p>When debugging a process, the ability for systems engineers to locate data and understand it as quickly as possible is vital. Ion provides mechanisms to move between binary and a text-based representation, optimizing for both the human and the machine. Athena supports querying and writing data in both of these Ion formats. The following is an example Ion text document taken from the transformed version of the <code>citylots</code> dataset:</p>\n\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-code\"><code class=\" language-code\">{ \"type\": \"Feature\"\n, \"properties\": { \"MAPBLKLOT\": \"0004002\"\n ,\"BLKLOT\": \"0004002\"\n ,\"BLOCK_NUM\": \"0004\"\n , \"LOT_NUM\": \"002\"\n , \"FROM_ST\": \"0\"\n , \"TO_ST\": \"0\"\n , \"STREET\": \"UNKNOWN\"\n , \"ST_TYPE\": null\n , \"ODD_EVEN\": \"E\" }\n, \"geometry\": { \"type\": \"Polygon\"\n , \"coordinates\": [ [ [ -122.415701204606876, 37.808327252671461, 0.0 ],\n [ -122.415760743593196, 37.808630700240904, 0.0 ],\n [ -122.413787891332404, 37.808566801319841, 0.0 ],\n [ -122.415701204606876, 37.808327252671461, 0.0 ] ] ] } }</code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"><span>Code</span></div></div></div> \n </div>\n\n#### **Efficiency gains**\n<p>Binary-encoded Ion reduces file size by moving repeated values, such as field names, into a <a href=\"https://amzn.github.io/ion-docs/docs/spec.html#symbol\" target=\"_blank\" rel=\"noopener noreferrer\">symbol</a> table. Symbol tables reduce CPU and read latency by limiting the validation of character encoding to the single instance of the value in the symbol table.</p>\n<p>For example, a company that operates at Amazon’s scale can produce large volumes of application logs. When compressing Ion and JSON logs, we noticed approximately 35% less CPU time to compress the log, which produced an average of roughly 26% smaller files. Log files are critical when needed but costly to retain, so the reduction in file sizes combined with the read performance gains from symbol tables helps when handling these logs. The following is an example of file size reduction with the <code>citylots</code> JSON dataset when converted to Ion binary with GZIP and ZSTD compression:</p>\n\n\n```\n77MB citylots.ion\n 17MB citylots.ion.gz\n 15MB citylots.ion.zst\n181MB citylots.json\n 22MB citylots.json.gz\n 18MB citylots.json.zst\n```\n#### **Skip-scanning**\n<p>In a textual format, every byte must be read and interpreted, but because Ion’s binary format is a TLV (type-length-value) encoding, an application may skip over elements that aren’t needed. This reduces query and application processing costs correlated with the proportion of unexamined fields.</p>\n<p>For example, forensic analysis of application log data involves reading large volumes of data where only a fraction of the data is needed for diagnosis. In these scenarios, skip-scanning allows the binary Ion reader to move past irrelevant fields without the cost of reading the element stored within a field. This results in users experiencing lower resource usage and quicker response times.</p>\n\n### **Query Ion datasets using Athena**\n<p>Athena now supports querying and creating Ion-formatted datasets via an <a href=\"https://github.com/amzn/ion-hive-serde\" target=\"_blank\" rel=\"noopener noreferrer\">Ion-specific SerDe</a>, which in conjunction with <code>IonInputFormat</code> and <code>IonOutputFormat</code> allows you to read and write valid Ion data. Deserialization allows you to run SELECT queries on the Ion data so that it can be queried to gain insights. Serialization through CTAS or INSERT INTO queries allows you to copy datasets from existing tables’ values or generate new data in the Ion format.</p>\n<p>The interchangeable nature of Ion text and Ion binary means that Athena can read datasets that contain both types of files. Because Ion is a superset of JSON, a table using the Ion SerDe can also include JSON files. Unlike the <a href=\"https://docs.aws.amazon.com/athena/latest/ug/json-serde.html\" target=\"_blank\" rel=\"noopener noreferrer\">JSON SerDe</a>, where every new line character indicates a new row, the Ion SerDe uses a combination of closing brackets and new line characters to determine new rows. This means that if each JSON record in your source documents isn’t on a single line, these files can now be read in Athena via the Ion SerDe.</p>\n\n### **Create external tables**\n<p>Athena supports querying Ion-based datasets by defining <a href=\"https://aws.amazon.com/glue\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Glue</a> tables with the user-defined metadata. Let’s start with an example of creating an external table for a dataset stored in Ion text. The following is a sample row from the <code>citylots</code> dataset:</p>\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-code\"><code class=\" language-code\">{\n type:\"Feature\",\n properties:{\n mapblklot:\"0579021\",\n blklot:\"0579024\",\n block_num:\"0579\",\n lot_num:\"024\",\n from_st:\"2160\",\n to_st:\"2160\",\n street:\"PACIFIC\",\n st_type:\"AVE\",\n odd_even:\"E\"\n },\n geometry:{\n type:\"Polygon\",coordinates:[[[-122.4308798855922, ...]]]\n }\n}</code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"><span></span></div></div></div> \n </div>\n<p>To create an external table that has its data stored in Ion, you have two syntactic options.</p>\n<p>First, you can specify STORED AS ION. This is a more concise method, and is best used for simple cases, when no additional properties are required. See the following code:</p>\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-sql\"><code class=\" language-sql\"><span class=\"token keyword\">CREATE</span> EXTERNAL <span class=\"token keyword\">TABLE</span> city_lots_ion1 <span class=\"token punctuation\">(</span>\n <span class=\"token keyword\">type</span> STRING<span class=\"token punctuation\">,</span> \n properties struct<span class=\"token operator\">&lt;</span>\n mapblklot:string<span class=\"token punctuation\">,</span>\n blklot:string<span class=\"token punctuation\">,</span>\n block_num:string<span class=\"token punctuation\">,</span>\n lot_num:string<span class=\"token punctuation\">,</span>\n from_st:string<span class=\"token punctuation\">,</span>\n to_st:string<span class=\"token punctuation\">,</span>\n street:string<span class=\"token punctuation\">,</span>\n st_type:string<span class=\"token punctuation\">,</span>\n odd_even:string<span class=\"token operator\">&gt;</span><span class=\"token punctuation\">,</span> \n <span class=\"token keyword\">geometry</span> struct<span class=\"token operator\">&lt;</span>\n <span class=\"token keyword\">type</span>:string<span class=\"token punctuation\">,</span>\n coordinates:array<span class=\"token operator\">&lt;</span>array<span class=\"token operator\">&lt;</span>array<span class=\"token operator\">&lt;</span><span class=\"token keyword\">decimal</span><span class=\"token punctuation\">(</span><span class=\"token number\">18</span><span class=\"token punctuation\">,</span><span class=\"token number\">15</span><span class=\"token punctuation\">)</span><span class=\"token operator\">&gt;&gt;</span><span class=\"token operator\">&gt;</span><span class=\"token punctuation\">,</span>\n multi_coordinates:array<span class=\"token operator\">&lt;</span>array<span class=\"token operator\">&lt;</span>array<span class=\"token operator\">&lt;</span>array<span class=\"token operator\">&lt;</span><span class=\"token keyword\">decimal</span><span class=\"token punctuation\">(</span><span class=\"token number\">18</span><span class=\"token punctuation\">,</span><span class=\"token number\">15</span><span class=\"token punctuation\">)</span><span class=\"token operator\">&gt;&gt;</span><span class=\"token operator\">&gt;&gt;</span><span class=\"token operator\">&gt;</span>\n<span class=\"token punctuation\">)</span>\nSTORED <span class=\"token keyword\">AS</span> ION\nLOCATION <span class=\"token string\">'s3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_ion_binary/'</span></code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"><span></span></div></div></div> \n </div>\n<p>Alternatively, you can explicitly specify the Ion classpaths in ROW FORMAT SERDE, INPUTFORMAT, and OUTPUTFORMAT. Unlike the first method, you can specify a SERDEPROPERTIES clause here. In our example DDL, we added a SerDe property that allows values that are outside of the Hive data type ranges to overflow rather than fail the query:</p>\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-sql\"><code class=\" language-sql\"><span class=\"token keyword\">CREATE</span> EXTERNAL <span class=\"token keyword\">TABLE</span> city_lots_ion2<span class=\"token punctuation\">(</span>\n <span class=\"token keyword\">type</span> STRING<span class=\"token punctuation\">,</span> \n properties struct<span class=\"token operator\">&lt;</span>\n mapblklot:string<span class=\"token punctuation\">,</span>\n blklot:string<span class=\"token punctuation\">,</span>\n block_num:string<span class=\"token punctuation\">,</span>\n lot_num:string<span class=\"token punctuation\">,</span>\n from_st:string<span class=\"token punctuation\">,</span>\n to_st:string<span class=\"token punctuation\">,</span>\n street:string<span class=\"token punctuation\">,</span>\n st_type:string<span class=\"token punctuation\">,</span>\n odd_even:string<span class=\"token operator\">&gt;</span><span class=\"token punctuation\">,</span> \n <span class=\"token keyword\">geometry</span> struct<span class=\"token operator\">&lt;</span>\n <span class=\"token keyword\">type</span>:string<span class=\"token punctuation\">,</span>\n coordinates:array<span class=\"token operator\">&lt;</span>array<span class=\"token operator\">&lt;</span>array<span class=\"token operator\">&lt;</span><span class=\"token keyword\">decimal</span><span class=\"token punctuation\">(</span><span class=\"token number\">18</span><span class=\"token punctuation\">,</span><span class=\"token number\">15</span><span class=\"token punctuation\">)</span><span class=\"token operator\">&gt;&gt;</span><span class=\"token operator\">&gt;</span><span class=\"token punctuation\">,</span>\n multi_coordinates:array<span class=\"token operator\">&lt;</span>array<span class=\"token operator\">&lt;</span>array<span class=\"token operator\">&lt;</span>array<span class=\"token operator\">&lt;</span><span class=\"token keyword\">decimal</span><span class=\"token punctuation\">(</span><span class=\"token number\">18</span><span class=\"token punctuation\">,</span><span class=\"token number\">15</span><span class=\"token punctuation\">)</span><span class=\"token operator\">&gt;&gt;</span><span class=\"token operator\">&gt;&gt;</span><span class=\"token operator\">&gt;</span>\n<span class=\"token punctuation\">)</span>\n<span class=\"token keyword\">ROW</span> FORMAT SERDE \n <span class=\"token string\">'com.amazon.ionhiveserde.IonHiveSerDe'</span>\n<span class=\"token keyword\">WITH</span> SERDEPROPERTIES <span class=\"token punctuation\">(</span>\n <span class=\"token string\">'ion.fail_on_overflow'</span><span class=\"token operator\">=</span><span class=\"token string\">'false'</span>\n <span class=\"token punctuation\">)</span>\nSTORED <span class=\"token keyword\">AS</span> INPUTFORMAT \n <span class=\"token string\">'com.amazon.ionhiveserde.formats.IonInputFormat'</span> \nOUTPUTFORMAT \n <span class=\"token string\">'com.amazon.ionhiveserde.formats.IonOutputFormat'</span>\nLOCATION\n <span class=\"token string\">'s3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_ion_binary/'</span></code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"><span></span></div></div></div> \n </div>\n<p>Athena converts STORED AS ION into the explicit classpaths, so both tables look similar in the metastore. If we look in AWS Glue, we see both tables we just created have the same input format, output format, and SerDe serialization library.</p>\n\n![image.png](https://dev-media.amazoncloud.cn/f779018c121640cdbb74bcd91c84fb32_image.png)\n<p>Now that our table is created, we can run standard SELECT queries on the <code>city_lots_ion</code> table. Let’s run a query that specifies the <code>block_num</code> from our example row of Ion data to verify that we can read from the table:</p>\n\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-sql\"><code class=\" language-sql\"><span class=\"token comment\">-- QUERY</span>\n<span class=\"token keyword\">SELECT</span> <span class=\"token operator\">*</span> <span class=\"token keyword\">FROM</span> city_lots_ion1 <span class=\"token keyword\">WHERE</span> properties<span class=\"token punctuation\">.</span>block_num<span class=\"token operator\">=</span><span class=\"token string\">'0579'</span><span class=\"token punctuation\">;</span></code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"><span></span></div></div></div> \n </div>\n<p>The following screenshot shows our results.</p>\n\n![image.png](https://dev-media.amazoncloud.cn/5dce9e8139d94b1bb750c5d30e78e32e_image.png)\n### **Use path extraction to read from specific fields**\n<p>Athena supports further customization of how data is interpreted via <a href=\"https://github.com/amzn/ion-hive-serde/blob/master/docs/serde-properties.md\" target=\"_blank\" rel=\"noopener noreferrer\">SerDe properties</a>. To specify these, you can add a WITH SERDEPROPERTIES clause, which is a subfield of the ROW FORMAT SERDE field.</p>\n<p>In some situations, we may only care about some parts of the information. Let’s suppose we don’t want any of the geometry info from the <code>citylots</code> dataset, and only need a few of the fields in properties. One solution is to specify a search path using the path extractor SerDe property:</p>\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-sql\"><code class=\" language-sql\"><span class=\"token comment\">-- Path Extractor property</span>\nion<span class=\"token punctuation\">.</span><span class=\"token operator\"></span><span style=\"color: #ff0000\"><span class=\"token operator\">&lt;</span><span class=\"token keyword\">column</span><span class=\"token operator\">&gt;</span></span><span class=\"token operator\"></span><span class=\"token punctuation\">.</span>path_extractor <span class=\"token operator\">=</span> <span class=\"token operator\"></span><span style=\"color: #ff0000\"><span class=\"token operator\">&lt;</span>search path<span class=\"token operator\">&gt;</span></span><span class=\"token operator\"></span></code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"><span></span></div></div></div> \n </div>\n<p>Path extractors are search paths that Athena uses to map the table columns to locations in the individual document. Full information on what can be done with path extractors <a href=\"https://github.com/amzn/ion-java-path-extraction\" target=\"_blank\" rel=\"noopener noreferrer\">is available on GitHub</a>, but for our example, we focus on creating simple paths that use the names of each field as an index. In this case, the search path takes the form of a space-delimited set of indexes (and wraps it with parentheses) that indicate the location of each desired piece of information. We map the search paths to table columns by using the path extractor property.</p>\n<p>By default, Athena builds path extractors dynamically based on column names unless overridden. This means that when we run our SELECT query on our <code>city_lots_ion1</code> table, Athena builds the following search paths:</p>\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-sql\"><code class=\" language-sql\"><span class=\"token keyword\">Default</span> Extractors generated <span class=\"token keyword\">by</span> Athena <span class=\"token keyword\">for</span> city_lots_ion1<span class=\"token punctuation\">.</span>\n<span class=\"token comment\">-- Extracts the 'type' field to the 'type' column</span>\n <span class=\"token string\">'ion.type.path_extractor'</span> <span class=\"token operator\">=</span> <span class=\"token string\">'(type)'</span>\n\n<span class=\"token comment\">-- Extracts the 'properties' field to the 'properties' column</span>\n <span class=\"token string\">'ion.properties.path_extractor'</span> <span class=\"token operator\">=</span> <span class=\"token string\">'(properties)'</span>\n\n<span class=\"token comment\">-- Extracts the 'geometry' field to the 'geometry' column</span>\n <span class=\"token string\">'ion.geometry.path_extractor'</span> <span class=\"token operator\">=</span> <span class=\"token string\">'(geometry)'</span></code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"><span></span></div></div></div> \n </div>\n<p>Assuming we only care about the block and lot information from the properties struct, and the geometry type from the geometry struct, we can build search paths that map the desired fields from the row of data to table columns. First let’s build the search paths:</p>\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-sql\"><code class=\" language-sql\"><span class=\"token punctuation\">(</span>properties mapblklot<span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span> Search path <span class=\"token keyword\">for</span> the mapblklot field <span class=\"token operator\">in</span> the properties struct\n<span class=\"token punctuation\">(</span>properties blklot<span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span> Search path <span class=\"token keyword\">for</span> the blklot field <span class=\"token operator\">in</span> the properties struct\n<span class=\"token punctuation\">(</span>properties block_num<span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span> Search path <span class=\"token keyword\">for</span> the block_num field <span class=\"token operator\">in</span> the properties struct\n<span class=\"token punctuation\">(</span>properties lot_num<span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span> Search path <span class=\"token keyword\">for</span> the lot_num field <span class=\"token operator\">in</span> the properties struct\n<span class=\"token punctuation\">(</span><span class=\"token keyword\">geometry</span> <span class=\"token keyword\">type</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span> Search path <span class=\"token keyword\">for</span> the <span class=\"token keyword\">type</span> field <span class=\"token operator\">in</span> the <span class=\"token keyword\">geometry</span> struct</code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"><span></span></div></div></div> \n </div>\n<p>Now let’s map these search paths to table columns using the path extractor SerDe property. Because the search paths specify where to look for data, we are able to flatten and rename our datasets to better serve our purpose. For this example, let’s rename the <code>mapblklot</code> field to <code>map_block_lot</code>, <code>blklot</code> to <code>block_lot</code>, and the geometry type to shape:</p>\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-code\"><code class=\" language-code\"> 'ion.map_block_lot.path_extractor' = '(properties mapblklot)'\n 'ion.block_lot.path_extractor' = '(properties blklot)'\n 'ion.block_num.path_extractor' = '(properties block_num)'\n 'ion.lot_num.path_extractor' = '(properties lot_num)'\n 'ion.shape.path_extractor' = '(geometry type)'</code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"><span></span></div></div></div> \n </div>\n<p>Let’s put all of this together and create the <code>city_blocks</code> table:</p>\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-sql\"><code class=\" language-sql\"><span class=\"token keyword\">CREATE</span> EXTERNAL <span class=\"token keyword\">TABLE</span> city_blocks <span class=\"token punctuation\">(</span>\n map_block_lot STRING<span class=\"token punctuation\">,</span>\n block_lot STRING<span class=\"token punctuation\">,</span>\n block_num STRING<span class=\"token punctuation\">,</span>\n lot_num STRING<span class=\"token punctuation\">,</span>\n shape STRING\n<span class=\"token punctuation\">)</span>\n<span class=\"token keyword\">ROW</span> FORMAT SERDE\n <span class=\"token string\">'com.amazon.ionhiveserde.IonHiveSerDe'</span>\n<span class=\"token keyword\">WITH</span> SERDEPROPERTIES <span class=\"token punctuation\">(</span>\n <span class=\"token string\">'ion.map_block_lot.path_extractor'</span> <span class=\"token operator\">=</span> <span class=\"token string\">'(properties mapblklot)'</span><span class=\"token punctuation\">,</span>\n <span class=\"token string\">'ion.block_lot.path_extractor'</span> <span class=\"token operator\">=</span> <span class=\"token string\">'(properties blklot)'</span><span class=\"token punctuation\">,</span> \n <span class=\"token string\">'ion.block_num.path_extractor'</span> <span class=\"token operator\">=</span> <span class=\"token string\">'(properties block_num)'</span><span class=\"token punctuation\">,</span>\n <span class=\"token string\">'ion.lot_num.path_extractor'</span> <span class=\"token operator\">=</span> <span class=\"token string\">'(properties lot_num)'</span><span class=\"token punctuation\">,</span>\n <span class=\"token string\">'ion.shape.path_extractor'</span> <span class=\"token operator\">=</span> <span class=\"token string\">'(geometry type)'</span>\n <span class=\"token punctuation\">)</span>\nSTORED <span class=\"token keyword\">AS</span> ION\nLOCATION <span class=\"token string\">'s3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_ion_binary/'</span></code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"><span></span></div></div></div> \n </div>\n<p>Now we can run a select query on the <code>city_blocks</code> table, and see the results:</p>\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-sql\"><code class=\" language-sql\"><span class=\"token comment\">-- Select Query</span>\n<span class=\"token keyword\">SELECT</span> <span class=\"token operator\">*</span> <span class=\"token keyword\">FROM</span> city_blocks <span class=\"token keyword\">WHERE</span> block_num<span class=\"token operator\">=</span><span class=\"token string\">'0579'</span><span class=\"token punctuation\">;</span></code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"><span></span></div></div></div> \n </div>\n\n![image.png](https://dev-media.amazoncloud.cn/7f1a949812c64291a1fb639e265f80b2_image.png)\n<p>Utilizing search paths in this way enables skip-scan parsing when reading from Ion binary files, which allows Athena to skip over the unneeded fields and reduces the overall time it takes to run the query.</p>\n\n### **Use CTAS and UNLOAD for data transformation**\n\n<p>Athena supports <a href=\"https://docs.aws.amazon.com/athena/latest/ug/ctas.html\" target=\"_blank\" rel=\"noopener noreferrer\">CREATE TABLE AS SELECT</a> (CTAS), which creates a new table in Athena from the results of a SELECT statement from another query. Athena also supports <a href=\"https://docs.aws.amazon.com/athena/latest/ug/unload.html\" target=\"_blank\" rel=\"noopener noreferrer\">UNLOAD</a>, which writes query results to Amazon S3 from a SELECT statement to the specified data format.</p>\n<p>Both CTAS and UNLOAD have a property to specify a format and a compression type. This allows you to easily convert Ion datasets to other data formats, such as Parquet or ORC, and vice versa, without needing to set up a complex extract, transform, and load (ETL) job. This is beneficial for situations when you want to transform your data, or know you will run repeated queries on a subset of your data and want to use some of the benefits inherent to columnar formats. Combining it with path extractors is especially helpful, because we’re only storing the data that we need in the new format.</p>\n<p>Let’s use CTAS to convert our <code>city_blocks</code> table from Ion to Parquet, and compress it via GZIP. Because we have path extractors set up for the <code>city_blocks</code> table, we only need to convert a small portion of the original dataset:</p>\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-sql\"><code class=\" language-sql\"><span class=\"token keyword\">CREATE</span> <span class=\"token keyword\">TABLE</span> city_blocks_parquet_gzip\n<span class=\"token keyword\">WITH</span> <span class=\"token punctuation\">(</span>format <span class=\"token operator\">=</span> <span class=\"token string\">'PARQUET'</span><span class=\"token punctuation\">,</span> write_compression<span class=\"token operator\">=</span><span class=\"token string\">'GZIP'</span><span class=\"token punctuation\">)</span>\n<span class=\"token keyword\">AS</span> <span class=\"token keyword\">SELECT</span> <span class=\"token operator\">*</span> <span class=\"token keyword\">FROM</span> city_blocks<span class=\"token punctuation\">;</span></code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"><span></span></div></div></div> \n </div>\n<p>We can now run queries against the <code>city_block_parquet_gzip</code> table, and should see the same result. To test this out, let’s run the same SELECT query we ran before on the Parquet table:</p>\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-sql\"><code class=\" language-sql\"><span class=\"token keyword\">SELECT</span> <span class=\"token operator\">*</span> <span class=\"token keyword\">FROM</span> city_blocks_parquet_gzip <span class=\"token keyword\">WHERE</span> block_num<span class=\"token operator\">=</span><span class=\"token string\">'0579'</span><span class=\"token punctuation\">;</span></code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"><span></span></div></div></div> \n </div>\n\n![image.png](https://dev-media.amazoncloud.cn/cd019036c0e54d678c1e45b2acc4ba0f_image.png)\n<p>When converting tables from another format to Ion, Athena supports the following compression codecs: ZSTD, BZIP2, GZIP, SNAPPY, and NONE. In addition to adding Ion as a new format for CTAS, we added the <code>ion_encoding</code> property, which allows you to choose whether the output files are created in Ion text or Ion binary. This allows for serialization of data from other formats back into Ion.</p>\n<p>Let’s convert the original <code>city_lots</code> JSON file back to Ion, but this time we specify that we want to use ZSTD compression and a binary encoding.</p>\n<p>The JSON file can be found at following location: <code>s3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_json/</code></p>\n<p>Because Ion is a superset of JSON, we can use the Ion SerDe to read this file:</p>\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-sql\"><code class=\" language-sql\"><span class=\"token keyword\">CREATE</span> EXTERNAL <span class=\"token keyword\">TABLE</span> city_blocks_json_ion_serde <span class=\"token punctuation\">(</span>\n map_block_lot STRING<span class=\"token punctuation\">,</span>\n block_lot STRING<span class=\"token punctuation\">,</span>\n block_num STRING<span class=\"token punctuation\">,</span>\n lot_num STRING<span class=\"token punctuation\">,</span>\n shape STRING\n<span class=\"token punctuation\">)</span>\n<span class=\"token keyword\">ROW</span> FORMAT SERDE\n<span class=\"token string\">'com.amazon.ionhiveserde.IonHiveSerDe'</span>\n<span class=\"token keyword\">WITH</span> SERDEPROPERTIES <span class=\"token punctuation\">(</span>\n<span class=\"token string\">'ion.map_block_lot.path_extractor'</span> <span class=\"token operator\">=</span> <span class=\"token string\">'(properties mapblklot)'</span><span class=\"token punctuation\">,</span>\n<span class=\"token string\">'ion.block_lot.path_extractor'</span> <span class=\"token operator\">=</span> <span class=\"token string\">'(properties blklot)'</span><span class=\"token punctuation\">,</span>\n<span class=\"token string\">'ion.block_num.path_extractor'</span> <span class=\"token operator\">=</span> <span class=\"token string\">'(properties block_num)'</span><span class=\"token punctuation\">,</span>\n<span class=\"token string\">'ion.lot_num.path_extractor'</span> <span class=\"token operator\">=</span> <span class=\"token string\">'(properties lot_num)'</span><span class=\"token punctuation\">,</span>\n<span class=\"token string\">'ion.shape.path_extractor'</span> <span class=\"token operator\">=</span> <span class=\"token string\">'(geometry type)'</span>\n<span class=\"token punctuation\">)</span>\nSTORED <span class=\"token keyword\">AS</span> ION\nLOCATION <span class=\"token string\">'s3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_json/'</span></code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"><span></span></div></div></div> \n </div>\n<p>Now let’s copy this table into our desired Ion binary form:</p>\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-sql\"><code class=\" language-sql\"><span class=\"token keyword\">CREATE</span> <span class=\"token keyword\">TABLE</span> city_blocks_ion_zstd\n<span class=\"token keyword\">WITH</span> <span class=\"token punctuation\">(</span>format <span class=\"token operator\">=</span> <span class=\"token string\">'ION'</span><span class=\"token punctuation\">,</span> write_compression<span class=\"token operator\">=</span><span class=\"token string\">'ZSTD'</span><span class=\"token punctuation\">,</span> ion_encoding<span class=\"token operator\">=</span><span class=\"token string\">'BINARY'</span><span class=\"token punctuation\">)</span>\n<span class=\"token keyword\">AS</span> <span class=\"token keyword\">SELECT</span> <span class=\"token operator\">*</span> <span class=\"token keyword\">FROM</span> city_blocks_parquet_gzip</code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"><span></span></div></div></div> \n </div>\n<p>Finally, let’s run our verification SELECT statement to verify everything was created properly:</p>\n\n```\nSELECT * FROM city_blocks_ion_zstd WHERE block_num='0579';\n\n```\n\n![image.png](https://dev-media.amazoncloud.cn/536c19e02a9044f8be12b4d03a718d69_image.png)\n\n### **Use UNLOAD to store Ion data in Amazon S3**\n\n<p>Sometimes we just want to reformat the data and don’t need to store the additional metadata to query the table. In this case, we can use UNLOAD, which stores the results of the query in the specified format in an S3 bucket.</p>\n<p>Let’s test it out, using UNLOAD to convert the <code>drivers_names</code> table from Ion to ORC, compress it via ZLIB, and store it to an S3 bucket:</p>\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-sql\"><code class=\" language-sql\">UNLOAD <span class=\"token punctuation\">(</span><span class=\"token keyword\">SELECT</span> <span class=\"token operator\">*</span> <span class=\"token keyword\">FROM</span> city_blocks_ion_zstd <span class=\"token keyword\">WHERE</span> block_num<span class=\"token operator\">=</span><span class=\"token string\">'0579'</span><span class=\"token punctuation\">)</span> \n<span class=\"token keyword\">TO</span> <span class=\"token string\">'s3://<span style=\"color: #ff0000\">&lt;your-s3-bucket&gt;</span>/athena-ion-blog/unload/orc_zlib/'</span>\n<span class=\"token keyword\">WITH</span> <span class=\"token punctuation\">(</span>format <span class=\"token operator\">=</span> <span class=\"token string\">'ORC'</span><span class=\"token punctuation\">,</span> compression<span class=\"token operator\">=</span><span class=\"token string\">'ZLIB'</span><span class=\"token punctuation\">)</span></code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"><span></span></div></div></div> \n </div>\n<p>When you check in Amazon S3, you can find a new file in the ORC format.</p>\n\n![image.png](https://dev-media.amazoncloud.cn/c18faaa826794792be5465041074b593_image.png)\n\n### **Conclusion**\n<p>This post talked about the new feature in Athena that allows you to query and create Ion datasets using standard SQL. We discussed use cases and unique features of the Ion format like type system, dual formats (Ion text and Ion binary), efficiency gains, and skip-scanning. You can get started with querying an Ion dataset stored in Amazon S3 by simply creating a table in Athena, and also converting existing datasets to Ion format and vice versa using CTAS and UNLOAD statements.</p>\n<p>To learn more about querying Ion using Athena, refer to <a href=\"https://docs.aws.amazon.com/athena/latest/ug/ion-serde.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Ion Hive SerDe</a>.</p>\n\n### **References**\n<ul> \n <li>Name of Dataset: City Lots San Francisco in .json</li> \n <li>Licenses and Notes: OG dataset under PDDL 1.0; Github .json version under MIT style license</li> \n <li><a href=\"https://github.com/zemirco/sf-city-lots-json#license\" target=\"_blank\" rel=\"noopener noreferrer\">Link(s) to Dataset(s)</a> (note – original dataset is NLA from DataSF)</li> \n <li><a href=\"//aws-bigdata-blog/artifacts/athena-ion-blog/\" target=\"_blank\" rel=\"noopener noreferrer\">Location of transformed version</a></li> \n </ul>\n\n![image.png](https://dev-media.amazoncloud.cn/eeee1f0a46de48c6aab4c27f1361c454_image.png)\n<p style=\"clear: both\"><strong>Pathik Shah </strong>is a Sr. Big Data Architect on Amazon Athena. He joined AWS in 2015 and has been focusing in the big data analytics space since then, helping customers build scalable and robust solutions using AWS analytics services.</p>\n\n![image.png](https://dev-media.amazoncloud.cn/a7d359d3ca914cbc9d5038d78fc520da_image.png)\n<p style=\"clear: both\"><strong>Jacob Stein</strong> works on the Amazon Athena team as a Software Development Engineer. He led the project to add support for Ion in Athena. He loves working on technical problems unique to internet scale data, and is passionate about developing scalable solutions for distributed systems.</p>\n\n![image.png](https://dev-media.amazoncloud.cn/a10f3cbac13c4fe98acb4a11b4d5f805_image.png)\n<p style=\"clear: both\"><strong>Giovanni Matteo Fumarola</strong> is the Engineering Manager of the Athena Data Lake and Storage team. He is an Apache Hadoop Committer and PMC member. He has been focusing in the big data analytics space since 2013.</p>\n\n![image.png](https://dev-media.amazoncloud.cn/06de775a780f4e2797641f3467f3891c_image.png)\n<p style=\"clear: both\"><strong>Pete Ford</strong> is a Sr. Technical Program Manager at Amazon.</p>\n\n\n\n\n\n","render":"<p><a href=\"http://aws.amazon.com/athena\" target=\"_blank\">Amazon Athena</a> is an interactive query service that makes it easy to analyze data in <a href=\"http://aws.amazon.com/s3\" target=\"_blank\">Amazon Simple Storage Service</a> (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.</p>\n<p><a href=\"https://amzn.github.io/ion-docs/\" target=\"_blank\">Amazon Ion</a> is a <a href=\"https://amzn.github.io/ion-docs/guides/why.html#rich-type-system\" target=\"_blank\">richly typed</a>, <a href=\"https://amzn.github.io/ion-docs/guides/why.html#self-describing\" target=\"_blank\">self-describing</a>, hierarchical data serialization format offering <a href=\"https://amzn.github.io/ion-docs/guides/why.html#dual-format-interoperability\" target=\"_blank\">interchangeable binary and text</a> representations. The <a href=\"https://amzn.github.io/ion-docs/docs/spec.html\" target=\"_blank\">text format</a> extends <a href=\"http://json.org/\" target=\"_blank\">JSON</a> (meaning all JSON files are valid Ion files), and is easy to read and author, supporting rapid prototyping. The <a href=\"https://amzn.github.io/ion-docs/docs/binary.html\" target=\"_blank\">binary representation</a> is <a href=\"https://amzn.github.io/ion-docs/guides/why.html#read-optimized-binary-format\" target=\"_blank\">efficient to store, transmit, and skip-scan parse</a>. The rich type system provides unambiguous semantics for long-term preservation of data that can survive multiple generations of software evolution.</p>\n<p>Athena now supports querying and writing data in Ion format. The Ion format is currently used by internal Amazon teams, by external services such as <a href=\"https://aws.amazon.com/qldb/\" target=\"_blank\">Amazon Quantum Ledger Database</a> (Amazon QLDB) and <a href=\"https://aws.amazon.com/dynamodb/\" target=\"_blank\">Amazon DynamoDB</a> (which can be exported into Ion), and in the open-source SQL query language PartiQL.</p>\n<p>In this post, we discuss use cases and the unique features Ion offers, followed by examples of querying Ion with Athena. For demonstration purposes, we use the transformed version of the <a href=\"https://github.com/zemirco/sf-city-lots-json\" target=\"_blank\">City Lots San Francisco</a> dataset.</p>\n<h3><a id=\"Features_of_Ion_5\"></a><strong>Features of Ion</strong></h3>\n<p>In this section, we discuss some of the unique features that Ion offers:</p>\n<ul> \n <li>Type system</li> \n <li>Dual format</li> \n <li>Efficiency gains</li> \n <li>Skip scanning</li> \n </ul>\n<h4><a id=\"Type_system_14\"></a><strong>Type system</strong></h4>\n<p>Ion extends JSON, adding support for more precise <a href=\"https://amzn.github.io/ion-docs/docs/spec.html\" target=\"_blank\">data types</a> to improve interpretability, simplify processing, and avoid rounding errors. These high precision numeric types are essential for financial services, where fractions of a cent on every transaction add up. Data types that are added are arbitrary-size integers, binary floating-point numbers, infinite-precision decimals, timestamps, CLOBS, and BLOBS.</p>\n<h4><a id=\"Dual_format_17\"></a><strong>Dual format</strong></h4>\n<p>Users can be presented with a familiar text-based representation while benefiting from the performance efficiencies of a binary format. The interoperability between the two formats enables you to rapidly discover, digest, and interpret data in a familiar JSON-like representation, while underlying applications benefit from a reduction in storage, memory, network bandwidth, and latency from the binary format. This means you can write plain text queries that run against both text-based and binary-based Ion. You can rewrite parts of your data in text-based Ion when you need human readable data during development and switch to binary in production.</p>\n<p>When debugging a process, the ability for systems engineers to locate data and understand it as quickly as possible is vital. Ion provides mechanisms to move between binary and a text-based representation, optimizing for both the human and the machine. Athena supports querying and writing data in both of these Ion formats. The following is an example Ion text document taken from the transformed version of the <code>citylots</code> dataset:</p>\n<div> \n <div><pre><code>{ \"type\": \"Feature\"\n, \"properties\": { \"MAPBLKLOT\": \"0004002\"\n ,\"BLKLOT\": \"0004002\"\n ,\"BLOCK_NUM\": \"0004\"\n , \"LOT_NUM\": \"002\"\n , \"FROM_ST\": \"0\"\n , \"TO_ST\": \"0\"\n , \"STREET\": \"UNKNOWN\"\n , \"ST_TYPE\": null\n , \"ODD_EVEN\": \"E\" }\n, \"geometry\": { \"type\": \"Polygon\"\n , \"coordinates\": [ [ [ -122.415701204606876, 37.808327252671461, 0.0 ],\n [ -122.415760743593196, 37.808630700240904, 0.0 ],\n [ -122.413787891332404, 37.808566801319841, 0.0 ],\n [ -122.415701204606876, 37.808327252671461, 0.0 ] ] ] } }</code></pre><div><div><span>Code</span></div></div></div> \n </div>\n<h4><a id=\"Efficiency_gains_39\"></a><strong>Efficiency gains</strong></h4>\n<p>Binary-encoded Ion reduces file size by moving repeated values, such as field names, into a <a href=\"https://amzn.github.io/ion-docs/docs/spec.html#symbol\" target=\"_blank\">symbol</a> table. Symbol tables reduce CPU and read latency by limiting the validation of character encoding to the single instance of the value in the symbol table.</p>\n<p>For example, a company that operates at Amazon’s scale can produce large volumes of application logs. When compressing Ion and JSON logs, we noticed approximately 35% less CPU time to compress the log, which produced an average of roughly 26% smaller files. Log files are critical when needed but costly to retain, so the reduction in file sizes combined with the read performance gains from symbol tables helps when handling these logs. The following is an example of file size reduction with the <code>citylots</code> JSON dataset when converted to Ion binary with GZIP and ZSTD compression:</p>\n<pre><code class=\"lang-\">77MB citylots.ion\n 17MB citylots.ion.gz\n 15MB citylots.ion.zst\n181MB citylots.json\n 22MB citylots.json.gz\n 18MB citylots.json.zst\n</code></pre>\n<h4><a id=\"Skipscanning_52\"></a><strong>Skip-scanning</strong></h4>\n<p>In a textual format, every byte must be read and interpreted, but because Ion’s binary format is a TLV (type-length-value) encoding, an application may skip over elements that aren’t needed. This reduces query and application processing costs correlated with the proportion of unexamined fields.</p>\n<p>For example, forensic analysis of application log data involves reading large volumes of data where only a fraction of the data is needed for diagnosis. In these scenarios, skip-scanning allows the binary Ion reader to move past irrelevant fields without the cost of reading the element stored within a field. This results in users experiencing lower resource usage and quicker response times.</p>\n<h3><a id=\"Query_Ion_datasets_using_Athena_56\"></a><strong>Query Ion datasets using Athena</strong></h3>\n<p>Athena now supports querying and creating Ion-formatted datasets via an <a href=\"https://github.com/amzn/ion-hive-serde\" target=\"_blank\">Ion-specific SerDe</a>, which in conjunction with <code>IonInputFormat</code> and <code>IonOutputFormat</code> allows you to read and write valid Ion data. Deserialization allows you to run SELECT queries on the Ion data so that it can be queried to gain insights. Serialization through CTAS or INSERT INTO queries allows you to copy datasets from existing tables’ values or generate new data in the Ion format.</p>\n<p>The interchangeable nature of Ion text and Ion binary means that Athena can read datasets that contain both types of files. Because Ion is a superset of JSON, a table using the Ion SerDe can also include JSON files. Unlike the <a href=\"https://docs.aws.amazon.com/athena/latest/ug/json-serde.html\" target=\"_blank\">JSON SerDe</a>, where every new line character indicates a new row, the Ion SerDe uses a combination of closing brackets and new line characters to determine new rows. This means that if each JSON record in your source documents isn’t on a single line, these files can now be read in Athena via the Ion SerDe.</p>\n<h3><a id=\"Create_external_tables_60\"></a><strong>Create external tables</strong></h3>\n<p>Athena supports querying Ion-based datasets by defining <a href=\"https://aws.amazon.com/glue\" target=\"_blank\">AWS Glue</a> tables with the user-defined metadata. Let’s start with an example of creating an external table for a dataset stored in Ion text. The following is a sample row from the <code>citylots</code> dataset:</p>\n<div> \n <div><pre><code>{\n type:\"Feature\",\n properties:{\n mapblklot:\"0579021\",\n blklot:\"0579024\",\n block_num:\"0579\",\n lot_num:\"024\",\n from_st:\"2160\",\n to_st:\"2160\",\n street:\"PACIFIC\",\n st_type:\"AVE\",\n odd_even:\"E\"\n },\n geometry:{\n type:\"Polygon\",coordinates:[[[-122.4308798855922, ...]]]\n }\n}</code></pre><div><div><span></span></div></div></div> \n </div>\n<p>To create an external table that has its data stored in Ion, you have two syntactic options.</p>\n<p>First, you can specify STORED AS ION. This is a more concise method, and is best used for simple cases, when no additional properties are required. See the following code:</p>\n<div> \n <div><pre><code><span>CREATE</span> EXTERNAL <span>TABLE</span> city_lots_ion1 <span>(</span>\n <span>type</span> STRING<span>,</span> \n properties struct<span>&lt;</span>\n mapblklot:string<span>,</span>\n blklot:string<span>,</span>\n block_num:string<span>,</span>\n lot_num:string<span>,</span>\n from_st:string<span>,</span>\n to_st:string<span>,</span>\n street:string<span>,</span>\n st_type:string<span>,</span>\n odd_even:string<span>&gt;</span><span>,</span> \n <span>geometry</span> struct<span>&lt;</span>\n <span>type</span>:string<span>,</span>\n coordinates:array<span>&lt;</span>array<span>&lt;</span>array<span>&lt;</span><span>decimal</span><span>(</span><span>18</span><span>,</span><span>15</span><span>)</span><span>&gt;&gt;</span><span>&gt;</span><span>,</span>\n multi_coordinates:array<span>&lt;</span>array<span>&lt;</span>array<span>&lt;</span>array<span>&lt;</span><span>decimal</span><span>(</span><span>18</span><span>,</span><span>15</span><span>)</span><span>&gt;&gt;</span><span>&gt;&gt;</span><span>&gt;</span>\n<span>)</span>\nSTORED <span>AS</span> ION\nLOCATION <span>'s3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_ion_binary/'</span></code></pre><div><div><span></span></div></div></div> \n </div>\n<p>Alternatively, you can explicitly specify the Ion classpaths in ROW FORMAT SERDE, INPUTFORMAT, and OUTPUTFORMAT. Unlike the first method, you can specify a SERDEPROPERTIES clause here. In our example DDL, we added a SerDe property that allows values that are outside of the Hive data type ranges to overflow rather than fail the query:</p>\n<div> \n <div><pre><code><span>CREATE</span> EXTERNAL <span>TABLE</span> city_lots_ion2<span>(</span>\n <span>type</span> STRING<span>,</span> \n properties struct<span>&lt;</span>\n mapblklot:string<span>,</span>\n blklot:string<span>,</span>\n block_num:string<span>,</span>\n lot_num:string<span>,</span>\n from_st:string<span>,</span>\n to_st:string<span>,</span>\n street:string<span>,</span>\n st_type:string<span>,</span>\n odd_even:string<span>&gt;</span><span>,</span> \n <span>geometry</span> struct<span>&lt;</span>\n <span>type</span>:string<span>,</span>\n coordinates:array<span>&lt;</span>array<span>&lt;</span>array<span>&lt;</span><span>decimal</span><span>(</span><span>18</span><span>,</span><span>15</span><span>)</span><span>&gt;&gt;</span><span>&gt;</span><span>,</span>\n multi_coordinates:array<span>&lt;</span>array<span>&lt;</span>array<span>&lt;</span>array<span>&lt;</span><span>decimal</span><span>(</span><span>18</span><span>,</span><span>15</span><span>)</span><span>&gt;&gt;</span><span>&gt;&gt;</span><span>&gt;</span>\n<span>)</span>\n<span>ROW</span> FORMAT SERDE \n <span>'com.amazon.ionhiveserde.IonHiveSerDe'</span>\n<span>WITH</span> SERDEPROPERTIES <span>(</span>\n <span>'ion.fail_on_overflow'</span><span>=</span><span>'false'</span>\n <span>)</span>\nSTORED <span>AS</span> INPUTFORMAT \n <span>'com.amazon.ionhiveserde.formats.IonInputFormat'</span> \nOUTPUTFORMAT \n <span>'com.amazon.ionhiveserde.formats.IonOutputFormat'</span>\nLOCATION\n <span>'s3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_ion_binary/'</span></code></pre><div><div><span></span></div></div></div> \n </div>\n<p>Athena converts STORED AS ION into the explicit classpaths, so both tables look similar in the metastore. If we look in AWS Glue, we see both tables we just created have the same input format, output format, and SerDe serialization library.</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/f779018c121640cdbb74bcd91c84fb32_image.png\" alt=\"image.png\" /></p>\n<p>Now that our table is created, we can run standard SELECT queries on the <code>city_lots_ion</code> table. Let’s run a query that specifies the <code>block_num</code> from our example row of Ion data to verify that we can read from the table:</p>\n<div> \n <div><pre><code><span>-- QUERY</span>\n<span>SELECT</span> <span>*</span> <span>FROM</span> city_lots_ion1 <span>WHERE</span> properties<span>.</span>block_num<span>=</span><span>'0579'</span><span>;</span></code></pre><div><div><span></span></div></div></div> \n </div>\n<p>The following screenshot shows our results.</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/5dce9e8139d94b1bb750c5d30e78e32e_image.png\" alt=\"image.png\" /></p>\n<h3><a id=\"Use_path_extraction_to_read_from_specific_fields_147\"></a><strong>Use path extraction to read from specific fields</strong></h3>\n<p>Athena supports further customization of how data is interpreted via <a href=\"https://github.com/amzn/ion-hive-serde/blob/master/docs/serde-properties.md\" target=\"_blank\">SerDe properties</a>. To specify these, you can add a WITH SERDEPROPERTIES clause, which is a subfield of the ROW FORMAT SERDE field.</p>\n<p>In some situations, we may only care about some parts of the information. Let’s suppose we don’t want any of the geometry info from the <code>citylots</code> dataset, and only need a few of the fields in properties. One solution is to specify a search path using the path extractor SerDe property:</p>\n<div> \n <div><pre><code><span>-- Path Extractor property</span>\nion<span>.</span><span></span><span><span>&lt;</span><span>column</span><span>&gt;</span></span><span></span><span>.</span>path_extractor <span>=</span> <span></span><span><span>&lt;</span>search path<span>&gt;</span></span><span></span></code></pre><div><div><span></span></div></div></div> \n </div>\n<p>Path extractors are search paths that Athena uses to map the table columns to locations in the individual document. Full information on what can be done with path extractors <a href=\"https://github.com/amzn/ion-java-path-extraction\" target=\"_blank\">is available on GitHub</a>, but for our example, we focus on creating simple paths that use the names of each field as an index. In this case, the search path takes the form of a space-delimited set of indexes (and wraps it with parentheses) that indicate the location of each desired piece of information. We map the search paths to table columns by using the path extractor property.</p>\n<p>By default, Athena builds path extractors dynamically based on column names unless overridden. This means that when we run our SELECT query on our <code>city_lots_ion1</code> table, Athena builds the following search paths:</p>\n<div> \n <div><pre><code><span>Default</span> Extractors generated <span>by</span> Athena <span>for</span> city_lots_ion1<span>.</span>\n<span>-- Extracts the 'type' field to the 'type' column</span>\n <span>'ion.type.path_extractor'</span> <span>=</span> <span>'(type)'</span>\n<p><span>– Extracts the ‘properties’ field to the ‘properties’ column</span><br />\n<span>‘ion.properties.path_extractor’</span> <span>=</span> <span>‘(properties)’</span></p>\n<p><span>– Extracts the ‘geometry’ field to the ‘geometry’ column</span><br />\n<span>‘ion.geometry.path_extractor’</span> <span>=</span> <span>‘(geometry)’</span></code></pre><div><div><span></span></div></div></div><br />\n</div></p>\n<p>Assuming we only care about the block and lot information from the properties struct, and the geometry type from the geometry struct, we can build search paths that map the desired fields from the row of data to table columns. First let’s build the search paths:</p>\n<div> \n <div><pre><code><span>(</span>properties mapblklot<span>)</span> <span>-</span> Search path <span>for</span> the mapblklot field <span>in</span> the properties struct\n<span>(</span>properties blklot<span>)</span> <span>-</span> Search path <span>for</span> the blklot field <span>in</span> the properties struct\n<span>(</span>properties block_num<span>)</span> <span>-</span> Search path <span>for</span> the block_num field <span>in</span> the properties struct\n<span>(</span>properties lot_num<span>)</span> <span>-</span> Search path <span>for</span> the lot_num field <span>in</span> the properties struct\n<span>(</span><span>geometry</span> <span>type</span><span>)</span> <span>-</span> Search path <span>for</span> the <span>type</span> field <span>in</span> the <span>geometry</span> struct</code></pre><div><div><span></span></div></div></div> \n </div>\n<p>Now let’s map these search paths to table columns using the path extractor SerDe property. Because the search paths specify where to look for data, we are able to flatten and rename our datasets to better serve our purpose. For this example, let’s rename the <code>mapblklot</code> field to <code>map_block_lot</code>, <code>blklot</code> to <code>block_lot</code>, and the geometry type to shape:</p>\n<div> \n <div><pre><code> 'ion.map_block_lot.path_extractor' = '(properties mapblklot)'\n 'ion.block_lot.path_extractor' = '(properties blklot)'\n 'ion.block_num.path_extractor' = '(properties block_num)'\n 'ion.lot_num.path_extractor' = '(properties lot_num)'\n 'ion.shape.path_extractor' = '(geometry type)'</code></pre><div><div><span></span></div></div></div> \n </div>\n<p>Let’s put all of this together and create the <code>city_blocks</code> table:</p>\n<div> \n <div><pre><code><span>CREATE</span> EXTERNAL <span>TABLE</span> city_blocks <span>(</span>\n map_block_lot STRING<span>,</span>\n block_lot STRING<span>,</span>\n block_num STRING<span>,</span>\n lot_num STRING<span>,</span>\n shape STRING\n<span>)</span>\n<span>ROW</span> FORMAT SERDE\n <span>'com.amazon.ionhiveserde.IonHiveSerDe'</span>\n<span>WITH</span> SERDEPROPERTIES <span>(</span>\n <span>'ion.map_block_lot.path_extractor'</span> <span>=</span> <span>'(properties mapblklot)'</span><span>,</span>\n <span>'ion.block_lot.path_extractor'</span> <span>=</span> <span>'(properties blklot)'</span><span>,</span> \n <span>'ion.block_num.path_extractor'</span> <span>=</span> <span>'(properties block_num)'</span><span>,</span>\n <span>'ion.lot_num.path_extractor'</span> <span>=</span> <span>'(properties lot_num)'</span><span>,</span>\n <span>'ion.shape.path_extractor'</span> <span>=</span> <span>'(geometry type)'</span>\n <span>)</span>\nSTORED <span>AS</span> ION\nLOCATION <span>'s3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_ion_binary/'</span></code></pre><div><div><span></span></div></div></div> \n </div>\n<p>Now we can run a select query on the <code>city_blocks</code> table, and see the results:</p>\n<div> \n <div><pre><code><span>-- Select Query</span>\n<span>SELECT</span> <span>*</span> <span>FROM</span> city_blocks <span>WHERE</span> block_num<span>=</span><span>'0579'</span><span>;</span></code></pre><div><div><span></span></div></div></div> \n </div>\n<p><img src=\"https://dev-media.amazoncloud.cn/7f1a949812c64291a1fb639e265f80b2_image.png\" alt=\"image.png\" /></p>\n<p>Utilizing search paths in this way enables skip-scan parsing when reading from Ion binary files, which allows Athena to skip over the unneeded fields and reduces the overall time it takes to run the query.</p>\n<h3><a id=\"Use_CTAS_and_UNLOAD_for_data_transformation_213\"></a><strong>Use CTAS and UNLOAD for data transformation</strong></h3>\n<p>Athena supports <a href=\"https://docs.aws.amazon.com/athena/latest/ug/ctas.html\" target=\"_blank\">CREATE TABLE AS SELECT</a> (CTAS), which creates a new table in Athena from the results of a SELECT statement from another query. Athena also supports <a href=\"https://docs.aws.amazon.com/athena/latest/ug/unload.html\" target=\"_blank\">UNLOAD</a>, which writes query results to Amazon S3 from a SELECT statement to the specified data format.</p>\n<p>Both CTAS and UNLOAD have a property to specify a format and a compression type. This allows you to easily convert Ion datasets to other data formats, such as Parquet or ORC, and vice versa, without needing to set up a complex extract, transform, and load (ETL) job. This is beneficial for situations when you want to transform your data, or know you will run repeated queries on a subset of your data and want to use some of the benefits inherent to columnar formats. Combining it with path extractors is especially helpful, because we’re only storing the data that we need in the new format.</p>\n<p>Let’s use CTAS to convert our <code>city_blocks</code> table from Ion to Parquet, and compress it via GZIP. Because we have path extractors set up for the <code>city_blocks</code> table, we only need to convert a small portion of the original dataset:</p>\n<div> \n <div><pre><code><span>CREATE</span> <span>TABLE</span> city_blocks_parquet_gzip\n<span>WITH</span> <span>(</span>format <span>=</span> <span>'PARQUET'</span><span>,</span> write_compression<span>=</span><span>'GZIP'</span><span>)</span>\n<span>AS</span> <span>SELECT</span> <span>*</span> <span>FROM</span> city_blocks<span>;</span></code></pre><div><div><span></span></div></div></div> \n </div>\n<p>We can now run queries against the <code>city_block_parquet_gzip</code> table, and should see the same result. To test this out, let’s run the same SELECT query we ran before on the Parquet table:</p>\n<div> \n <div><pre><code><span>SELECT</span> <span>*</span> <span>FROM</span> city_blocks_parquet_gzip <span>WHERE</span> block_num<span>=</span><span>'0579'</span><span>;</span></code></pre><div><div><span></span></div></div></div> \n </div>\n<p><img src=\"https://dev-media.amazoncloud.cn/cd019036c0e54d678c1e45b2acc4ba0f_image.png\" alt=\"image.png\" /></p>\n<p>When converting tables from another format to Ion, Athena supports the following compression codecs: ZSTD, BZIP2, GZIP, SNAPPY, and NONE. In addition to adding Ion as a new format for CTAS, we added the <code>ion_encoding</code> property, which allows you to choose whether the output files are created in Ion text or Ion binary. This allows for serialization of data from other formats back into Ion.</p>\n<p>Let’s convert the original <code>city_lots</code> JSON file back to Ion, but this time we specify that we want to use ZSTD compression and a binary encoding.</p>\n<p>The JSON file can be found at following location: <code>s3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_json/</code></p>\n<p>Because Ion is a superset of JSON, we can use the Ion SerDe to read this file:</p>\n<div> \n <div><pre><code><span>CREATE</span> EXTERNAL <span>TABLE</span> city_blocks_json_ion_serde <span>(</span>\n map_block_lot STRING<span>,</span>\n block_lot STRING<span>,</span>\n block_num STRING<span>,</span>\n lot_num STRING<span>,</span>\n shape STRING\n<span>)</span>\n<span>ROW</span> FORMAT SERDE\n<span>'com.amazon.ionhiveserde.IonHiveSerDe'</span>\n<span>WITH</span> SERDEPROPERTIES <span>(</span>\n<span>'ion.map_block_lot.path_extractor'</span> <span>=</span> <span>'(properties mapblklot)'</span><span>,</span>\n<span>'ion.block_lot.path_extractor'</span> <span>=</span> <span>'(properties blklot)'</span><span>,</span>\n<span>'ion.block_num.path_extractor'</span> <span>=</span> <span>'(properties block_num)'</span><span>,</span>\n<span>'ion.lot_num.path_extractor'</span> <span>=</span> <span>'(properties lot_num)'</span><span>,</span>\n<span>'ion.shape.path_extractor'</span> <span>=</span> <span>'(geometry type)'</span>\n<span>)</span>\nSTORED <span>AS</span> ION\nLOCATION <span>'s3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_json/'</span></code></pre><div><div><span></span></div></div></div> \n </div>\n<p>Now let’s copy this table into our desired Ion binary form:</p>\n<div> \n <div><pre><code><span>CREATE</span> <span>TABLE</span> city_blocks_ion_zstd\n<span>WITH</span> <span>(</span>format <span>=</span> <span>'ION'</span><span>,</span> write_compression<span>=</span><span>'ZSTD'</span><span>,</span> ion_encoding<span>=</span><span>'BINARY'</span><span>)</span>\n<span>AS</span> <span>SELECT</span> <span>*</span> <span>FROM</span> city_blocks_parquet_gzip</code></pre><div><div><span></span></div></div></div> \n </div>\n<p>Finally, let’s run our verification SELECT statement to verify everything was created properly:</p>\n<pre><code class=\"lang-\">SELECT * FROM city_blocks_ion_zstd WHERE block_num='0579';\n\n</code></pre>\n<p><img src=\"https://dev-media.amazoncloud.cn/536c19e02a9044f8be12b4d03a718d69_image.png\" alt=\"image.png\" /></p>\n<h3><a id=\"Use_UNLOAD_to_store_Ion_data_in_Amazon_S3_268\"></a><strong>Use UNLOAD to store Ion data in Amazon S3</strong></h3>\n<p>Sometimes we just want to reformat the data and don’t need to store the additional metadata to query the table. In this case, we can use UNLOAD, which stores the results of the query in the specified format in an S3 bucket.</p>\n<p>Let’s test it out, using UNLOAD to convert the <code>drivers_names</code> table from Ion to ORC, compress it via ZLIB, and store it to an S3 bucket:</p>\n<div> \n <div><pre><code>UNLOAD <span>(</span><span>SELECT</span> <span>*</span> <span>FROM</span> city_blocks_ion_zstd <span>WHERE</span> block_num<span>=</span><span>'0579'</span><span>)</span> \n<span>TO</span> <span>'s3://<span>&lt;your-s3-bucket&gt;</span>/athena-ion-blog/unload/orc_zlib/'</span>\n<span>WITH</span> <span>(</span>format <span>=</span> <span>'ORC'</span><span>,</span> compression<span>=</span><span>'ZLIB'</span><span>)</span></code></pre><div><div><span></span></div></div></div> \n </div>\n<p>When you check in Amazon S3, you can find a new file in the ORC format.</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/c18faaa826794792be5465041074b593_image.png\" alt=\"image.png\" /></p>\n<h3><a id=\"Conclusion_281\"></a><strong>Conclusion</strong></h3>\n<p>This post talked about the new feature in Athena that allows you to query and create Ion datasets using standard SQL. We discussed use cases and unique features of the Ion format like type system, dual formats (Ion text and Ion binary), efficiency gains, and skip-scanning. You can get started with querying an Ion dataset stored in Amazon S3 by simply creating a table in Athena, and also converting existing datasets to Ion format and vice versa using CTAS and UNLOAD statements.</p>\n<p>To learn more about querying Ion using Athena, refer to <a href=\"https://docs.aws.amazon.com/athena/latest/ug/ion-serde.html\" target=\"_blank\">Amazon Ion Hive SerDe</a>.</p>\n<h3><a id=\"References_285\"></a><strong>References</strong></h3>\n<ul> \n <li>Name of Dataset: City Lots San Francisco in .json</li> \n <li>Licenses and Notes: OG dataset under PDDL 1.0; Github .json version under MIT style license</li> \n <li><a href=\"https://github.com/zemirco/sf-city-lots-json#license\" target=\"_blank\">Link(s) to Dataset(s)</a> (note – original dataset is NLA from DataSF)</li> \n <li><a href=\"//aws-bigdata-blog/artifacts/athena-ion-blog/\" target=\"_blank\">Location of transformed version</a></li> \n </ul>\n<p><img src=\"https://dev-media.amazoncloud.cn/eeee1f0a46de48c6aab4c27f1361c454_image.png\" alt=\"image.png\" /></p>\n<p><strong>Pathik Shah </strong>is a Sr. Big Data Architect on Amazon Athena. He joined AWS in 2015 and has been focusing in the big data analytics space since then, helping customers build scalable and robust solutions using AWS analytics services.</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/a7d359d3ca914cbc9d5038d78fc520da_image.png\" alt=\"image.png\" /></p>\n<p><strong>Jacob Stein</strong> works on the Amazon Athena team as a Software Development Engineer. He led the project to add support for Ion in Athena. He loves working on technical problems unique to internet scale data, and is passionate about developing scalable solutions for distributed systems.</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/a10f3cbac13c4fe98acb4a11b4d5f805_image.png\" alt=\"image.png\" /></p>\n<p><strong>Giovanni Matteo Fumarola</strong> is the Engineering Manager of the Athena Data Lake and Storage team. He is an Apache Hadoop Committer and PMC member. He has been focusing in the big data analytics space since 2013.</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/06de775a780f4e2797641f3467f3891c_image.png\" alt=\"image.png\" /></p>\n<p><strong>Pete Ford</strong> is a Sr. Technical Program Manager at Amazon.</p>\n"}
目录
亚马逊云科技解决方案 基于行业客户应用场景及技术领域的解决方案
联系亚马逊云科技专家
亚马逊云科技解决方案
基于行业客户应用场景及技术领域的解决方案
联系专家
0
目录
关闭
contact-us