Analyze Amazon Ion datasets using Amazon Athena

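Overseas Selection

The Overseas Selection column gathers high-quality technical content about Amazon Web Services from around the world. In this content, "AWS" is an abbreviation of "Amazon Web Services" and is not displayed as a trademark on this website.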

{"value":"<a href=\"http://aws.amazon.com/athena\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Athena</a> is an interactive query service that makes it easy to analyze data in <a href=\"http://aws.amazon.com/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service</a> (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.\n<a href=\"https://amzn.github.io/ion-docs/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Ion</a> is a <a href=\"https://amzn.github.io/ion-docs/guides/why.html#rich-type-system\" target=\"_blank\" rel=\"noopener noreferrer\">richly typed</a>, <a href=\"https://amzn.github.io/ion-docs/guides/why.html#self-describing\" target=\"_blank\" rel=\"noopener noreferrer\">self-describing</a>, hierarchical data serialization format offering <a href=\"https://amzn.github.io/ion-docs/guides/why.html#dual-format-interoperability\" target=\"_blank\" rel=\"noopener noreferrer\">interchangeable binary and text</a> representations. The <a href=\"https://amzn.github.io/ion-docs/docs/spec.html\" target=\"_blank\" rel=\"noopener noreferrer\">text format</a> extends <a href=\"http://json.org/\" target=\"_blank\" rel=\"noopener noreferrer\">JSON</a> (meaning all JSON files are valid Ion files), and is easy to read and author, supporting rapid prototyping. The <a href=\"https://amzn.github.io/ion-docs/docs/binary.html\" target=\"_blank\" rel=\"noopener noreferrer\">binary representation</a> is <a href=\"https://amzn.github.io/ion-docs/guides/why.html#read-optimized-binary-format\" target=\"_blank\" rel=\"noopener noreferrer\">efficient to store, transmit, and skip-scan parse</a>. The rich type system provides unambiguous semantics for long-term preservation of data that can survive multiple generations of software evolution.\nAthena now supports querying and writing data in Ion format. 
The Ion format is currently used by internal Amazon teams, by external services such as <a href=\"https://aws.amazon.com/qldb/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Quantum Ledger Database</a> (Amazon QLDB) and <a href=\"https://aws.amazon.com/dynamodb/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon DynamoDB</a> (which can be exported into Ion), and in the open-source SQL query language PartiQL.\nIn this post, we discuss use cases and the unique features Ion offers, followed by examples of querying Ion with Athena. For demonstration purposes, we use the transformed version of the <a href=\"https://github.com/zemirco/sf-city-lots-json\" target=\"_blank\" rel=\"noopener noreferrer\">City Lots San Francisco</a> dataset.\n\n### **Features of Ion**\nIn this section, we discuss some of the unique features that Ion offers:\n<ul> \n <li>Type system</li> \n <li>Dual format</li> \n <li>Efficiency gains</li> \n <li>Skip scanning</li> \n </ul>\n\n#### **Type system**\nIon extends JSON, adding support for more precise <a href=\"https://amzn.github.io/ion-docs/docs/spec.html\" target=\"_blank\" rel=\"noopener noreferrer\">data types</a> to improve interpretability, simplify processing, and avoid rounding errors. These high precision numeric types are essential for financial services, where fractions of a cent on every transaction add up. Data types that are added are arbitrary-size integers, binary floating-point numbers, infinite-precision decimals, timestamps, CLOBS, and BLOBS.\n\n#### **Dual format**\nUsers can be presented with a familiar text-based representation while benefiting from the performance efficiencies of a binary format. The interoperability between the two formats enables you to rapidly discover, digest, and interpret data in a familiar JSON-like representation, while underlying applications benefit from a reduction in storage, memory, network bandwidth, and latency from the binary format. 
This means you can write plain text queries that run against both text-based and binary-based Ion. You can rewrite parts of your data in text-based Ion when you need human readable data during development and switch to binary in production.\nWhen debugging a process, the ability for systems engineers to locate data and understand it as quickly as possible is vital. Ion provides mechanisms to move between binary and a text-based representation, optimizing for both the human and the machine. Athena supports querying and writing data in both of these Ion formats. The following is an example Ion text document taken from the transformed version of the <code>citylots</code> dataset:\n\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-code\"><code class=\" language-code\">{ \"type\": \"Feature\"\n, \"properties\": { \"MAPBLKLOT\": \"0004002\"\n ,\"BLKLOT\": \"0004002\"\n ,\"BLOCK_NUM\": \"0004\"\n , \"LOT_NUM\": \"002\"\n , \"FROM_ST\": \"0\"\n , \"TO_ST\": \"0\"\n , \"STREET\": \"UNKNOWN\"\n , \"ST_TYPE\": null\n , \"ODD_EVEN\": \"E\" }\n, \"geometry\": { \"type\": \"Polygon\"\n , \"coordinates\": [ [ [ -122.415701204606876, 37.808327252671461, 0.0 ],\n [ -122.415760743593196, 37.808630700240904, 0.0 ],\n [ -122.413787891332404, 37.808566801319841, 0.0 ],\n [ -122.415701204606876, 37.808327252671461, 0.0 ] ] ] } }</code></pre><div class=\"toolbar\"><div class=\"toolbar-item\">Code</div></div></div> \n </div>\n\n#### **Efficiency gains**\nBinary-encoded Ion reduces file size by moving repeated values, such as field names, into a <a href=\"https://amzn.github.io/ion-docs/docs/spec.html#symbol\" target=\"_blank\" rel=\"noopener noreferrer\">symbol</a> table. Symbol tables reduce CPU and read latency by limiting the validation of character encoding to the single instance of the value in the symbol table.\nFor example, a company that operates at Amazon’s scale can produce large volumes of application logs. 
When compressing Ion and JSON logs, we noticed approximately 35% less CPU time to compress the log, which produced an average of roughly 26% smaller files. Log files are critical when needed but costly to retain, so the reduction in file sizes combined with the read performance gains from symbol tables helps when handling these logs. The following is an example of file size reduction with the <code>citylots</code> JSON dataset when converted to Ion binary with GZIP and ZSTD compression:\n\n\n```\n77MB citylots.ion\n 17MB citylots.ion.gz\n 15MB citylots.ion.zst\n181MB citylots.json\n 22MB citylots.json.gz\n 18MB citylots.json.zst\n```\n#### **Skip-scanning**\nIn a textual format, every byte must be read and interpreted, but because Ion’s binary format is a TLV (type-length-value) encoding, an application may skip over elements that aren’t needed. This reduces query and application processing costs correlated with the proportion of unexamined fields.\nFor example, forensic analysis of application log data involves reading large volumes of data where only a fraction of the data is needed for diagnosis. In these scenarios, skip-scanning allows the binary Ion reader to move past irrelevant fields without the cost of reading the element stored within a field. This results in users experiencing lower resource usage and quicker response times.\n\n### **Query Ion datasets using Athena**\nAthena now supports querying and creating Ion-formatted datasets via an <a href=\"https://github.com/amzn/ion-hive-serde\" target=\"_blank\" rel=\"noopener noreferrer\">Ion-specific SerDe</a>, which in conjunction with <code>IonInputFormat</code> and <code>IonOutputFormat</code> allows you to read and write valid Ion data. Deserialization allows you to run SELECT queries on the Ion data so that it can be queried to gain insights. 
Serialization through CTAS or INSERT INTO queries allows you to copy datasets from existing tables’ values or generate new data in the Ion format.\nThe interchangeable nature of Ion text and Ion binary means that Athena can read datasets that contain both types of files. Because Ion is a superset of JSON, a table using the Ion SerDe can also include JSON files. Unlike the <a href=\"https://docs.aws.amazon.com/athena/latest/ug/json-serde.html\" target=\"_blank\" rel=\"noopener noreferrer\">JSON SerDe</a>, where every new line character indicates a new row, the Ion SerDe uses a combination of closing brackets and new line characters to determine new rows. This means that if each JSON record in your source documents isn’t on a single line, these files can now be read in Athena via the Ion SerDe.\n\n### **Create external tables**\nAthena supports querying Ion-based datasets by defining <a href=\"https://aws.amazon.com/glue\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Glue</a> tables with the user-defined metadata. Let’s start with an example of creating an external table for a dataset stored in Ion text. The following is a sample row from the <code>citylots</code> dataset:\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-code\"><code class=\" language-code\">{\n type:\"Feature\",\n properties:{\n mapblklot:\"0579021\",\n blklot:\"0579024\",\n block_num:\"0579\",\n lot_num:\"024\",\n from_st:\"2160\",\n to_st:\"2160\",\n street:\"PACIFIC\",\n st_type:\"AVE\",\n odd_even:\"E\"\n },\n geometry:{\n type:\"Polygon\",coordinates:[[[-122.4308798855922, ...]]]\n }\n}</code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"></div></div></div> \n </div>\nTo create an external table that has its data stored in Ion, you have two syntactic options.\nFirst, you can specify STORED AS ION. This is a more concise method, and is best used for simple cases, when no additional properties are required. 
See the following code:\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-sql\"><code class=\" language-sql\">CREATE EXTERNAL TABLE city_lots_ion1 (\n type STRING, \n properties struct<\n mapblklot:string,\n blklot:string,\n block_num:string,\n lot_num:string,\n from_st:string,\n to_st:string,\n street:string,\n st_type:string,\n odd_even:string>, \n geometry struct<\n type:string,\n coordinates:array<array<array<decimal(18,15)>>>,\n multi_coordinates:array<array<array<array<decimal(18,15)>>>>>\n)\nSTORED AS ION\nLOCATION 's3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_ion_binary/'</code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"></div></div></div> \n </div>\nAlternatively, you can explicitly specify the Ion classpaths in ROW FORMAT SERDE, INPUTFORMAT, and OUTPUTFORMAT. Unlike the first method, you can specify a SERDEPROPERTIES clause here. In our example DDL, we added a SerDe property that allows values that are outside of the Hive data type ranges to overflow rather than fail the query:\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-sql\"><code class=\" language-sql\">CREATE EXTERNAL TABLE city_lots_ion2(\n type STRING, \n properties struct<\n mapblklot:string,\n blklot:string,\n block_num:string,\n lot_num:string,\n from_st:string,\n to_st:string,\n street:string,\n st_type:string,\n odd_even:string>, \n geometry struct<\n type:string,\n coordinates:array<array<array<decimal(18,15)>>>,\n multi_coordinates:array<array<array<array<decimal(18,15)>>>>>\n)\nROW FORMAT SERDE \n 'com.amazon.ionhiveserde.IonHiveSerDe'\nWITH SERDEPROPERTIES (\n 'ion.fail_on_overflow'='false'\n )\nSTORED AS INPUTFORMAT \n 'com.amazon.ionhiveserde.formats.IonInputFormat' \nOUTPUTFORMAT \n 'com.amazon.ionhiveserde.formats.IonOutputFormat'\nLOCATION\n 's3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_ion_binary/'</code></pre><div class=\"toolbar\"><div 
class=\"toolbar-item\"></div></div></div> \n </div>\nAthena converts STORED AS ION into the explicit classpaths, so both tables look similar in the metastore. If we look in AWS Glue, we see that both tables we just created have the same input format, output format, and SerDe serialization library.\n\n![image.png](https://dev-media.amazoncloud.cn/f779018c121640cdbb74bcd91c84fb32_image.png)\nNow that our tables are created, we can run standard SELECT queries on the <code>city_lots_ion1</code> table. Let’s run a query that specifies the <code>block_num</code> from our example row of Ion data to verify that we can read from the table:\n\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-sql\"><code class=\" language-sql\">-- QUERY\nSELECT * FROM city_lots_ion1 WHERE properties.block_num='0579';</code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"></div></div></div> \n </div>\nThe following screenshot shows our results.\n\n![image.png](https://dev-media.amazoncloud.cn/5dce9e8139d94b1bb750c5d30e78e32e_image.png)\n### **Use path extraction to read from specific fields**\nAthena supports further customization of how data is interpreted via <a href=\"https://github.com/amzn/ion-hive-serde/blob/master/docs/serde-properties.md\" target=\"_blank\" rel=\"noopener noreferrer\">SerDe properties</a>. To specify these, you can add a WITH SERDEPROPERTIES clause, which is a subclause of the ROW FORMAT SERDE clause.\nIn some situations, we may only care about some parts of the information. Let’s suppose we don’t want any of the geometry info from the <code>citylots</code> dataset, and only need a few of the fields in <code>properties</code>. 
One solution is to specify a search path using the path extractor SerDe property:\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-sql\"><code class=\" language-sql\">-- Path Extractor property\nion.<column>.path_extractor = <search path></code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"></div></div></div> \n </div>\nPath extractors are search paths that Athena uses to map the table columns to locations in the individual document. Full information on what can be done with path extractors <a href=\"https://github.com/amzn/ion-java-path-extraction\" target=\"_blank\" rel=\"noopener noreferrer\">is available on GitHub</a>, but for our example, we focus on creating simple paths that use the name of each field as an index. In this case, the search path is a space-delimited list of field names, wrapped in parentheses, that indicates the location of each desired piece of information. We map the search paths to table columns by using the path extractor property.\nBy default, Athena builds path extractors dynamically based on column names unless overridden. 
This means that when we run our SELECT query on our <code>city_lots_ion1</code> table, Athena builds the following search paths:\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-sql\"><code class=\" language-sql\">-- Default extractors generated by Athena for city_lots_ion1\n-- Extracts the 'type' field to the 'type' column\n 'ion.type.path_extractor' = '(type)'\n\n-- Extracts the 'properties' field to the 'properties' column\n 'ion.properties.path_extractor' = '(properties)'\n\n-- Extracts the 'geometry' field to the 'geometry' column\n 'ion.geometry.path_extractor' = '(geometry)'</code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"></div></div></div> \n </div>\nAssuming we only care about the block and lot information from the properties struct, and the geometry type from the geometry struct, we can build search paths that map the desired fields from the row of data to table columns. First, let’s build the search paths:\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-sql\"><code class=\" language-sql\">(properties mapblklot) - Search path for the mapblklot field in the properties struct\n(properties blklot) - Search path for the blklot field in the properties struct\n(properties block_num) - Search path for the block_num field in the properties struct\n(properties lot_num) - Search path for the lot_num field in the properties struct\n(geometry type) - Search path for the type field in the geometry struct</code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"></div></div></div> \n </div>\nNow let’s map these search paths to table columns using the path extractor SerDe property. Because the search paths specify where to look for data, we are able to flatten and rename our datasets to better serve our purpose. 
For this example, let’s rename the <code>mapblklot</code> field to <code>map_block_lot</code>, <code>blklot</code> to <code>block_lot</code>, and the geometry type to shape:\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-code\"><code class=\" language-code\"> 'ion.map_block_lot.path_extractor' = '(properties mapblklot)'\n 'ion.block_lot.path_extractor' = '(properties blklot)'\n 'ion.block_num.path_extractor' = '(properties block_num)'\n 'ion.lot_num.path_extractor' = '(properties lot_num)'\n 'ion.shape.path_extractor' = '(geometry type)'</code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"></div></div></div> \n </div>\nLet’s put all of this together and create the <code>city_blocks</code> table:\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-sql\"><code class=\" language-sql\">CREATE EXTERNAL TABLE city_blocks (\n map_block_lot STRING,\n block_lot STRING,\n block_num STRING,\n lot_num STRING,\n shape STRING\n)\nROW FORMAT SERDE\n 'com.amazon.ionhiveserde.IonHiveSerDe'\nWITH SERDEPROPERTIES (\n 'ion.map_block_lot.path_extractor' = '(properties mapblklot)',\n 'ion.block_lot.path_extractor' = '(properties blklot)', \n 'ion.block_num.path_extractor' = '(properties block_num)',\n 'ion.lot_num.path_extractor' = '(properties lot_num)',\n 'ion.shape.path_extractor' = '(geometry type)'\n )\nSTORED AS ION\nLOCATION 's3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_ion_binary/'</code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"></div></div></div> \n </div>\nNow we can run a select query on the <code>city_blocks</code> table, and see the results:\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-sql\"><code class=\" language-sql\">-- Select Query\nSELECT * FROM city_blocks WHERE block_num='0579';</code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"></div></div></div> \n 
</div>\n\n![image.png](https://dev-media.amazoncloud.cn/7f1a949812c64291a1fb639e265f80b2_image.png)\nUtilizing search paths in this way enables skip-scan parsing when reading from Ion binary files, which allows Athena to skip over the unneeded fields and reduces the overall time it takes to run the query.\n\n### **Use CTAS and UNLOAD for data transformation**\n\nAthena supports <a href=\"https://docs.aws.amazon.com/athena/latest/ug/ctas.html\" target=\"_blank\" rel=\"noopener noreferrer\">CREATE TABLE AS SELECT</a> (CTAS), which creates a new table in Athena from the results of a SELECT statement from another query. Athena also supports <a href=\"https://docs.aws.amazon.com/athena/latest/ug/unload.html\" target=\"_blank\" rel=\"noopener noreferrer\">UNLOAD</a>, which writes query results to Amazon S3 from a SELECT statement to the specified data format.\nBoth CTAS and UNLOAD have a property to specify a format and a compression type. This allows you to easily convert Ion datasets to other data formats, such as Parquet or ORC, and vice versa, without needing to set up a complex extract, transform, and load (ETL) job. This is beneficial for situations when you want to transform your data, or know you will run repeated queries on a subset of your data and want to use some of the benefits inherent to columnar formats. Combining it with path extractors is especially helpful, because we’re only storing the data that we need in the new format.\nLet’s use CTAS to convert our <code>city_blocks</code> table from Ion to Parquet, and compress it via GZIP. 
Because we have path extractors set up for the <code>city_blocks</code> table, we only need to convert a small portion of the original dataset:\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-sql\"><code class=\" language-sql\">CREATE TABLE city_blocks_parquet_gzip\nWITH (format = 'PARQUET', write_compression='GZIP')\nAS SELECT * FROM city_blocks;</code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"></div></div></div> \n </div>\nWe can now run queries against the <code>city_blocks_parquet_gzip</code> table, and should see the same result. To test this out, let’s run the same SELECT query we ran before on the Parquet table:\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-sql\"><code class=\" language-sql\">SELECT * FROM city_blocks_parquet_gzip WHERE block_num='0579';</code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"></div></div></div> \n </div>\n\n![image.png](https://dev-media.amazoncloud.cn/cd019036c0e54d678c1e45b2acc4ba0f_image.png)\nWhen converting tables from another format to Ion, Athena supports the following compression codecs: ZSTD, BZIP2, GZIP, SNAPPY, and NONE. In addition to adding Ion as a new format for CTAS, we added the <code>ion_encoding</code> property, which allows you to choose whether the output files are created in Ion text or Ion binary. 
This allows for serialization of data from other formats back into Ion.\nLet’s convert the original <code>city_lots</code> JSON file back to Ion, but this time we specify that we want to use ZSTD compression and a binary encoding.\nThe JSON file can be found at the following location: <code>s3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_json/</code>\nBecause Ion is a superset of JSON, we can use the Ion SerDe to read this file:\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-sql\"><code class=\" language-sql\">CREATE EXTERNAL TABLE city_blocks_json_ion_serde (\n map_block_lot STRING,\n block_lot STRING,\n block_num STRING,\n lot_num STRING,\n shape STRING\n)\nROW FORMAT SERDE\n'com.amazon.ionhiveserde.IonHiveSerDe'\nWITH SERDEPROPERTIES (\n'ion.map_block_lot.path_extractor' = '(properties mapblklot)',\n'ion.block_lot.path_extractor' = '(properties blklot)',\n'ion.block_num.path_extractor' = '(properties block_num)',\n'ion.lot_num.path_extractor' = '(properties lot_num)',\n'ion.shape.path_extractor' = '(geometry type)'\n)\nSTORED AS ION\nLOCATION 's3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_json/'</code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"></div></div></div> \n </div>\nNow let’s copy this table into our desired Ion binary form:\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-sql\"><code class=\" language-sql\">CREATE TABLE city_blocks_ion_zstd\nWITH (format = 'ION', write_compression='ZSTD', ion_encoding='BINARY')\nAS SELECT * FROM city_blocks_json_ion_serde</code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"></div></div></div> \n </div>\nFinally, let’s run our SELECT statement to verify that everything was created properly:\n\n```\nSELECT * FROM city_blocks_ion_zstd WHERE block_num='0579';\n\n```\n\n![image.png](https://dev-media.amazoncloud.cn/536c19e02a9044f8be12b4d03a718d69_image.png)\n\n### **Use UNLOAD to store Ion data in Amazon 
S3**\n\nSometimes we just want to reformat the data and don’t need to store the additional metadata to query the table. In this case, we can use UNLOAD, which stores the results of the query in the specified format in an S3 bucket.\nLet’s test it out, using UNLOAD to convert rows from the <code>city_blocks_ion_zstd</code> table from Ion to ORC, compress them via ZLIB, and store them in an S3 bucket:\n<div class=\"hide-language\"> \n <div class=\"code-toolbar\"><pre class=\" language-sql\"><code class=\" language-sql\">UNLOAD (SELECT * FROM city_blocks_ion_zstd WHERE block_num='0579') \nTO 's3://<your-s3-bucket>/athena-ion-blog/unload/orc_zlib/'\nWITH (format = 'ORC', compression='ZLIB')</code></pre><div class=\"toolbar\"><div class=\"toolbar-item\"></div></div></div> \n </div>\nWhen you check in Amazon S3, you can find a new file in the ORC format.\n\n![image.png](https://dev-media.amazoncloud.cn/c18faaa826794792be5465041074b593_image.png)\n\n### **Conclusion**\nThis post introduced the new feature in Athena that allows you to query and create Ion datasets using standard SQL. We discussed use cases and unique features of the Ion format, such as its type system, dual formats (Ion text and Ion binary), efficiency gains, and skip-scanning. 
You can get started with querying an Ion dataset stored in Amazon S3 by simply creating a table in Athena, and you can convert existing datasets to and from the Ion format using CTAS and UNLOAD statements.\nTo learn more about querying Ion using Athena, refer to <a href=\"https://docs.aws.amazon.com/athena/latest/ug/ion-serde.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Ion Hive SerDe</a>.\n\n### **References**\n<ul> \n <li>Name of Dataset: City Lots San Francisco in .json</li> \n <li>Licenses and Notes: original dataset under PDDL 1.0; GitHub .json version under MIT-style license</li> \n <li><a href=\"https://github.com/zemirco/sf-city-lots-json#license\" target=\"_blank\" rel=\"noopener noreferrer\">Link(s) to Dataset(s)</a> (note: the original dataset is no longer available from DataSF)</li> \n <li><a href=\"//aws-bigdata-blog/artifacts/athena-ion-blog/\" target=\"_blank\" rel=\"noopener noreferrer\">Location of transformed version</a></li> \n </ul>\n\n![image.png](https://dev-media.amazoncloud.cn/eeee1f0a46de48c6aab4c27f1361c454_image.png)\nPathik Shah is a Sr. Big Data Architect on Amazon Athena. He joined AWS in 2015 and has been focusing on the big data analytics space since then, helping customers build scalable and robust solutions using AWS analytics services.\n\n![image.png](https://dev-media.amazoncloud.cn/a7d359d3ca914cbc9d5038d78fc520da_image.png)\nJacob Stein works on the Amazon Athena team as a Software Development Engineer. He led the project to add support for Ion in Athena. He loves working on technical problems unique to internet-scale data, and is passionate about developing scalable solutions for distributed systems.\n\n![image.png](https://dev-media.amazoncloud.cn/a10f3cbac13c4fe98acb4a11b4d5f805_image.png)\nGiovanni Matteo Fumarola is the Engineering Manager of the Athena Data Lake and Storage team. He is an Apache Hadoop Committer and PMC member. 
He has been focusing in the big data analytics space since 2013.\n\n![image.png](https://dev-media.amazoncloud.cn/06de775a780f4e2797641f3467f3891c_image.png)\nPete Ford is a Sr. Technical Program Manager at Amazon.\n\n\n\n\n\n","render":"<a href=\"http://aws.amazon.com/athena\" target=\"_blank\">Amazon Athena</a> is an interactive query service that makes it easy to analyze data in <a href=\"http://aws.amazon.com/s3\" target=\"_blank\">Amazon Simple Storage Service</a> (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.\n<a href=\"https://amzn.github.io/ion-docs/\" target=\"_blank\">Amazon Ion</a> is a <a href=\"https://amzn.github.io/ion-docs/guides/why.html#rich-type-system\" target=\"_blank\">richly typed</a>, <a href=\"https://amzn.github.io/ion-docs/guides/why.html#self-describing\" target=\"_blank\">self-describing</a>, hierarchical data serialization format offering <a href=\"https://amzn.github.io/ion-docs/guides/why.html#dual-format-interoperability\" target=\"_blank\">interchangeable binary and text</a> representations. The <a href=\"https://amzn.github.io/ion-docs/docs/spec.html\" target=\"_blank\">text format</a> extends <a href=\"http://json.org/\" target=\"_blank\">JSON</a> (meaning all JSON files are valid Ion files), and is easy to read and author, supporting rapid prototyping. The <a href=\"https://amzn.github.io/ion-docs/docs/binary.html\" target=\"_blank\">binary representation</a> is <a href=\"https://amzn.github.io/ion-docs/guides/why.html#read-optimized-binary-format\" target=\"_blank\">efficient to store, transmit, and skip-scan parse</a>. The rich type system provides unambiguous semantics for long-term preservation of data that can survive multiple generations of software evolution.\nAthena now supports querying and writing data in Ion format. 
The Ion format is currently used by internal Amazon teams, by external services such as <a href=\"https://aws.amazon.com/qldb/\" target=\"_blank\">Amazon Quantum Ledger Database</a> (Amazon QLDB) and <a href=\"https://aws.amazon.com/dynamodb/\" target=\"_blank\">Amazon DynamoDB</a> (which can be exported into Ion), and in the open-source SQL query language PartiQL.\nIn this post, we discuss use cases and the unique features Ion offers, followed by examples of querying Ion with Athena. For demonstration purposes, we use the transformed version of the <a href=\"https://github.com/zemirco/sf-city-lots-json\" target=\"_blank\">City Lots San Francisco</a> dataset.\n<h3><a id=\"Features_of_Ion_5\"></a>Features of Ion</h3>\nIn this section, we discuss some of the unique features that Ion offers:\n<ul> \n <li>Type system</li> \n <li>Dual format</li> \n <li>Efficiency gains</li> \n <li>Skip scanning</li> \n </ul>\n<h4><a id=\"Type_system_14\"></a>Type system</h4>\nIon extends JSON, adding support for more precise <a href=\"https://amzn.github.io/ion-docs/docs/spec.html\" target=\"_blank\">data types</a> to improve interpretability, simplify processing, and avoid rounding errors. These high precision numeric types are essential for financial services, where fractions of a cent on every transaction add up. Data types that are added are arbitrary-size integers, binary floating-point numbers, infinite-precision decimals, timestamps, CLOBS, and BLOBS.\n<h4><a id=\"Dual_format_17\"></a>Dual format</h4>\nUsers can be presented with a familiar text-based representation while benefiting from the performance efficiencies of a binary format. The interoperability between the two formats enables you to rapidly discover, digest, and interpret data in a familiar JSON-like representation, while underlying applications benefit from a reduction in storage, memory, network bandwidth, and latency from the binary format. 
This means you can write plain text queries that run against both text-based and binary-based Ion. You can rewrite parts of your data in text-based Ion when you need human readable data during development and switch to binary in production.\nWhen debugging a process, the ability for systems engineers to locate data and understand it as quickly as possible is vital. Ion provides mechanisms to move between binary and a text-based representation, optimizing for both the human and the machine. Athena supports querying and writing data in both of these Ion formats. The following is an example Ion text document taken from the transformed version of the <code>citylots</code> dataset:\n<div> \n <div><pre><code>{ \"type\": \"Feature\"\n, \"properties\": { \"MAPBLKLOT\": \"0004002\"\n ,\"BLKLOT\": \"0004002\"\n ,\"BLOCK_NUM\": \"0004\"\n , \"LOT_NUM\": \"002\"\n , \"FROM_ST\": \"0\"\n , \"TO_ST\": \"0\"\n , \"STREET\": \"UNKNOWN\"\n , \"ST_TYPE\": null\n , \"ODD_EVEN\": \"E\" }\n, \"geometry\": { \"type\": \"Polygon\"\n , \"coordinates\": [ [ [ -122.415701204606876, 37.808327252671461, 0.0 ],\n [ -122.415760743593196, 37.808630700240904, 0.0 ],\n [ -122.413787891332404, 37.808566801319841, 0.0 ],\n [ -122.415701204606876, 37.808327252671461, 0.0 ] ] ] } }</code></pre><div><div>Code</div></div></div> \n </div>\n<h4><a id=\"Efficiency_gains_39\"></a>Efficiency gains</h4>\nBinary-encoded Ion reduces file size by moving repeated values, such as field names, into a <a href=\"https://amzn.github.io/ion-docs/docs/spec.html#symbol\" target=\"_blank\">symbol</a> table. Symbol tables reduce CPU and read latency by limiting the validation of character encoding to the single instance of the value in the symbol table.\nFor example, a company that operates at Amazon’s scale can produce large volumes of application logs. When compressing Ion and JSON logs, we noticed approximately 35% less CPU time to compress the log, which produced an average of roughly 26% smaller files. 
Log files are critical when needed but costly to retain, so the reduction in file sizes combined with the read performance gains from symbol tables helps when handling these logs. The following is an example of file size reduction with the <code>citylots</code> JSON dataset when converted to Ion binary with GZIP and ZSTD compression:\n<pre><code class=\"lang-\"> 77MB citylots.ion\n 17MB citylots.ion.gz\n 15MB citylots.ion.zst\n181MB citylots.json\n 22MB citylots.json.gz\n 18MB citylots.json.zst\n</code></pre>\n<h4><a id=\"Skipscanning_52\"></a>Skip-scanning</h4>\nIn a textual format, every byte must be read and interpreted, but because Ion’s binary format is a TLV (type-length-value) encoding, an application may skip over elements that aren’t needed. This reduces query and application processing costs in proportion to the share of fields that go unexamined.\nFor example, forensic analysis of application log data involves reading large volumes of data where only a fraction of the data is needed for diagnosis. In these scenarios, skip-scanning allows the binary Ion reader to move past irrelevant fields without the cost of reading the element stored within a field. As a result, users experience lower resource usage and quicker response times.\n<h3><a id=\"Query_Ion_datasets_using_Athena_56\"></a>Query Ion datasets using Athena</h3>\nAthena now supports querying and creating Ion-formatted datasets via an <a href=\"https://github.com/amzn/ion-hive-serde\" target=\"_blank\">Ion-specific SerDe</a>, which in conjunction with <code>IonInputFormat</code> and <code>IonOutputFormat</code> allows you to read and write valid Ion data. Deserialization allows you to run SELECT queries on Ion data to gain insights. 
Serialization through CTAS or INSERT INTO queries allows you to copy datasets from existing tables’ values or generate new data in the Ion format.\nThe interchangeable nature of Ion text and Ion binary means that Athena can read datasets that contain both types of files. Because Ion is a superset of JSON, a table using the Ion SerDe can also include JSON files. Unlike the <a href=\"https://docs.aws.amazon.com/athena/latest/ug/json-serde.html\" target=\"_blank\">JSON SerDe</a>, where every new line character indicates a new row, the Ion SerDe uses a combination of closing brackets and new line characters to determine new rows. This means that if each JSON record in your source documents isn’t on a single line, these files can now be read in Athena via the Ion SerDe.\n<h3><a id=\"Create_external_tables_60\"></a>Create external tables</h3>\nAthena supports querying Ion-based datasets by defining <a href=\"https://aws.amazon.com/glue\" target=\"_blank\">AWS Glue</a> tables with the user-defined metadata. Let’s start with an example of creating an external table for a dataset stored in Ion text. The following is a sample row from the <code>citylots</code> dataset:\n<div> \n <div><pre><code>{\n type:\"Feature\",\n properties:{\n mapblklot:\"0579021\",\n blklot:\"0579024\",\n block_num:\"0579\",\n lot_num:\"024\",\n from_st:\"2160\",\n to_st:\"2160\",\n street:\"PACIFIC\",\n st_type:\"AVE\",\n odd_even:\"E\"\n },\n geometry:{\n type:\"Polygon\",coordinates:[[[-122.4308798855922, ...]]]\n }\n}</code></pre><div><div></div></div></div> \n </div>\nTo create an external table that has its data stored in Ion, you have two syntactic options.\nFirst, you can specify STORED AS ION. This is a more concise method, and is best used for simple cases, when no additional properties are required. 
See the following code:\n<div> \n <div><pre><code>CREATE EXTERNAL TABLE city_lots_ion1 (\n type STRING, \n properties struct<\n mapblklot:string,\n blklot:string,\n block_num:string,\n lot_num:string,\n from_st:string,\n to_st:string,\n street:string,\n st_type:string,\n odd_even:string>, \n geometry struct<\n type:string,\n coordinates:array<array<array<decimal(18,15)>>>,\n multi_coordinates:array<array<array<array<decimal(18,15)>>>>>\n)\nSTORED AS ION\nLOCATION 's3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_ion_binary/'</code></pre><div><div></div></div></div> \n </div>\nAlternatively, you can explicitly specify the Ion classpaths in ROW FORMAT SERDE, INPUTFORMAT, and OUTPUTFORMAT. Unlike the first method, you can specify a SERDEPROPERTIES clause here. In our example DDL, we added a SerDe property that allows values that are outside of the Hive data type ranges to overflow rather than fail the query:\n<div> \n <div><pre><code>CREATE EXTERNAL TABLE city_lots_ion2(\n type STRING, \n properties struct<\n mapblklot:string,\n blklot:string,\n block_num:string,\n lot_num:string,\n from_st:string,\n to_st:string,\n street:string,\n st_type:string,\n odd_even:string>, \n geometry struct<\n type:string,\n coordinates:array<array<array<decimal(18,15)>>>,\n multi_coordinates:array<array<array<array<decimal(18,15)>>>>>\n)\nROW FORMAT SERDE \n 'com.amazon.ionhiveserde.IonHiveSerDe'\nWITH SERDEPROPERTIES (\n 'ion.fail_on_overflow'='false'\n )\nSTORED AS INPUTFORMAT \n 'com.amazon.ionhiveserde.formats.IonInputFormat' \nOUTPUTFORMAT \n 'com.amazon.ionhiveserde.formats.IonOutputFormat'\nLOCATION\n 's3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_ion_binary/'</code></pre><div><div></div></div></div> \n </div>\nAthena converts STORED AS ION into the explicit classpaths, so both tables look similar in the metastore. 
If we look in AWS Glue, we see both tables we just created have the same input format, output format, and SerDe serialization library.\n<img src=\"https://dev-media.amazoncloud.cn/f779018c121640cdbb74bcd91c84fb32_image.png\" alt=\"image.png\" />\nNow that our tables are created, we can run standard SELECT queries on the <code>city_lots_ion1</code> table. Let’s run a query that specifies the <code>block_num</code> from our example row of Ion data to verify that we can read from the table:\n<div> \n <div><pre><code>-- QUERY\nSELECT * FROM city_lots_ion1 WHERE properties.block_num='0579';</code></pre><div><div></div></div></div> \n </div>\nThe following screenshot shows our results.\n<img src=\"https://dev-media.amazoncloud.cn/5dce9e8139d94b1bb750c5d30e78e32e_image.png\" alt=\"image.png\" />\n<h3><a id=\"Use_path_extraction_to_read_from_specific_fields_147\"></a>Use path extraction to read from specific fields</h3>\nAthena supports further customization of how data is interpreted via <a href=\"https://github.com/amzn/ion-hive-serde/blob/master/docs/serde-properties.md\" target=\"_blank\">SerDe properties</a>. To specify these, you can add a WITH SERDEPROPERTIES clause, which is a subfield of the ROW FORMAT SERDE field.\nIn some situations, we may only care about some parts of the information. Let’s suppose we don’t want any of the geometry info from the <code>citylots</code> dataset, and only need a few of the fields in properties. One solution is to specify a search path using the path extractor SerDe property:\n<div> \n <div><pre><code>-- Path Extractor property\nion.<column>.path_extractor = <search path></code></pre><div><div></div></div></div> \n </div>\nPath extractors are search paths that Athena uses to map the table columns to locations in the individual document. 
Full information on what can be done with path extractors <a href=\"https://github.com/amzn/ion-java-path-extraction\" target=\"_blank\">is available on GitHub</a>, but for our example, we focus on creating simple paths that use the names of each field as an index. In this case, a search path is a space-delimited list of indexes, wrapped in parentheses, that indicates the location of a desired piece of information. We map the search paths to table columns by using the path extractor property.\nBy default, Athena builds path extractors dynamically based on column names unless overridden. This means that when we run our SELECT query on our <code>city_lots_ion1</code> table, Athena builds the following search paths:\n<div> \n <div><pre><code>-- Default extractors generated by Athena for city_lots_ion1\n-- Extracts the 'type' field to the 'type' column\n'ion.type.path_extractor' = '(type)'\n-- Extracts the 'properties' field to the 'properties' column\n'ion.properties.path_extractor' = '(properties)'\n-- Extracts the 'geometry' field to the 'geometry' column\n'ion.geometry.path_extractor' = '(geometry)'</code></pre><div><div></div></div></div> \n</div>\nAssuming we only care about the block and lot information from the properties struct, and the geometry type from the geometry struct, we can build search paths that map the desired fields from the row of data to table columns. 
First let’s build the search paths:\n<div> \n <div><pre><code>(properties mapblklot) - Search path for the mapblklot field in the properties struct\n(properties blklot) - Search path for the blklot field in the properties struct\n(properties block_num) - Search path for the block_num field in the properties struct\n(properties lot_num) - Search path for the lot_num field in the properties struct\n(geometry type) - Search path for the type field in the geometry struct</code></pre><div><div></div></div></div> \n </div>\nNow let’s map these search paths to table columns using the path extractor SerDe property. Because the search paths specify where to look for data, we are able to flatten and rename our datasets to better serve our purpose. For this example, let’s rename the <code>mapblklot</code> field to <code>map_block_lot</code>, <code>blklot</code> to <code>block_lot</code>, and the geometry type to shape:\n<div> \n <div><pre><code> 'ion.map_block_lot.path_extractor' = '(properties mapblklot)'\n 'ion.block_lot.path_extractor' = '(properties blklot)'\n 'ion.block_num.path_extractor' = '(properties block_num)'\n 'ion.lot_num.path_extractor' = '(properties lot_num)'\n 'ion.shape.path_extractor' = '(geometry type)'</code></pre><div><div></div></div></div> \n </div>\nLet’s put all of this together and create the <code>city_blocks</code> table:\n<div> \n <div><pre><code>CREATE EXTERNAL TABLE city_blocks (\n map_block_lot STRING,\n block_lot STRING,\n block_num STRING,\n lot_num STRING,\n shape STRING\n)\nROW FORMAT SERDE\n 'com.amazon.ionhiveserde.IonHiveSerDe'\nWITH SERDEPROPERTIES (\n 'ion.map_block_lot.path_extractor' = '(properties mapblklot)',\n 'ion.block_lot.path_extractor' = '(properties blklot)', \n 'ion.block_num.path_extractor' = '(properties block_num)',\n 'ion.lot_num.path_extractor' = '(properties lot_num)',\n 'ion.shape.path_extractor' = '(geometry type)'\n )\nSTORED AS ION\nLOCATION 
's3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_ion_binary/'</code></pre><div><div></div></div></div> \n </div>\nNow we can run a select query on the <code>city_blocks</code> table, and see the results:\n<div> \n <div><pre><code>-- Select Query\nSELECT * FROM city_blocks WHERE block_num='0579';</code></pre><div><div></div></div></div> \n </div>\n<img src=\"https://dev-media.amazoncloud.cn/7f1a949812c64291a1fb639e265f80b2_image.png\" alt=\"image.png\" />\nUtilizing search paths in this way enables skip-scan parsing when reading from Ion binary files, which allows Athena to skip over the unneeded fields and reduces the overall time it takes to run the query.\n<h3><a id=\"Use_CTAS_and_UNLOAD_for_data_transformation_213\"></a>Use CTAS and UNLOAD for data transformation</h3>\nAthena supports <a href=\"https://docs.aws.amazon.com/athena/latest/ug/ctas.html\" target=\"_blank\">CREATE TABLE AS SELECT</a> (CTAS), which creates a new table in Athena from the results of a SELECT statement from another query. Athena also supports <a href=\"https://docs.aws.amazon.com/athena/latest/ug/unload.html\" target=\"_blank\">UNLOAD</a>, which writes query results to Amazon S3 from a SELECT statement to the specified data format.\nBoth CTAS and UNLOAD have a property to specify a format and a compression type. This allows you to easily convert Ion datasets to other data formats, such as Parquet or ORC, and vice versa, without needing to set up a complex extract, transform, and load (ETL) job. This is beneficial for situations when you want to transform your data, or know you will run repeated queries on a subset of your data and want to use some of the benefits inherent to columnar formats. Combining it with path extractors is especially helpful, because we’re only storing the data that we need in the new format.\nLet’s use CTAS to convert our <code>city_blocks</code> table from Ion to Parquet, and compress it via GZIP. 
Because we have path extractors set up for the <code>city_blocks</code> table, we only need to convert a small portion of the original dataset:\n<div> \n <div><pre><code>CREATE TABLE city_blocks_parquet_gzip\nWITH (format = 'PARQUET', write_compression='GZIP')\nAS SELECT * FROM city_blocks;</code></pre><div><div></div></div></div> \n </div>\nWe can now run queries against the <code>city_blocks_parquet_gzip</code> table, and should see the same result. To test this out, let’s run the same SELECT query we ran before on the Parquet table:\n<div> \n <div><pre><code>SELECT * FROM city_blocks_parquet_gzip WHERE block_num='0579';</code></pre><div><div></div></div></div> \n </div>\n<img src=\"https://dev-media.amazoncloud.cn/cd019036c0e54d678c1e45b2acc4ba0f_image.png\" alt=\"image.png\" />\nWhen converting tables from another format to Ion, Athena supports the following compression codecs: ZSTD, BZIP2, GZIP, SNAPPY, and NONE. In addition to adding Ion as a new format for CTAS, we added the <code>ion_encoding</code> property, which allows you to choose whether the output files are created in Ion text or Ion binary. 
This allows for serialization of data from other formats back into Ion.\nLet’s convert the original <code>city_lots</code> JSON file back to Ion, but this time we specify that we want to use ZSTD compression and a binary encoding.\nThe JSON file can be found at the following location: <code>s3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_json/</code>\nBecause Ion is a superset of JSON, we can use the Ion SerDe to read this file:\n<div> \n <div><pre><code>CREATE EXTERNAL TABLE city_blocks_json_ion_serde (\n map_block_lot STRING,\n block_lot STRING,\n block_num STRING,\n lot_num STRING,\n shape STRING\n)\nROW FORMAT SERDE\n'com.amazon.ionhiveserde.IonHiveSerDe'\nWITH SERDEPROPERTIES (\n'ion.map_block_lot.path_extractor' = '(properties mapblklot)',\n'ion.block_lot.path_extractor' = '(properties blklot)',\n'ion.block_num.path_extractor' = '(properties block_num)',\n'ion.lot_num.path_extractor' = '(properties lot_num)',\n'ion.shape.path_extractor' = '(geometry type)'\n)\nSTORED AS ION\nLOCATION 's3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_json/'</code></pre><div><div></div></div></div> \n </div>\nNow let’s copy this table into our desired Ion binary form:\n<div> \n <div><pre><code>CREATE TABLE city_blocks_ion_zstd\nWITH (format = 'ION', write_compression='ZSTD', ion_encoding='BINARY')\nAS SELECT * FROM city_blocks_json_ion_serde</code></pre><div><div></div></div></div> \n </div>\nFinally, let’s run our SELECT statement to verify everything was created properly:\n<pre><code class=\"lang-\">SELECT * FROM city_blocks_ion_zstd WHERE block_num='0579';\n</code></pre>\n<img src=\"https://dev-media.amazoncloud.cn/536c19e02a9044f8be12b4d03a718d69_image.png\" alt=\"image.png\" />\n<h3><a id=\"Use_UNLOAD_to_store_Ion_data_in_Amazon_S3_268\"></a>Use UNLOAD to store Ion data in Amazon S3</h3>\nSometimes we just want to reformat the data and don’t need to store the additional metadata to query the table. 
In this case, we can use UNLOAD, which stores the results of the query in the specified format in an S3 bucket.\nLet’s test it out, using UNLOAD to convert rows from the <code>city_blocks_ion_zstd</code> table from Ion to ORC, compress them via ZLIB, and store them in an S3 bucket:\n<div> \n <div><pre><code>UNLOAD (SELECT * FROM city_blocks_ion_zstd WHERE block_num='0579') \nTO 's3://<your-s3-bucket>/athena-ion-blog/unload/orc_zlib/'\nWITH (format = 'ORC', compression='ZLIB')</code></pre><div><div></div></div></div> \n </div>\nWhen you check in Amazon S3, you can find a new file in the ORC format.\n<img src=\"https://dev-media.amazoncloud.cn/c18faaa826794792be5465041074b593_image.png\" alt=\"image.png\" />\n<h3><a id=\"Conclusion_281\"></a>Conclusion</h3>\nThis post discussed the new feature in Athena that allows you to query and create Ion datasets using standard SQL. We covered use cases and unique features of the Ion format, such as its type system, dual formats (Ion text and Ion binary), efficiency gains, and skip-scanning. 
You can get started with querying an Ion dataset stored in Amazon S3 by creating a table in Athena, and you can convert existing datasets to and from Ion format using CTAS and UNLOAD statements.\nTo learn more about querying Ion using Athena, refer to <a href=\"https://docs.aws.amazon.com/athena/latest/ug/ion-serde.html\" target=\"_blank\">Amazon Ion Hive SerDe</a>.\n<h3><a id=\"References_285\"></a>References</h3>\n<ul> \n <li>Name of dataset: City Lots San Francisco in .json</li> \n <li>Licenses and notes: original dataset under PDDL 1.0; GitHub .json version under an MIT-style license</li> \n <li><a href=\"https://github.com/zemirco/sf-city-lots-json#license\" target=\"_blank\">Link(s) to Dataset(s)</a> (note: the original dataset is no longer available from DataSF)</li> \n <li><a href=\"//aws-bigdata-blog/artifacts/athena-ion-blog/\" target=\"_blank\">Location of transformed version</a></li> \n </ul>\n<img src=\"https://dev-media.amazoncloud.cn/eeee1f0a46de48c6aab4c27f1361c454_image.png\" alt=\"image.png\" />\nPathik Shah is a Sr. Big Data Architect on Amazon Athena. He joined AWS in 2015 and has been focusing on the big data analytics space since then, helping customers build scalable and robust solutions using AWS analytics services.\n<img src=\"https://dev-media.amazoncloud.cn/a7d359d3ca914cbc9d5038d78fc520da_image.png\" alt=\"image.png\" />\nJacob Stein works on the Amazon Athena team as a Software Development Engineer. He led the project to add support for Ion in Athena. He loves working on technical problems unique to internet-scale data, and is passionate about developing scalable solutions for distributed systems.\n<img src=\"https://dev-media.amazoncloud.cn/a10f3cbac13c4fe98acb4a11b4d5f805_image.png\" alt=\"image.png\" />\nGiovanni Matteo Fumarola is the Engineering Manager of the Athena Data Lake and Storage team. He is an Apache Hadoop Committer and PMC member. 
He has been focusing on the big data analytics space since 2013.\n<img src=\"https://dev-media.amazoncloud.cn/06de775a780f4e2797641f3467f3891c_image.png\" alt=\"image.png\" />\nPete Ford is a Sr. Technical Program Manager at Amazon.\n"}
