Build a high-performance, ACID-compliant, evolving data lake using Apache Iceberg on Amazon EMR

{"value":"[Amazon EMR](https://aws.amazon.com/emr/) is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning (ML) applications using open-source analytics frameworks such as [Apache Spark, Apache Hive](https://aws.amazon.com/emr/features/spark/), and [Presto](https://aws.amazon.com/emr/features/presto/).\n\n[Apache Iceberg](https://iceberg.apache.org/) is an open table format for huge analytic datasets. Table formats typically indicate the format and location of individual table files. Iceberg adds functionality on top of that to help manage petabyte-scale datasets as well as newer data lake requirements such as transactions, upsert/merge, time travel, and schema and partition evolution. Iceberg adds tables to compute engines including Spark, Trino, PrestoDB, Flink, and Hive using a high-performance table format that works just like a SQL table.\n\n[Amazon EMR](https://aws.amazon.com/cn/emr/?trk=cndc-detail) release 6.5.0 and later [includes Apache Iceberg](https://aws.amazon.com/about-aws/whats-new/2022/01/amazon-emr-supports-apache-iceberg/) so you can reliably work with huge tables with full support for ACID (Atomic, Consistent, Isolated, Durable) transactions in a highly concurrent and performant manner without getting locked into a single file format.\n\nIn this post, we discuss the modern data lake requirements and the challenges—including support for ACID transactions and concurrent writers, partition and schema evolution—that come with these. We also discuss how Iceberg solves these challenges. Additionally, we provide a step-by-step guide on how to get started with an Iceberg notebook in [Amazon EMR Studio](https://aws.amazon.com/emr/features/studio/).You can access this sample notebook from the GitHub repo. You can also find this notebook in your EMR Studio workspace under **Notebook Examples**.\n\n\n### **Modern data lake challenges**\n\n\n[Amazon EMR](https://aws.amazon.com/cn/emr/?trk=cndc-detail) integrates with [Amazon Simple Storage Service](http://aws.amazon.com/s3) ([Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail)) natively for persistent data storage, and allows you to independently scale your data in [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) and compute on your EMR cluster. This enables you to bring in data from multiple sources (for example, transactional data from operational databases, social media feeds, and SaaS data sources) using different tools, and each data source has its own transient EMR cluster to perform transformation and ingestion in parallel. You can now keep one central copy of your data and share it with multiple user groups that run analytics and even make in-place updates on a data lake. We’re increasingly seeing the following requirements (and challenges) emerge as mainstream:\n\n- Consistent reads and writes across multiple concurrent users – There are two primary concerns:\n\t- \tReader-writer isolation – When a job is updating a huge dataset, another job accessing the same data frequently works on a partially updated dataset, leaving the data in an inconsistent state.\n\t- Concurrent writes on the same dataset – Table formats relying on coarse-grained locks slow down the system. This limitation is even more telling in real-time streaming workloads.\n- Consistent table updates across multiple files or partitions – With Hive tables, writing to multiple partitions at once isn’t an atomic operation. 
- Consistent table updates across multiple files or partitions – With Hive tables, writing to multiple partitions at once isn’t an atomic operation. If you’re overwriting a partition, for instance, you might delete several files across partitions without any guarantee that you will replace them, potentially resulting in data loss. For huge tables, it’s not practical to use global locks and keep the readers and writers waiting. Common workarounds (such as rewriting all the data in all the partitions that need to be changed at the same time and then pointing to the new locations) cause huge data duplication and redundant extract, transform, and load (ETL) jobs.
- Continuous schema evolution – Simple DDL commands often render the data unusable. For instance, say a data engineer renames a column and writes some data. The consuming analytics tool now can’t read it because the metastore can’t track former names for columns. That rename operation has effectively dropped a column and added a new column, and now there is data written in both schemas. Historically, schema changes required expensive backfills and redundant ETL processes.
- Different query patterns on the same data – If you change the partitioning to optimize your queries after a year, say from daily to hourly, you have to rewrite the table with the new hour column as the partition. In addition, you have to rewrite queries to use the new partition column in your table.
- ACID transactions, streaming upserts, file size optimization, and data snapshots – Existing tools that support these features lock you into specific file formats, complicating interoperability across the analytics ecosystem.
- Support for mixed file formats – With existing solutions, if you rename a column in one file format (say Parquet, ORC, or Avro), you get a different behavior than if you rename a column in a different file format. There is also inconsistency in the data types supported by different file formats. These limitations necessitate additional ETL steps.

#### **The problem**

When multiple users share the same data, varied requirements ensue. The data platform needs to be transactional to handle concurrent upserts and reads.

Table formats such as Hive track a list of partitions inside the table within a data catalog. However, the underlying files are still not tracked transactionally, because we’re relying on an immutable object storage that is just not designed to be transactional. After the specific partitions to be updated or inserted have been identified, we still need to list all the files in those partitions at the leaf level of the partition hierarchy before we can filter out which of those files are relevant. For huge analytic datasets with thousands of files in each partition, listing all those files each time you run a query slows it down considerably. Furthermore, doing atomic commits—getting thousands of files in the table live in exactly the same moment—becomes impractical.

### **Apache Iceberg on [Amazon EMR](https://aws.amazon.com/cn/emr/?trk=cndc-detail)**

Iceberg development was started by Netflix in December 2017, and the project was donated to the Apache Software Foundation in November 2018 as an [incubator project](https://incubator.apache.org/projects/iceberg.html). In May 2020, it [graduated](https://incubator.apache.org/projects/iceberg.html) from the incubator.
Iceberg on [Amazon EMR](https://aws.amazon.com/cn/emr/?trk=cndc-detail) comes completely integrated and tested for running in production, backed by Enterprise Support. This means you get 24/7 technical support from [Amazon EMR](https://aws.amazon.com/cn/emr/?trk=cndc-detail) experts, tools and technology to automatically manage the health of your environment, and consultative architectural, performance, and troubleshooting guidance on Iceberg issues.

Iceberg has integrations with other AWS services. For example, you can use the [AWS Glue](https://aws.amazon.com/glue) Data Catalog as the metastore for Iceberg tables. Iceberg also supports other catalog types such as Hive, Hadoop, [Amazon DynamoDB](https://aws.amazon.com/cn/dynamodb/?trk=cndc-detail), [Amazon Relational Database Service](http://aws.amazon.com/rds) ([Amazon RDS](https://aws.amazon.com/cn/rds/?trk=cndc-detail)), and other custom implementations. When using [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) as the data catalog, the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) database serves as your Iceberg namespace. Similarly, the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) table and [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) TableVersion serve as the Iceberg table and table version, respectively. Your [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Data Catalog can be in the same or a different account, or even a different Region, making multi-account, multi-Region pipelines easily deployable. [Amazon Athena](http://aws.amazon.com/athena) [supports](https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg.html) read, time travel, write, and DDL queries for Apache Iceberg tables that use the Apache Parquet format for data and the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Data Catalog for their metastore.

### **How Iceberg addresses these challenges**

Iceberg tracks individual data files in a table instead of simply maintaining a pointer to high-level table or partition locations. This allows writers to create data files in place and only add files to the table in an explicit commit. Every time new datasets are ingested into the table, a new point-in-time snapshot gets created. At query time, there is no need to list a directory to find the files we need to work with, because the snapshot already has that information pre-populated at write time. Because of this design, Iceberg solves the problems listed earlier in the following ways:

- Consistent reads and writes across multiple concurrent users – Iceberg relies on [optimistic concurrency](https://en.wikipedia.org/wiki/Optimistic_concurrency_control) to support concurrent reads and writes from multiple user groups. If two operations are running at the same time, only one of them will succeed. The other job will retry, but that retry is implicit to the user and happens at the metadata level. If Iceberg detects that the second update is not in conflict, it commits it successfully.
- Consistent table updates across multiple partitions – In Iceberg, the partition of a file isn’t determined by the physical location of the files within directories or prefixes. Instead, Iceberg stores partition information within the manifests of the data files. Therefore, updates across multiple partitions entail a simple, atomic metadata change.
- Continuous schema evolution – Iceberg tracks columns by unique IDs rather than by column name, which enables easy schema evolution. You can safely add, drop, rename, or even reorder columns. You can also update column data types if the update is safe (such as widening from INT to BIGINT or from FLOAT to DOUBLE).
- Different query patterns on the same data – Iceberg keeps track of the relationship between partitioning values and the column that they came from. Logical data is decoupled from the physical layout, which enables easy partition evolution as well. Partition values can be implicitly derived using a transform such as day(timestamp) or hour(timestamp) of an existing column.
- ACID transactions, streaming upserts, file size optimization, and data snapshots – Iceberg supports ACID transactions with serializable isolation. Furthermore, Iceberg supports deletes, upserts, change data capture (CDC), time travel (getting the state of the data from a past time regardless of the current state of the data), and compaction (consolidating small files into larger files to reduce metadata overhead and improve query speed). Table changes are atomic, and readers never see partial or uncommitted changes.
- Support for mixed file formats – Because schema fields are tracked by unique IDs independent of the underlying file format, you can have consistent queries across file formats such as Avro, Parquet, and ORC.

### **Using Apache Iceberg with [Amazon EMR](https://aws.amazon.com/cn/emr/?trk=cndc-detail)**

In this post, we demonstrate creating an [Amazon EMR](https://aws.amazon.com/cn/emr/?trk=cndc-detail) cluster that supports Iceberg using the [AWS Command Line Interface](http://aws.amazon.com/cli) (AWS CLI). You can also create the cluster from the [Amazon EMR console](https://console.aws.amazon.com/elasticmapreduce). We use [Amazon EMR](https://aws.amazon.com/cn/emr/?trk=cndc-detail) Studio to run notebook code on our EMR cluster. To set up an EMR Studio, refer to [Set up an EMR Studio](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-studio-set-up.html). First, we note down the subnets that we specified when we created our EMR Studio. Now we launch our EMR cluster using the AWS CLI:

```
aws emr create-cluster \
--name iceberg-emr-cluster \
--use-default-roles \
--release-label emr-6.6.0 \
--instance-count 1 \
--instance-type r5.4xlarge \
--applications Name=Hadoop Name=Livy Name=Spark Name=JupyterEnterpriseGateway \
--ec2-attributes SubnetId=<EMR-STUDIO-SUBNET> \
--configurations '[{"Classification":"iceberg-defaults","Properties":{"iceberg.enabled":"true"}},{"Classification":"spark-hive-site","Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}}]'
```

We choose `emr-6.6.0` as the release label. This release comes with Iceberg version 0.13.1 pre-installed. We launch a single-node EMR cluster with the r5.4xlarge instance type and the following applications installed: Hadoop, Spark, Livy, and Jupyter Enterprise Gateway. Make sure that you replace `<EMR-STUDIO-SUBNET>` with a subnet ID from the list of EMR Studio subnets you noted earlier.
We need to enable Iceberg and the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Data Catalog on our cluster. To do this, we use the following configuration classifications:

```
[
  {
    "Classification": "iceberg-defaults",
    "Properties": {
      "iceberg.enabled": "true"
    }
  },
  {
    "Classification": "spark-hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]
```

### **Initial setup**

Let’s first create an S3 bucket in the same Region as the EMR cluster to save a sample dataset that we’re going to create and work with. In this post, we use the placeholder bucket name `YOUR-BUCKET-NAME`. Remember to replace this with a globally unique bucket name when testing this out in your environment. From our EMR Studio workspace, we attach our cluster and use the PySpark kernel.

You can upload the sample notebook from the [GitHub repo](https://github.com/aws-samples/emr-studio-notebook-examples/blob/iceberg/examples/iceberg-example-notebook.ipynb) or use the Iceberg example under **Notebook Examples** in your own EMR Studio workspace and run the cells following the instructions in the notebook.

### **Configure a Spark session**

In this command, we set our [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Data Catalog name as `glue_catalog1`. You can replace it with a different name, but if you do so, remember to change the Data Catalog name throughout this example, because we use the fully qualified table name, including the Data Catalog name, in all of our commands going forward. In the following command, remember to replace `YOUR-BUCKET-NAME` with your own bucket name:

```
%%configure -f
{
  "conf": {
    "spark.sql.catalog.glue_catalog1": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue_catalog1.warehouse": "s3://YOUR-BUCKET-NAME/iceberg/glue_catalog1/tables/",
    "spark.sql.catalog.glue_catalog1.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.glue_catalog1.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.glue_catalog1.lock-impl": "org.apache.iceberg.aws.glue.DynamoLockManager",
    "spark.sql.catalog.glue_catalog1.lock.table": "myGlueLockTable",
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
  }
}
```

Let’s assume that the name of your catalog is `glue_catalog1`. The preceding code has the following components:

- `glue_catalog1.warehouse` points to the [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) path where you want to store your data and metadata.
- To make the catalog an [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Data Catalog, set `glue_catalog1.catalog-impl` to `org.apache.iceberg.aws.glue.GlueCatalog`. This key is required to point to an implementation class for any custom catalog implementation.
- Use `org.apache.iceberg.aws.s3.S3FileIO` as the `glue_catalog1.io-impl` in order to take advantage of [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) multipart upload for high parallelism.
- We use an [Amazon DynamoDB](https://aws.amazon.com/dynamodb/) table for the lock implementation. This is optional, and is recommended for high-concurrency workloads. To do that, we set `lock-impl` for our catalog to `org.apache.iceberg.aws.glue.DynamoLockManager` and set `lock.table` to `myGlueLockTable` as the table name, so that for every commit, the Data Catalog first obtains a lock using this table and then tries to safely modify the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) table. If you choose this option, the table gets created in your own account. Note that you need the necessary [access permissions](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/access-control-overview.html) to create and use a DynamoDB table. Furthermore, additional DynamoDB charges apply.

Now that you’re all set with your EMR cluster for compute, S3 bucket for data, and [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Data Catalog for metadata, you can start creating a table and running DML statements.

For all commands going forward, we use the `%%sql` cell magic to run Spark SQL commands in our EMR Studio notebook. However, for brevity, we don’t always show the cell magic command, so you may need to add it in your Studio notebook for the SQL commands to work.

### **Create an Iceberg table in the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Data Catalog**

The default catalog is the `AwsDataCatalog`. Let’s switch to our [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) catalog `glue_catalog1`, which has support for Iceberg tables. There are no namespaces yet. A namespace in Iceberg is the same thing as a database in [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail).

```
%%sql
use glue_catalog1
```
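The table we create next lives in a `salesdb` database (namespace), but the post doesn’t show that database being created. If it doesn’t already exist in your [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Data Catalog, a minimal sketch of how you might create it from the notebook, assuming the catalog configuration shown earlier:

```
%%sql
CREATE DATABASE IF NOT EXISTS glue_catalog1.salesdb
```

You can equally create the `salesdb` database from the AWS Glue console before running the notebook.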
Let’s create a table called `orders`. The DDL syntax looks the same as creating a Hive table, for example, except that we include `USING iceberg`:

```
CREATE TABLE glue_catalog1.salesdb.orders
    (
      order_id int,
      product_name string,
      product_category string,
      qty int,
      unit_price decimal(7,2),
      order_datetime timestamp
    )
USING iceberg
PARTITIONED BY (days(order_datetime))
```

Note that we’re also partitioning this table by extracting the day out of the `order_datetime` column. We don’t have to create a separate column for the partition.

### **DML statements**

We then insert records into our table. Here is an example:

```
INSERT INTO glue_catalog1.salesdb.orders VALUES 
    (
      1, 
      'Harry Potter and the Prisoner of Azkaban',
      'Books',
      2,
      7.99,
      current_timestamp()
    )
```

DML statements result in snapshots getting created. Note the `snapshot_id` and the timestamp column called `committed_at`:

```
SELECT * FROM glue_catalog1.salesdb.orders.snapshots;
```

![image.png](https://dev-media.amazoncloud.cn/df1e5c06515b46b4b20db37ba5c54d15_image.png)

We now insert four more records and then query the `orders` table to confirm that all five records are present.
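The post doesn’t list those four additional INSERT statements; a minimal sketch of what they might look like (the products, quantities, and prices are purely illustrative):

```
INSERT INTO glue_catalog1.salesdb.orders VALUES 
    (2, 'Harry Potter and the Goblet of Fire', 'Books', 1, 8.99, current_timestamp()),
    (3, 'Chess Set', 'Toys', 1, 24.99, current_timestamp()),
    (4, 'Telescope', 'Electronics', 1, 129.99, current_timestamp()),
    (5, 'Garden Gnome', 'Home', 2, 15.49, current_timestamp())
```

Querying the table then returns all five rows: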
```
SELECT * FROM glue_catalog1.salesdb.orders
```

![image.png](https://dev-media.amazoncloud.cn/990aaffc383b4c6399ee7f1552fc8b0f_image.png)

### **Querying from Athena**

Because Iceberg on [Amazon EMR](https://aws.amazon.com/cn/emr/?trk=cndc-detail) comes pre-integrated with the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Data Catalog, we can now query the Iceberg tables from AWS analytics services that support Iceberg. Let’s query the `salesdb.orders` table from Athena, as shown in the following screenshot.

![image.png](https://dev-media.amazoncloud.cn/ef5837ad2504408ea343b1cd89028ee2_image.png)

### **Upserts**

The notebook then gives examples for updates and deletes, and even upserts. We use the MERGE INTO statement for upserts, which uses a source table `orders_update` containing new and updated records.
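The post doesn’t show how `orders_update` is created or populated; a minimal sketch, assuming it shares the `orders` schema (the changed values are purely illustrative):

```
CREATE TABLE glue_catalog1.salesdb.orders_update
USING iceberg
AS
SELECT order_id, product_name, product_category, 5 AS qty, unit_price, current_timestamp() AS order_datetime
FROM glue_catalog1.salesdb.orders
WHERE order_id = 1    -- an existing order with an updated quantity
UNION ALL
SELECT 6, 'Mechanical Keyboard', 'Electronics', 1, CAST(49.99 AS decimal(7,2)), current_timestamp()    -- a brand-new order
```

With a source table in place, the MERGE INTO statement applies the upsert, and the final SELECT confirms the result: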
```
MERGE INTO glue_catalog1.salesdb.orders target 
USING glue_catalog1.salesdb.orders_update source 
ON target.order_id = source.order_id 
WHEN MATCHED THEN 
    UPDATE SET
        order_id = source.order_id,
        product_name = source.product_name,
        product_category = source.product_category,
        qty = source.qty,
        unit_price = source.unit_price,
        order_datetime = source.order_datetime
WHEN NOT MATCHED THEN
    INSERT *
select * from glue_catalog1.salesdb.orders;
```

### **Schema evolution**

We then walk through schema evolution using simple ALTER TABLE commands to add, rename, and drop columns. The following example shows how simple it is to rename a column:

```
ALTER TABLE glue_catalog1.salesdb.orders RENAME COLUMN qty TO quantity
DESC table glue_catalog1.salesdb.orders
```

### **Time travel**

Iceberg also allows us to travel backward or forward in time by storing point-in-time snapshots. We can travel using the timestamps at which the snapshots were created or directly using the `snapshot_id`. The following is an example of a CALL statement that uses `rollback_to_snapshot`:

```
%%sql
CALL glue_catalog1.system.rollback_to_snapshot('salesdb.orders', 8008410363488501197)
```

We then travel forward in time by calling `set_current_snapshot`:

```
%%sql
CALL glue_catalog1.system.set_current_snapshot('salesdb.orders', 8392090950225782953)
```
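The post demonstrates time travel only by snapshot ID. Iceberg also provides a `rollback_to_timestamp` procedure if you would rather reference a point in time than a specific snapshot; a minimal sketch (the timestamp is illustrative):

```
%%sql
CALL glue_catalog1.system.rollback_to_timestamp('salesdb.orders', TIMESTAMP '2022-06-01 10:00:00')
```

This rolls the table back to the most recent snapshot committed before the given timestamp.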
### **Partition evolution**

The notebook ends with an example that shows how partition evolution works in Iceberg. Iceberg stores partition information as part of the metadata. Because there is no separate partition column in the data itself, changing the partitioning scheme to hourly partitions, for example, is just a matter of calling a different partition transform, `hours(…)`, on the existing `order_datetime` column, as shown in the following example:

```
%%sql
ALTER TABLE glue_catalog1.salesdb.orders ADD PARTITION FIELD hours(order_datetime)
```

You can continue to use the old partition spec on the old data. New data is written using the new spec in a new layout. Metadata for each of the partition versions is kept separately.

The notebook shows how you can query the table using the new hourly partition:

```
%%sql
SELECT * FROM glue_catalog1.salesdb.orders where hour(order_datetime)=1
```

You can continue to query your old data using the `day()` transform. There is still only the original `order_datetime` column in the table:

```
%%sql
SELECT * FROM glue_catalog1.salesdb.orders where day(order_datetime)>=14
```

You don’t have to store additional columns to accommodate multiple partitioning schemes. The partition definitions are in the metadata, providing the flexibility to evolve and change the partition definitions in the future.

### **Conclusion**

In this post, we introduced Apache Iceberg and explained how Iceberg solves some challenges in modern data lakes. We then walked you through how to run Iceberg on [Amazon EMR](https://aws.amazon.com/cn/emr/?trk=cndc-detail) using the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Data Catalog as the metastore and query the data using Athena. You can also run upserts on this data from Athena. There is no additional cost to using Iceberg with [Amazon EMR](https://aws.amazon.com/cn/emr/?trk=cndc-detail).

For more information about Iceberg, refer to [How Iceberg works](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-iceberg-how-it-works.html). Iceberg on [Amazon EMR](https://aws.amazon.com/cn/emr/?trk=cndc-detail), with its integration with AWS analytics services, can simplify the way you process, upsert, and delete data, with full support for ACID transactions in [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail). You can also implement schema evolution, partition evolution, time travel, and compaction of data.

### **About the Author**

![image.png](https://dev-media.amazoncloud.cn/93d81d0012c8475dba9ba552836261ad_image.png)

**Sekar Srinivasan** is a Sr. Specialist Solutions Architect at AWS focused on Big Data and Analytics. Sekar has over 20 years of experience working with data. He is passionate about helping customers build scalable solutions, modernize their architecture, and generate insights from their data. In his spare time, he likes to work on non-profit projects, especially those focused on underprivileged children’s education.