Join the Preview – AWS Glue Data Quality

{"value":"Back in 1980, at my second professional programming job, I was working on a project that analyzed driver’s license data from a bunch of US states. At that time data of that type was generally stored in fixed-length records, with values carefully (or not) encoded into each field. Although we were given schemas for the data, we would invariably find that the developers had to resort to tricks in order to represent values that were not anticipated up front. For example, coding for someone with [heterochromia](https://en.wikipedia.org/wiki/Heterochromia_iridum), eyes of different colors. We ended up doing a full scan of the data ahead of our actual time-consuming and expensive analytics run in order to make sure that we were dealing with known data. This was my introduction to data quality, or the lack thereof.\n\nAWS makes it easier for you to\n\nAWS makes it easier for you to build [data lakes](https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/) and [data warehouses](https://aws.amazon.com/data-warehouse/) at any scale. We want to make it easier than ever before for you to measure and maintain the desired quality level of the data that you ingest, process, and share.\n\n### ++Introducing [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Data Quality++\nToday I would like to tell you about [AWS Glue Data Quality](https://aws.amazon.com/glue/features/data-quality/), a new set of features for [AWS Glue](https://aws.amazon.com/glue/) that we are launching in preview form. It can analyze your tables and recommend a set of rules automatically based on what it finds. You can fine-tune those rules if necessary and you can also write your own rules. In this blog post I will show you a few highlights, and will save the details for a full post when these features progress from preview to generally available.\n\n\nEach data quality rule references a Glue table or selected columns in a Glue table and checks for specific types of properties: timeliness, accuracy, integrity, and so forth. For example, a rule can indicate that a table must have the expected number of columns, that the column names match a desired pattern, and that a specific column is usable as a primary key.\n\n### ++Getting Started++\nI can open the new **Data quality** tab on one of my Glue tables to get started. From there I can create a ruleset manually, or I can click **Recommend ruleset** to get started:\n\n![image.png](https://dev-media.amazoncloud.cn/a43f082cd20b43cabf5e9df51691e9d1_image.png)\n\n\nThen I enter a name for my Ruleset (**RS1**), choose an IAM Role that has permission to access it, and click **Recommend ruleset**:\n\n![image.png](https://dev-media.amazoncloud.cn/374f0345db1947198a898bf85ebea2b9_image.png)\n\nMy click initiates a Glue Recommendation task (a specialized type of Glue job) that scans the data and makes recommendations. Once the task has run to completion I can examine the recommendations:\n\n![image.png](https://dev-media.amazoncloud.cn/df10ac3a98364d3bb4ff9be03d107429_image.png)\n\nI click **Evaluate ruleset** to check on the quality of my data.\n\n![image.png](https://dev-media.amazoncloud.cn/e9b287459d98432bbbb97a8281586127_image.png)\n\nThe data quality task runs and I can examine the results:\n\n![image.png](https://dev-media.amazoncloud.cn/72f47f90b5884f479fb02707ce6f2680_image.png)\n\nIn addition to creating Rulesets that are attached to tables, I can use them as part of a Glue job. 
### Getting Started
I can open the new **Data quality** tab on one of my Glue tables to get started. From there I can create a ruleset manually, or I can click **Recommend ruleset**:

![image.png](https://dev-media.amazoncloud.cn/a43f082cd20b43cabf5e9df51691e9d1_image.png)

Then I enter a name for my Ruleset (**RS1**), choose an IAM Role that has permission to access the table, and click **Recommend ruleset**:

![image.png](https://dev-media.amazoncloud.cn/374f0345db1947198a898bf85ebea2b9_image.png)

My click initiates a Glue Recommendation task (a specialized type of Glue job) that scans the data and makes recommendations. Once the task has run to completion I can examine the recommendations:

![image.png](https://dev-media.amazoncloud.cn/df10ac3a98364d3bb4ff9be03d107429_image.png)
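The console drives all of this for me, but the same steps can be scripted. Here is a minimal sketch using boto3, assuming the database, table, and IAM role already exist; every name below is a placeholder:

```python
import time

import boto3

glue = boto3.client("glue")

data_source = {"GlueTable": {"DatabaseName": "my_database", "TableName": "my_table"}}
role_arn = "arn:aws:iam::123456789012:role/GlueDataQualityRole"  # placeholder role

# Start a recommendation run (the scripted counterpart of "Recommend ruleset").
rec = glue.start_data_quality_rule_recommendation_run(
    DataSource=data_source,
    Role=role_arn,
    CreatedRulesetName="RS1",
)

# Poll until the run finishes, then print the recommended DQDL ruleset.
while True:
    run = glue.get_data_quality_rule_recommendation_run(RunId=rec["RunId"])
    if run["Status"] in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)
print(run.get("RecommendedRuleset", ""))

# Evaluate the (possibly fine-tuned) ruleset against the same table.
glue.start_data_quality_ruleset_evaluation_run(
    DataSource=data_source,
    Role=role_arn,
    RulesetNames=["RS1"],
)
```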
I click **Evaluate ruleset** to check on the quality of my data:

![image.png](https://dev-media.amazoncloud.cn/e9b287459d98432bbbb97a8281586127_image.png)

The data quality task runs and I can examine the results:

![image.png](https://dev-media.amazoncloud.cn/72f47f90b5884f479fb02707ce6f2680_image.png)

In addition to creating Rulesets that are attached to tables, I can use them as part of a Glue job. I create my job as usual and then add an **Evaluate Data Quality** node:

![image.png](https://dev-media.amazoncloud.cn/f4e893b29dfd4f6a8e7bb6fbfc35da17_image.png)

Then I use the Data Quality Definition Language (DQDL) builder to create my rules. I can choose from 20 different rule types:

![image.png](https://dev-media.amazoncloud.cn/ab245bf257d74b9fa07242e9257c834f_image.png)

For this blog post, I made these rules stricter than necessary so that I could show you what happens when the data quality evaluation fails.

I can set the job options, and choose either the original data or the data quality results as the output of the transform. I can also write the data quality results to an S3 bucket:

![image.png](https://dev-media.amazoncloud.cn/546fbf936e9246bbbb9f8459aaf1c8c1_image.png)
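Behind the visual editor, the **Evaluate Data Quality** node becomes a transform in the generated job script. The sketch below shows roughly what that looks like in a Glue Python job; the ruleset, context name, and S3 prefix are placeholders, and the code the editor actually generates may differ in detail:

```python
from awsglue.context import GlueContext
from awsgluedq.transforms import EvaluateDataQuality
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the input data (database and table names are placeholders).
frame = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table"
)

# A deliberately strict ruleset, so the evaluation can fail as in the post.
ruleset = """Rules = [ RowCount > 1000000, IsComplete "eye_color" ]"""

# Evaluate the ruleset and publish the results, including a copy to S3.
results = EvaluateDataQuality.apply(
    frame=frame,
    ruleset=ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "EvaluateDataQuality_node",
        "enableDataQualityCloudWatchMetrics": True,
        "enableDataQualityResultsPublishing": True,
        "resultsS3Prefix": "s3://my-bucket/dq-results/",
    },
)
results.toDF().show(truncate=False)
```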
After I have created my Ruleset, I set any other desired options for the job, save it, and then run it. After the job completes I can find the results in the **Data quality** tab. Because I made some overly strict rules, the evaluation correctly flagged my data with a 0% score:

![image.png](https://dev-media.amazoncloud.cn/38889e627372466ea32aceb941ac4289_image.png)

There's a lot more, but I will save that for the next blog post!

### Things to Know
**Preview Regions** – This is an open preview and you can access it today in the US East (Ohio, N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland) AWS Regions.

**Pricing** – Evaluating data quality consumes Glue Data Processing Units (DPUs) in the same manner, and at the same per-DPU pricing, as any other Glue job.

— [Jeff](https://twitter.com/jeffbarr)

![image.png](https://dev-media.amazoncloud.cn/40c8482b86a2487c822e7fcc839ad7e8_image.png)

### [Jeff Barr](https://aws.amazon.com/blogs/aws/author/jbarr/)
Jeff Barr is Chief Evangelist for AWS. He started this blog in 2004 and has been writing posts just about non-stop ever since.