Manage data transformations with dbt in Amazon Redshift

[Amazon Redshift](https://aws.amazon.com/cn/redshift/) is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. Amazon Redshift enables you to use your data to acquire new insights for your business and customers while keeping costs low.

Together with [price-performance](https://aws.amazon.com/cn/blogs/big-data/amazon-redshift-continues-its-price-performance-leadership/), customers want to manage data transformations (SQL SELECT statements written by data engineers, data analysts, and data scientists) in Amazon Redshift with features including modular programming and data lineage documentation.

[dbt (data build tool)](https://www.getdbt.com/) is a framework that supports these features and more to manage data transformations in Amazon Redshift. There are two interfaces for dbt:

- [dbt CLI](https://docs.getdbt.com/dbt-cli/cli-overview) – Available as an open-source project
- [dbt Cloud](https://docs.getdbt.com/docs/dbt-cloud/cloud-overview) – A hosted service with added features including an IDE, job scheduling, and more

In this post, we demonstrate some features in dbt that help you manage data transformations in Amazon Redshift. We also provide the [dbt CLI and Amazon Redshift workshop](https://catalog.workshops.aws/dbt-cli-and-amazon-redshift/en-US) to get started using these features.

#### **Manage common logic**

dbt enables you to write SQL in a modular fashion. This improves maintainability and productivity because common logic can be consolidated (maintain a single instance of logic) and referenced (build on existing logic instead of starting from scratch).

The following figure is an example showing how dbt consolidates common logic. In this example, two [models](https://docs.getdbt.com/docs/building-a-dbt-project/building-models) rely on the same subquery.
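Rather than repeating that subquery in both models, it can be factored into a model of its own and pulled in with dbt's `ref()` function. The following is a minimal sketch of the pattern; the model, table, and column names are hypothetical:

```sql
-- models/active_users.sql (hypothetical): the shared subquery,
-- factored out into its own model
select user_id, user_name
from public.users
where status = 'active'

-- models/rpt_daily_logins.sql (hypothetical): builds on the shared
-- model via ref() instead of repeating the subquery
select user_id, count(*) as login_count
from {{ ref('active_users') }}
group by user_id
```

At compile time, dbt resolves `{{ ref('active_users') }}` to the view or table it materialized for that model, so the subquery logic lives in exactly one place.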
Instead of replicating the subquery, dbt allows you to create a model for the subquery and [reference](https://docs.getdbt.com/reference/dbt-jinja-functions/ref) it later.

![image.png](https://dev-media.amazoncloud.cn/7ebf35399c844a05bcdcee1e8bcfe81c_image.png)

Figure 1: Manage common subquery in dbt

The concept of referencing isn’t limited to logic related to subqueries. You can also use referencing for logic related to fields.

The following is an example showing how dbt consolidates common logic related to fields. In this example, a model applies the same case statement on two fields. Instead of replicating the case statement for each field, dbt allows you to create a [macro](https://docs.getdbt.com/docs/building-a-dbt-project/jinja-macros#macros) containing the case statement and reference it later.

![image.png](https://dev-media.amazoncloud.cn/ede5677c18f1472f9cd4afe8dad59f8a_image.png)

Figure 2: Manage common case statement in dbt

How is a model in dbt subsequently created in Amazon Redshift? dbt provides you with the command [dbt run](https://docs.getdbt.com/reference/commands/run), which [materializes](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/materializations) models as views or tables in your targeted Amazon Redshift cluster. You can try this out in the [dbt CLI and Amazon Redshift workshop](https://catalog.workshops.aws/dbt-cli-and-amazon-redshift/en-US).

#### **Manage common data mappings**

Although you can use macros to manage data mappings (for example, mapping “1” to “One” and “2” to “Two”), an alternative is to maintain data mappings in files and manage the files in dbt.

The following is an example of how dbt manages common data mappings. In this example, a model applies one-to-one data mappings on a field.
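Written as a macro, the “1” to “One” mapping mentioned earlier might look like the following sketch; the macro name and argument are hypothetical:

```sql
-- macros/number_to_word.sql (hypothetical): a one-to-one mapping
-- expressed as a reusable case statement
{% macro number_to_word(field) %}
    case {{ field }}
        when '1' then 'One'
        when '2' then 'Two'
    end
{% endmacro %}
```

A model would then call `{{ number_to_word('my_column') }}` wherever the mapping is needed.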
Instead of creating a macro for the one-to-one data mappings, dbt allows you to create a [seed](https://docs.getdbt.com/docs/building-a-dbt-project/seeds) for the one-to-one data mappings in the form of a CSV file and then reference it later.

![image.png](https://dev-media.amazoncloud.cn/e6c63ac443f44aac867fb97cde9b2239_image.png)

Figure 3: Manage common data mapping in dbt

You can create or update a seed with a two-step process. After you create or update a CSV seed file, run the command [dbt seed](https://docs.getdbt.com/reference/commands/seed) to create the CSV seed as a table in your targeted Amazon Redshift cluster before referencing it.

#### **Manage data lineage documentation**

After you have created models and seeds in dbt, and used dbt’s referencing capability, dbt provides you with a method to generate documentation on your data transformations.

You can run the command [dbt docs generate](https://docs.getdbt.com/reference/commands/cmd-docs#dbt-docs-generate) followed by [dbt docs serve](https://docs.getdbt.com/reference/commands/cmd-docs#dbt-docs-serve) to launch a locally hosted website containing documentation on your dbt project. When you choose a model on the locally hosted website, information about the model is displayed, including columns in the final view or table, dependencies to create the model, and the SQL that is compiled to create the view or table. The following screenshot shows an example of this documentation.

![image.png](https://dev-media.amazoncloud.cn/b158d17072064edb8972bde6f31dbf36_image.png)

Figure 4: Documentation generated by dbt

You can also visualize dependencies for improved navigation of the documentation during impact analysis.
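The dependency graph is derived from how models reference one another. For example, the chain shown in the following figure could come from two models along these lines (sketches only; the actual model bodies are not part of this post):

```sql
-- models/base_public_users.sql: selects from the table public.users
select * from public.users

-- models/rpt_tech_all_users.sql: builds on the base model via ref()
select * from {{ ref('base_public_users') }}
```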
In the following example graph, we can see that the model `rpt_tech_all_users` is built by referencing the model `base_public_users`, which in turn references the table `users` in the `public` schema.

![image.png](https://dev-media.amazoncloud.cn/8eab10639271411fbf42ff60cb057f0b_image.png)

Figure 5: Data lineage visualization generated by dbt

#### **Conclusion**

This post covered how you can use dbt to manage data transformations in Amazon Redshift. As you explore dbt, you will come across other features like [hooks](https://docs.getdbt.com/docs/building-a-dbt-project/hooks-operations), which you can use to manage administrative tasks, for example, the continuous granting of privileges.

For a hands-on experience with dbt CLI and Amazon Redshift, we have a workshop with step-by-step instructions to help you create your first dbt project and explore the features mentioned in this post: models, macros, seeds, and hooks. Visit [dbt CLI and Amazon Redshift](https://catalog.workshops.aws/dbt-cli-and-amazon-redshift) to get started.

If you have any questions or suggestions, leave your feedback in the comments section. If you need any further assistance optimizing your Amazon Redshift implementation, contact your Amazon Web Services account team or a trusted Amazon Web Services partner.

#### **About the authors**

![image.png](https://dev-media.amazoncloud.cn/4d3fc32e795c4d69888fb71e83bf461d_image.png)

**Randy Chng** is an Analytics Acceleration Lab Solutions Architect at Amazon Web Services. He works with customers to accelerate their Amazon Redshift journey by delivering proof of concepts on key business problems.

![image.png](https://dev-media.amazoncloud.cn/d5b3bbffe1404879b94c507d4a50c159_image.png)

**Sean Beath** is an Analytics Acceleration Lab Solutions Architect at Amazon Web Services.
He delivers proof of concepts with customers on Amazon Redshift, helping customers drive analytics value on Amazon Web Services.