Build and train ML models using a data mesh architecture on AWS: Part 1

{"value":"Organizations across various industries are using artificial intelligence (AI) and machine learning (ML) to solve business challenges specific to their industry. For example, in the financial services industry, you can use AI and ML to solve challenges around fraud detection, credit risk prediction, direct marketing, and many others.\n\nLarge enterprises sometimes set up a center of excellence (CoE) to tackle the needs of different lines of business (LoBs) with innovative analytics and ML projects.\n\nTo generate high-quality and performant ML models at scale, they need to do the following:\n\n1. Provide an easy way to access relevant data to their analytics and ML CoE\n2. Create accountability on data providers from individual LoBs to share curated data assets that are discoverable, understandable, interoperable, and trustworthy\n\nThis can reduce the long cycle time for converting ML use cases from experiment to production and generate business value across the organization.\n\nA data mesh architecture strives to solve these technical and organizational challenges by introducing a decentralized socio-technical approach to share, access, and manage data in complex and large-scale environments—within or across organizations. The data mesh design pattern creates a responsible data-sharing model that aligns with the organizational growth to achieve the ultimate goal of increasing the return of business investments in the data teams, process, and technology.\n\nIn this two-part series, we provide guidance on how organizations can build a modern data architecture using a data mesh design pattern on AWS and enable an analytics and ML CoE to build and train ML models with data across multiple LoBs. We use an example of a financial service organization to set the context and the use case for this series.\n\n### **Build and train ML models using a data mesh architecture on AWS:**\n\n- Part 1: Data mesh set up and Data product registration\n- [Part 2: Data product consumption by Analytics and ML CoE](https://aws.amazon.com/blogs/machine-learning/part-2-build-and-train-ml-models-using-a-data-mesh-architecture-on-aws/)\n\nIn this first post, we show the procedures of setting up a data mesh architecture with multiple AWS data producer and consumer accounts. Then we focus on one data product, which is owned by one LoB within the financial organization, and how it can be shared into a data mesh environment to allow other LoBs to consume and use this data product. This is mainly targeting the data steward persona, who is responsible for streamlining and standardizing the process of sharing data between data producers and consumers and ensuring compliance with data governance rules.\n\nIn the second post, we show one example of how an analytics and ML CoE can consume the data product for a risk prediction use case. 
That post mainly targets the data scientist persona, who is responsible for using both organization-wide and third-party data assets to build and train ML models that extract business insights and enhance the experience of financial services customers.

### **Data mesh overview**

The founder of the data mesh pattern, Zhamak Dehghani, defined four principles towards the objective of the data mesh in her book [Data Mesh: Delivering Data-Driven Value at Scale](https://www.oreilly.com/library/view/data-mesh/9781492092384/):

- **Distributed domain ownership** – To pursue an organizational shift from centralized ownership of data by specialists who run the data platform technologies to a decentralized data ownership model, pushing ownership and accountability of the data back to the LoBs where data is produced (source-aligned domains) or consumed (consumption-aligned domains).
- **Data as a product** – To push upstream the accountability of sharing curated, high-quality, interoperable, and secure data assets. Therefore, data producers from different LoBs are responsible for making data available in a consumable form right at the source.
- **Self-service analytics** – To streamline the experience of analytics and ML data users so they can discover, access, and use data products with their preferred tools. Additionally, to streamline the experience of LoB data providers to build, deploy, and maintain data products via recipes and reusable components and templates.
- **Federated computational governance** – To federate and automate the decision-making involved in managing and controlling data access so that it sits with the data owners from the different LoBs, while remaining in line with the wider organization’s legal, compliance, and security policies, which are ultimately enforced through the mesh.

AWS introduced its vision for building a data mesh on top of AWS in various posts:

- First, we focused on the organizational part associated with the distributed domain ownership and data as a product principles. The authors described the vision of aligning multiple LoBs across the organization towards a data product strategy that provides the consumption-aligned domains with tools to find and obtain the data they need, while guaranteeing the necessary control around the use of that data by introducing accountability for the source-aligned domains to provide data products ready to be used right at the source. For more information, refer to [How JPMorgan Chase built a data mesh architecture to drive significant value to enhance their enterprise data platform](https://aws.amazon.com/blogs/big-data/how-jpmorgan-chase-built-a-data-mesh-architecture-to-drive-significant-value-to-enhance-their-enterprise-data-platform/).
- Then we focused on the technical part associated with the building data products, self-service analytics, and federated computational governance principles. The authors described the core AWS services that empower source-aligned domains to build and share data products, the wide variety of services that enable consumption-aligned domains to consume data products in different ways based on their preferred tools and the use cases they are working towards, and finally the AWS services that govern the data sharing procedure by enforcing data access policies.
For more information, refer to [Design a data mesh architecture using AWS Lake Formation and AWS Glue](https://aws.amazon.com/blogs/big-data/design-a-data-mesh-architecture-using-aws-lake-formation-and-aws-glue/).
- We also showed a solution to automate data discovery and access control through a centralized data mesh UI. For more details, refer to [Build a data sharing workflow with AWS Lake Formation for your data mesh](https://aws.amazon.com/blogs/big-data/build-a-data-sharing-workflow-with-aws-lake-formation-for-your-data-mesh/).

### **Financial services use case**

Typically, large financial services organizations have multiple LoBs, such as consumer banking, investment banking, and asset management, as well as one or more analytics and ML CoE teams. Each LoB provides different services:

- The consumer banking LoB provides a variety of services to consumers and businesses, including credit and mortgage, cash management, payment solutions, deposit and investment products, and more
- The commercial or investment banking LoB offers comprehensive financial solutions, such as lending, bankruptcy risk, and wholesale payments to clients, including small businesses, mid-sized companies, and large corporations
- The asset management LoB provides retirement products and investment services across all asset classes

Each LoB defines its own data products, which are curated by people who understand the data and are best suited to specify who is authorized to use it and how it can be used. In contrast, other LoBs and application domains such as the analytics and ML CoE are interested in discovering and consuming qualified data products, blending them together to generate insights, and making data-driven decisions.

The following illustration depicts some LoBs and examples of data products that they can share. It also shows the consumers of data products, such as the analytics and ML CoE, who build ML models that can be deployed to customer-facing applications to further enhance the end-customer’s experience.

![image.png](https://dev-media.amazoncloud.cn/279f596fe7474f24b1800a69e8c7f116_image.png)

Following the data mesh socio-technical concept, we start with the social aspect, which involves a set of organizational steps, such as the following:

- Utilizing domain experts to define boundaries for each domain, so each data product can be mapped to a specific domain
- Identifying owners for the data products provided from each domain, so each data product has a strategy defined by its owner
- Identifying governance policies from global and local or federated incentives, so when data consumers access a specific data product, the access policy associated with the product can be automatically enforced through a central data governance layer

Then we move to the technical aspect, which includes the following end-to-end scenario defined in the preceding diagram:

1. Empower the consumer banking LoB with tools to build a ready-to-use consumer credit profile data product.
2. Allow the consumer banking LoB to share data products into the central governance layer.
3. Embed global and federated definitions of data access policies that should be enforced while accessing the consumer credit profile data product through the central data governance layer.
4. Allow the analytics and ML CoE to discover and access the data product through the central governance layer.
5. Empower the analytics and ML CoE with tools to utilize the data product for building and training a credit risk prediction model.
We don’t cover the final steps (6 and 7 in the preceding diagram) in this series. However, to show the business value such an ML model can bring to the organization in an end-to-end scenario, we illustrate the following:
6. This model could later be deployed back to customer-facing systems such as a consumer banking web portal or mobile application.
7. It can be specifically used within the loan application to assess the risk profile of credit and mortgage requests.

Next, we describe the technical needs of each of the components.

### **Deep dive into technical needs**

To make data products available for everyone, organizations need to make it easy to share data between different entities across the organization while maintaining appropriate control over it, or in other words, to balance agility with proper governance.

#### **Data consumer: Analytics and ML CoE**

The data consumers, such as data scientists from the analytics and ML CoE, need to be able to do the following:

- Discover and access relevant datasets for a given use case
- Be confident that the datasets they want to access are already curated, up to date, and have robust descriptions
- Request access to datasets of interest for their business cases
- Use their preferred tools to query and process such datasets within their environment for ML, without the need to replicate data from the original remote location or to worry about the engineering or infrastructure complexities associated with processing data physically stored in a remote site
- Get notified of any data updates made by the data owners

#### **Data producer: Domain ownership**

The data producers, such as domain teams from different LoBs in the financial services organization, need to register and share curated datasets that contain the following:

- Technical and operational metadata, such as database and table names and sizes, column schemas, and keys
- Business metadata, such as data description, classification, and sensitivity
- Tracking metadata, such as schema evolution from the source to the target form and any intermediate forms
- Data quality metadata, such as correctness and completeness ratios and data bias
- Access policies and procedures

These are needed to allow data consumers to discover and access data without relying on manual procedures or having to contact the data product’s domain experts to gain more knowledge about the meaning of the data and how it can be accessed.
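As an illustration only, one possible way to keep such metadata alongside a data product is to record it as AWS Glue Data Catalog table parameters. The following minimal boto3 sketch shows that idea; the database name, table name, and metadata values are hypothetical placeholders (they mirror the names used later in this post) and this snippet is not part of the CloudFormation templates or the step-by-step guide.

```python
import boto3

glue = boto3.client("glue")  # run with credentials for the account that owns the catalog entry

# Placeholder names; adjust to your own Data Catalog.
DATABASE_NAME = "credit-card"
TABLE_NAME = "credit_card"

# Hypothetical business and data quality metadata for the data product.
PRODUCT_METADATA = {
    "data_owner": "consumer-banking-data-team",
    "description": "Curated consumer credit profiles, refreshed daily",
    "classification": "confidential",
    "completeness_ratio": "0.998",
}

# Read the current table definition, merge the metadata into the table
# parameters, and write the definition back so consumers see it in the catalog.
table = glue.get_table(DatabaseName=DATABASE_NAME, Name=TABLE_NAME)["Table"]
table_input = {
    "Name": table["Name"],
    "StorageDescriptor": table["StorageDescriptor"],
    "PartitionKeys": table.get("PartitionKeys", []),
    "TableType": table.get("TableType", "EXTERNAL_TABLE"),
    "Parameters": {**table.get("Parameters", {}), **PRODUCT_METADATA},
}
glue.update_table(DatabaseName=DATABASE_NAME, TableInput=table_input)
```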
#### **Data governance: Discoverability, accessibility, and auditability**

Organizations need to balance the agility illustrated earlier with proper mitigation of the risks associated with data leaks. Particularly in regulated industries like financial services, there is a need to maintain central data governance that provides overall data access and audit control, while reducing the storage footprint by avoiding multiple copies of the same data across different locations.

In traditional centralized data lake architectures, the data producers often publish raw data and pass on the responsibility of data curation, data quality management, and access control to data and infrastructure engineers in a centralized data platform team. However, these data platform teams may be less familiar with the various data domains, and still rely on support from the data producers to properly curate and govern access to data according to the policies enforced in each data domain. In contrast, the data producers themselves are best positioned to provide curated, qualified data assets, and are aware of the domain-specific access policies that need to be enforced while accessing those data assets.

### **Solution overview**

The following diagram shows the high-level architecture of the proposed solution.

![image.png](https://dev-media.amazoncloud.cn/6b6df397a12a4a39a4fd5441366ebd29_image.png)

We address data consumption by the analytics and ML CoE with [Amazon Athena](https://aws.amazon.com/athena/) and [Amazon SageMaker](https://aws.amazon.com/sagemaker/) in [part 2](https://aws.amazon.com/blogs/machine-learning/part-2-build-and-train-ml-models-using-a-data-mesh-architecture-on-aws/) of this series.

In this post, we focus on the data onboarding process into the data mesh and describe how an individual LoB, such as the consumer banking domain data team, can use AWS tools such as [AWS Glue](https://aws.amazon.com/glue/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc) and [AWS Glue DataBrew](https://aws.amazon.com/glue/features/databrew/) to prepare, curate, and enhance the quality of their data products, and then register those data products into the central data governance account through [AWS Lake Formation](https://aws.amazon.com/lake-formation/).

#### **Consumer banking LoB (data producer)**

One of the core principles of data mesh is the concept of data as a product. It’s very important that the consumer banking domain data team works on preparing data products that are ready for use by data consumers. This can be done by using AWS extract, transform, and load (ETL) tools like AWS Glue to process raw data collected on [Amazon Simple Storage Service](http://aws.amazon.com/s3) (Amazon S3), or alternatively by connecting to the operational data stores where the data is produced. You can also use [DataBrew](https://aws.amazon.com/glue/features/databrew/), which is a no-code visual data preparation tool that makes it easy to clean and normalize data.

For example, while preparing the consumer credit profile data product, the consumer banking domain data team can perform a simple curation step to translate the attribute names of the raw data from German to English. The raw data is retrieved from the open-source dataset [Statlog German credit data](https://archive.ics.uci.edu/ml/datasets/South+German+Credit+%28UPDATE%29), which consists of 20 attributes and 1,000 rows.

![image.png](https://dev-media.amazoncloud.cn/72c6268255564095b1b44e42019e7484_image.png)
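Such a curation step can be built as a DataBrew recipe or a small AWS Glue job. The following minimal pandas sketch only illustrates the idea; the S3 prefixes and file name are hypothetical, and the rename map shows just an illustrative subset of the German-to-English attribute mapping rather than the full 20 attributes.

```python
import pandas as pd

# Hypothetical locations; substitute the producer account bucket and prefixes.
RAW_PATH = "s3://credit-card-<ProducerAccountID>-<aws-region>/raw/german_credit.csv"
CURATED_PATH = "s3://credit-card-<ProducerAccountID>-<aws-region>/curated/credit_card.parquet"

# Illustrative subset of the German-to-English attribute name mapping.
RENAME_MAP = {
    "laufkont": "status",
    "laufzeit": "duration",
    "moral": "credit_history",
    "hoehe": "amount",
    "alter": "age",
    "kredit": "credit_risk",
}

# Read the raw file, translate the attribute names, and write a curated copy
# that an AWS Glue crawler can later register as the credit_card table.
raw = pd.read_csv(RAW_PATH)
curated = raw.rename(columns=RENAME_MAP)
curated.to_parquet(CURATED_PATH, index=False)
```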
#### **Data governance**

The core AWS service for enabling data mesh governance is Lake Formation. Lake Formation offers the ability to enforce data governance within each data domain and across domains to ensure data is easily discoverable and secure. It provides a federated security model that can be administered centrally, with best practices for data discovery, security, and compliance, while allowing high agility within each domain.

Lake Formation offers an API to simplify how data is ingested, stored, and managed, together with row-level security to protect your data. It also provides functionality like granular access control, governed tables, and storage optimization.

In addition, Lake Formation offers a [Data Sharing API](https://docs.aws.amazon.com/lake-formation/latest/dg/sharing-catalog-resources.html) that you can use to share data [across different accounts](https://docs.aws.amazon.com/lake-formation/latest/dg/crosss-account-how-works.html). This allows the analytics and ML CoE consumer to run Athena queries that query and join tables across multiple accounts. For more information, refer to the [AWS Lake Formation Developer Guide](https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html).

[AWS Resource Access Manager](https://aws.amazon.com/ram/) (AWS RAM) provides a secure way to share resources via [AWS Identity and Access Management](https://aws.amazon.com/iam/) (IAM) roles and users across AWS accounts within an organization or organizational units (OUs) in [AWS Organizations](https://aws.amazon.com/organizations/).

Lake Formation together with AWS RAM provides one way to manage data sharing and access across AWS accounts. We refer to this approach as [RAM-based access control](https://docs.aws.amazon.com/lake-formation/latest/dg/crosss-account-how-works.html). For more details about this approach, refer to [Build a data sharing workflow with AWS Lake Formation for your data mesh](https://aws.amazon.com/blogs/big-data/build-a-modern-data-architecture-and-data-mesh-pattern-at-scale-using-aws-lake-formation-tag-based-access-control/).

Throughout this post, we use the tag-based access control approach, because it simplifies the creation of policies on a smaller number of logical tags that are commonly found across different LoBs, instead of specifying policies on named resources at the infrastructure level.
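To make the tag-based model concrete, the following minimal boto3 sketch shows how a data steward in the central governance account could define an LF-tag, attach it to a shared database, and grant a consumer account access through a tag expression. The tag key, tag values, and account ID are illustrative assumptions; in this series, the equivalent tagging and grants are performed through the step-by-step guide referenced later in this post.

```python
import boto3

lf = boto3.client("lakeformation")   # run in the central governance account

TAG_KEY = "LoB"                       # hypothetical tag key
TAG_VALUES = ["consumer-banking"]     # hypothetical tag value
CONSUMER_ACCOUNT_ID = "111122223333"  # placeholder analytics and ML CoE account

# 1. Define the LF-tag once in the central governance account.
lf.create_lf_tag(TagKey=TAG_KEY, TagValues=TAG_VALUES)

# 2. Attach the tag to the data product database so its tables inherit it.
lf.add_lf_tags_to_resource(
    Resource={"Database": {"Name": "credit-card"}},
    LFTags=[{"TagKey": TAG_KEY, "TagValues": TAG_VALUES}],
)

# 3. Grant the consumer account permissions on every table carrying that tag,
#    with the grant option so its admin can re-grant to local users and roles.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": CONSUMER_ACCOUNT_ID},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": TAG_KEY, "TagValues": TAG_VALUES}],
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
    PermissionsWithGrantOption=["SELECT", "DESCRIBE"],
)
```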
### **Prerequisites**

To set up a data mesh architecture, you need at least three AWS accounts: a producer account, a central account, and a consumer account.

### **Deploy the data mesh environment**

To deploy a data mesh environment, you can use the following [GitHub repository](https://github.com/aws-samples/amazon-sagemaker-lakeformation-datamesh). This repository contains three [AWS CloudFormation](http://aws.amazon.com/cloudformation) templates that deploy the data mesh environment across the three accounts (producer, central, and consumer). Within each account, you run its corresponding CloudFormation template.

#### **Central account**

1. Launch the CloudFormation stack:
   ![image.png](https://dev-media.amazoncloud.cn/9af2d0ca687b4348a67f431bd822fc9a_image.png)
2. Create two IAM users:
   a. `DataMeshOwner`
   b. `ProducerSteward`
3. Set `DataMeshOwner` as the Lake Formation admin.
4. Create one IAM role:
   a. `LFRegisterLocationServiceRole`
5. Create two IAM policies:
   a. `ProducerStewardPolicy`
   b. `S3DataLakePolicy`
6. Create the database `credit-card` for `ProducerSteward` in the producer account.
7. Share the data location permission to the producer account.

#### **Producer account**

In the producer account, complete the following steps:

1. Launch the CloudFormation stack:
   ![image.png](https://dev-media.amazoncloud.cn/6774279564424ca2b2bb5151af2d98a4_image.png)
2. Create the S3 bucket `credit-card`, which holds the table `credit_card`.
3. Allow S3 bucket access for the central account Lake Formation service role.
4. Create the AWS Glue crawler `creditCrawler-<ProducerAccountID>`.
5. Create an AWS Glue crawler service role.
6. Grant permissions on the S3 bucket location `credit-card-<ProducerAccountID>-<aws-region>` to the AWS Glue crawler role.
7. Create a producer steward IAM user.

#### **Consumer account**

In the consumer account, complete the following steps:

1. Launch the CloudFormation stack:
   ![image.png](https://dev-media.amazoncloud.cn/3f36413a3c364543a0887fb270910b03_image.png)
2. Create the S3 bucket `<AWS Account ID>-<aws-region>-athena-logs`.
3. Create the Athena workgroup `consumer-workgroup`.
4. Create the IAM user `ConsumerAdmin`.

### **Add a database and subscribe the consumer account to it**

After you run the templates, you can go through the [step-by-step guide](https://github.com/aws-samples/amazon-sagemaker-lakeformation-datamesh/tree/main/Data%20mesh%20step-by-step%20guide) to add a product to the data catalog and have the consumer subscribe to it. The guide starts by setting up a database where the producer can place its products, and then explains how the consumer can subscribe to that database and access the data. All of this is performed using [LF-tags](https://docs.aws.amazon.com/lake-formation/latest/dg/TBAC-creating-tags.html), the [tag-based access control](https://docs.aws.amazon.com/lake-formation/latest/dg/tag-based-access-control.html) mechanism for Lake Formation.

### **Data product registration**

The following architecture describes the detailed steps of how the consumer banking LoB team, acting as a data producer, can register their data products in the central data governance account (onboard data products to the organization's data mesh).

![image.png](https://dev-media.amazoncloud.cn/8b3539d36acd4c5e9aa58d781ecc42d6_image.png)

The general steps to register a data product are as follows:

1. Create a target database for the data product in the central governance account. As an example, the CloudFormation template from the central account already creates the target database `credit-card`.
2. Share the created target database with the origin in the producer account.
3. Create a resource link of the shared database in the producer account. In the following screenshot, we see on the Lake Formation console in the producer account that `rl_credit-card` is the resource link of the `credit-card` database.

![image.png](https://dev-media.amazoncloud.cn/18a4a4b50d0149ceada73b35e24be47b_image.png)

4. Populate tables (with the data curated in the producer account) inside the resource link database (`rl_credit-card`) using an AWS Glue crawler in the producer account.

![image.png](https://dev-media.amazoncloud.cn/fa158fa6a71242d687bd3d883301eb89_image.png)
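Steps 3 and 4 can be performed on the Lake Formation and AWS Glue consoles, as shown in the screenshots, or scripted. The following minimal boto3 sketch, assuming placeholder account IDs and the resource names used in this post, shows one way to create the resource link and run the crawler from the producer account.

```python
import boto3

glue = boto3.client("glue")  # run with producer account credentials

CENTRAL_ACCOUNT_ID = "123456789012"   # placeholder central governance account ID
PRODUCER_ACCOUNT_ID = "444455556666"  # placeholder producer account ID

# Step 3: create a resource link in the producer account that points to the
# credit-card database shared from the central governance account.
glue.create_database(
    DatabaseInput={
        "Name": "rl_credit-card",
        "TargetDatabase": {
            "CatalogId": CENTRAL_ACCOUNT_ID,
            "DatabaseName": "credit-card",
        },
    }
)

# Step 4: run the crawler created by the producer CloudFormation template so it
# registers the curated credit_card table inside the resource link database.
glue.start_crawler(Name=f"creditCrawler-{PRODUCER_ACCOUNT_ID}")
```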
The created table automatically appears in the central governance account. The following screenshot shows an example of the table in Lake Formation in the central account, after performing the earlier steps to populate the resource link database `rl_credit-card` in the producer account.

![image.png](https://dev-media.amazoncloud.cn/c692f9fff2ec48119ee032bcf3044669_image.png)

### **Conclusion**

In part 1 of this series, we discussed the goals of financial services organizations to achieve more agility for their analytics and ML teams and to reduce the time from data to insights. We also focused on building a data mesh architecture on AWS, and introduced easy-to-use, scalable, and cost-effective AWS services such as AWS Glue, DataBrew, and Lake Formation. Data producing teams can use these services to build and share curated, high-quality, interoperable, and secure data products that are ready to use by different data consumers for analytical purposes.

In [part 2](https://aws.amazon.com/blogs/machine-learning/part-2-build-and-train-ml-models-using-a-data-mesh-architecture-on-aws/), we focus on the analytics and ML CoE teams who consume data products shared by the consumer banking LoB to build a credit risk prediction model using AWS services such as Athena and SageMaker.

### **About the authors**

![image.png](https://dev-media.amazoncloud.cn/de25110a90d84ac59d6aa9adb312104a_image.png)

**Karim Hammouda** is a Specialist Solutions Architect for Analytics at AWS with a passion for data integration, data analysis, and BI. He works with AWS customers to design and build analytics solutions that contribute to their business growth. In his free time, he likes to watch TV documentaries and play video games with his son.

![image.png](https://dev-media.amazoncloud.cn/3508f5eef2a0407e9bf7da296a1a0f9f_image.png)

**Hasan Poonawala** is a Senior AI/ML Specialist Solutions Architect at AWS. He helps customers design and deploy machine learning applications in production on AWS. He has over 12 years of work experience as a data scientist, machine learning practitioner, and software developer. In his spare time, Hasan loves to explore nature and spend time with friends and family.

![image.png](https://dev-media.amazoncloud.cn/1cc754e9c0a94e3594d61fa914b64bdc_image.png)

**Benoit de Patoul** is an AI/ML Specialist Solutions Architect at AWS. He helps customers by providing guidance and technical assistance to build solutions related to AI/ML using AWS. In his free time, he likes to play piano and spend time with friends.