New — Amazon SageMaker Data Wrangler Supports SaaS Applications as Data Sources

Data fuels machine learning. In machine learning, data preparation is the process of transforming raw data into a format that is suitable for further processing and analysis. Data preparation typically starts with collecting data, then cleaning it, labeling it, and finally validating and visualizing it. Getting the data right, at high quality, can be a complex and time-consuming process.

This is why customers who build machine learning (ML) workloads on AWS appreciate the capabilities of [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) Data Wrangler. With SageMaker Data Wrangler, customers can simplify the process of data preparation and complete all the required steps of the data preparation workflow in a single visual interface. [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) Data Wrangler helps reduce the time it takes to aggregate and prepare data for ML.

However, due to the proliferation of data, customers generally have data spread across multiple systems, including external software-as-a-service (SaaS) applications such as SAP OData for manufacturing data, Salesforce for customer pipeline data, and Google Analytics for web application data. To solve business problems with ML, customers have to bring all of these data sources together. Today they have to build their own solutions, or use third-party solutions, to ingest the data into [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) or [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail).
These solutions can be complex to set up and are often not cost-effective.

### ++Introducing [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) Data Wrangler Support for SaaS Applications as Data Sources++

I’m happy to share that starting today, you can aggregate external SaaS application data in [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) Data Wrangler to prepare it for ML. With this feature, you can use more than 40 SaaS applications as data sources via [Amazon AppFlow](https://aws.amazon.com/cn/appflow/?trk=cndc-detail) and have this data available in [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) Data Wrangler. Once the data sources are registered in the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Data Catalog by AppFlow, you can browse their tables and schemas using the Data Wrangler SQL Explorer. This feature provides seamless data integration between SaaS applications and SageMaker Data Wrangler using [Amazon AppFlow](https://aws.amazon.com/cn/appflow/?trk=cndc-detail).

Here is a quick preview of this new feature:

![dwintro.gif](https://dev-media.amazoncloud.cn/fef0230364044bfa8e2d95ad4f1fd742_dw-intro.gif)

This new feature of [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) Data Wrangler works through an integration with [Amazon AppFlow](https://aws.amazon.com/cn/appflow/?trk=cndc-detail), a fully managed integration service that enables you to securely exchange data between SaaS applications and AWS services.
With [Amazon AppFlow](https://aws.amazon.com/cn/appflow/?trk=cndc-detail), you can establish bidirectional data integration between SaaS applications, such as Salesforce, SAP, and Amplitude, and AWS services such as [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) and [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail).

Then, with [Amazon AppFlow](https://aws.amazon.com/cn/appflow/?trk=cndc-detail), you can catalog the data in the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Data Catalog. This is possible because [Amazon AppFlow](https://aws.amazon.com/cn/appflow/?trk=cndc-detail) now lets you create an integration with the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Data Catalog for the [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) destination connector. With this new integration, customers can catalog SaaS application data in the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Data Catalog with a few clicks, directly from the [Amazon AppFlow](https://aws.amazon.com/cn/appflow/?trk=cndc-detail) flow configuration, without the need to run any crawlers.

Once you’ve established a flow and registered its output in the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Data Catalog, you can use this data inside [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) Data Wrangler and do your data preparation as you usually would. You can write [Amazon Athena](https://aws.amazon.com/cn/athena/?trk=cndc-detail) queries to preview data, join data from multiple sources, or import data to prepare for ML model training.

With this feature, only a few simple steps are needed to integrate data from SaaS applications into [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) Data Wrangler via [Amazon AppFlow](https://aws.amazon.com/cn/appflow/?trk=cndc-detail).
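The Athena preview queries mentioned above can also be issued programmatically. Here is a minimal, illustrative sketch using boto3; the database name matches the one used later in this walkthrough, while the table name and S3 output location are hypothetical:

```python
def build_preview_query(database: str, table: str, limit: int = 100) -> str:
    """Build a simple Athena SQL statement to preview rows from a table
    that AppFlow registered in the AWS Glue Data Catalog."""
    return f'SELECT * FROM "{database}"."{table}" LIMIT {limit}'

def run_preview(database: str, table: str, output_s3: str) -> str:
    """Start the preview query in Athena and return its execution ID.
    Requires AWS credentials and an existing cataloged table."""
    import boto3  # deferred import: only needed for the real AWS call
    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=build_preview_query(database, table),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return response["QueryExecutionId"]

# Example (table name is hypothetical):
sql = build_preview_query("appflowdatasourcedb", "salesforce_account")
print(sql)
```

Inside Data Wrangler, the SQL Explorer runs equivalent queries for you, so this is only useful if you want to script the same preview outside the console.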
This integration supports more than 40 SaaS applications; for the complete list, please check the [Supported source and destination applications](https://docs.aws.amazon.com/appflow/latest/userguide/app-specific.html) documentation.

### ++Get Started with [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) Data Wrangler Support for [Amazon AppFlow](https://aws.amazon.com/cn/appflow/?trk=cndc-detail)++

Let’s see how this feature works in detail. In my scenario, I need to get data from Salesforce and prepare it using [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) Data Wrangler.

To start using this feature, the first thing I need to do is create a flow in [Amazon AppFlow](https://aws.amazon.com/cn/appflow/?trk=cndc-detail) that registers the data source in the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Data Catalog. I already have an existing connection to my Salesforce account, and all I need now is to create a flow.

![image.png](https://dev-media.amazoncloud.cn/c949c8561b05425cac5fda2c3b14b4df_image.png)

One important thing to note is that to make SaaS application data available in [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) Data Wrangler, I need to create a flow with [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) as the destination. Then, I need to enable **Create a Data Catalog table** in the **[AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Data Catalog settings**.
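The same setting can also be expressed through the AppFlow `CreateFlow` API, where it corresponds to the `metadataCatalogConfig` parameter. Below is a hedged boto3 sketch of that portion only; the role ARN, database name, and table prefix are illustrative, and the source, destination, and task settings are deliberately left as placeholders:

```python
def glue_catalog_config(role_arn: str, database: str, table_prefix: str) -> dict:
    """Build the metadataCatalogConfig block that tells Amazon AppFlow to
    register the flow's S3 output in the AWS Glue Data Catalog (no crawler)."""
    return {
        "glueDataCatalog": {
            "roleArn": role_arn,          # role with Glue Data Catalog permissions
            "databaseName": database,     # Glue database to hold the new table
            "tablePrefix": table_prefix,  # prefix for the generated table name
        }
    }

def create_cataloged_flow(flow_name: str, catalog_config: dict) -> None:
    """Sketch only: pass the catalog block to CreateFlow. The source
    (e.g. Salesforce), the Amazon S3 destination, and the field-mapping
    tasks still need to be filled in before this call would succeed."""
    import boto3  # deferred import: only needed for the real AWS call
    appflow = boto3.client("appflow")
    appflow.create_flow(
        flowName=flow_name,
        triggerConfig={"triggerType": "OnDemand"},
        sourceFlowConfig={},           # placeholder: Salesforce source settings
        destinationFlowConfigList=[],  # placeholder: Amazon S3 destination
        tasks=[],                      # placeholder: field mappings
        metadataCatalogConfig=catalog_config,
    )

# Illustrative values only:
config = glue_catalog_config(
    "arn:aws:iam::123456789012:role/appflow-glue-role",
    "appflowdatasourcedb",
    "sf_",
)
```

Whether you use the console or the API, the result is the same: the flow output lands in Amazon S3 and a matching table appears in the Glue database you chose.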
The **Create a Data Catalog table** option automatically catalogs my Salesforce data in the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Data Catalog.

![image.png](https://dev-media.amazoncloud.cn/600e85352b234235956e569b2916cbd5_image.png)

On this page, I need to select a user role with the required [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Data Catalog permissions and define the **database name** and the **table name prefix**. In this section, I can also define the **data format preference** (JSON, CSV, or Apache Parquet) and the **filename preference**, if I want to add a timestamp to the file name.

![image.png](https://dev-media.amazoncloud.cn/6e4ebd0d2fe04b92ad8f455665756472_image.png)

To learn more about how to register SaaS data in [Amazon AppFlow](https://aws.amazon.com/cn/appflow/?trk=cndc-detail) and the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Data Catalog, you can read the **[Cataloging the data output from an Amazon AppFlow flow](https://docs.aws.amazon.com/appflow/latest/userguide/flows-catalog.html)** documentation page.

Once I’ve finished registering the SaaS data, I need to make sure the IAM role can view the data sources in Data Wrangler from AppFlow.
Here is an example of a policy in the IAM role:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "glue:SearchTables",
            "Resource": [
                "arn:aws:glue:*:*:table/*/*",
                "arn:aws:glue:*:*:database/*",
                "arn:aws:glue:*:*:catalog"
            ]
        }
    ]
}
```

With data cataloging enabled in the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Data Catalog, [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) Data Wrangler can automatically discover this new data source, and I can browse tables and schemas using the Data Wrangler SQL Explorer.

Now it’s time to switch to the [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) Data Wrangler dashboard and select **Connect to data sources**.

![image.png](https://dev-media.amazoncloud.cn/857a7a0804ed43db99cdbec239becc8e_image.png)

On the following page, I need to select **Create connection** and choose the data source I want to import. In this section, I can see all the connections available to me. Here, the Salesforce connection is already available for me to use.

![image.png](https://dev-media.amazoncloud.cn/468244649fc54d48a4210aa56431e4a7_image.png)

If I would like to add additional data sources, I can see the list of external SaaS applications that I can integrate in the **Set up new data sources** section. To learn how to enable access to external SaaS applications as data sources, I can select **How to enable access**.

![image.png](https://dev-media.amazoncloud.cn/7639b494966e462dbf37177184c0316c_image.png)

Now I will import datasets and select the Salesforce connection.

![image.png](https://dev-media.amazoncloud.cn/58caf4e4861e456dbe645e61b1bf1d2c_image.png)

On the next page, I can define connection settings and import data from Salesforce.
When I’m done with this configuration, I select **Connect**.

![image.png](https://dev-media.amazoncloud.cn/e1eab54ea5594a529aa9051f0528d9c2_image.png)

On the following page, I see the Salesforce data that I configured earlier with [Amazon AppFlow](https://aws.amazon.com/cn/appflow/?trk=cndc-detail) and the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Data Catalog, in a database called `appflowdatasourcedb`. I can also see a **table preview** and **schema** to verify that this is the data I need.

![image.png](https://dev-media.amazoncloud.cn/19215ef9c0184c438dd6911286ebd38f_image.png)

Then, I start building my dataset from this data by running SQL queries inside the SageMaker Data Wrangler SQL Explorer, and I select **Import query**.

![image.png](https://dev-media.amazoncloud.cn/1bf3482e70604b418a790599bc3d6821_image.png)

Next, I define a name for my dataset.

![image.png](https://dev-media.amazoncloud.cn/1038931e12454bdaa0ffdbd256a26504_image.png)

At this point, I can start the data preparation process. I can navigate to the **Analysis** tab to run the data insight report. The analysis provides a report on data quality issues and on which transforms to apply next to fix them, based on the ML problem I want to predict. To learn more about how to use the data analysis feature, see the [Accelerate data preparation with data quality and insights in Amazon SageMaker Data Wrangler](https://aws.amazon.com/blogs/machine-learning/accelerate-data-preparation-with-data-quality-and-insights-in-amazon-sagemaker-data-wrangler/) blog post.

In my case, there are several columns I don’t need, so I will drop them. I select **Add step**.

![image.png](https://dev-media.amazoncloud.cn/6714bf93685046828bcabff38199d080_image.png)

One feature I like is that [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) Data Wrangler provides numerous ML data transforms.
It helps me streamline the process of cleaning, transforming, and feature engineering my data in one dashboard. For more about the transforms SageMaker Data Wrangler provides, please read the [Transform Data](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-transform.html) documentation page.

In this list, I select **Manage columns**.

![image.png](https://dev-media.amazoncloud.cn/028b33591265457a9f2892bc6cf03ccd_image.png)

Then, in the **Transform** section, I select the **Drop column** option and choose the columns I don’t need.

![image.png](https://dev-media.amazoncloud.cn/d088bea0f5814dfbbffe4e177d98d092_image.png)

Once I’m done, the unneeded columns are removed, and the **Drop column** data preparation step I just created is listed in the **Add step** section.

![image.png](https://dev-media.amazoncloud.cn/6bd7d4cd873c40bebd3e9963e1747d25_image.png)

I can also see a visual of my **data flow** inside [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) Data Wrangler. In this example, my data flow is quite basic, but when a data preparation process becomes complex, this visual view makes it easy to see all the steps.

![image.png](https://dev-media.amazoncloud.cn/47f3a6631f4c4231ac194c64cff3845e_image.png)

From this point on, I can do whatever I need with my Salesforce data. For example, I can export the data directly to [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) by selecting **Export to** and choosing **[Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail)** from the **Add destination** menu.
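If you later want to reproduce a simple prepare-then-export sequence like this outside Data Wrangler, the equivalent in plain pandas is short. The column names below are made up for illustration; the output is written to a local CSV file here, though pandas also accepts an `s3://` path when `s3fs` is installed:

```python
import pandas as pd

# Toy stand-in for the imported Salesforce dataset (column names are made up).
df = pd.DataFrame({
    "account_id": [1, 2, 3],
    "account_name": ["Acme", "Globex", "Initech"],
    "internal_note": ["a", "b", "c"],  # not needed for training
    "sync_token": ["x", "y", "z"],     # not needed for training
})

# Equivalent of the "Drop column" transform step.
prepared = df.drop(columns=["internal_note", "sync_token"])

# Equivalent of the export-to-Amazon-S3 step; a local path keeps the
# sketch self-contained (an s3:// URI works with s3fs installed).
prepared.to_csv("prepared_salesforce.csv", index=False)

print(list(prepared.columns))  # -> ['account_id', 'account_name']
```

Data Wrangler records the same steps visually in the data flow, which is what makes them repeatable on new data without rewriting code.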
In my case, I have Data Wrangler store the processed data in [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) by selecting **Add destination** and then **[Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail)**.

![image.png](https://dev-media.amazoncloud.cn/e4935180bda9423a9532aa8a63c2a276_image.png)

[Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) Data Wrangler also gives me the flexibility to automate the same data preparation flow using [scheduled jobs](https://aws.amazon.com/blogs/machine-learning/get-more-control-of-your-amazon-sagemaker-data-wrangler-workloads-with-parameterized-datasets-and-scheduled-jobs/). I can also automate [feature engineering](https://aws.amazon.com/blogs/machine-learning/automate-feature-engineering-pipelines-with-amazon-sagemaker/) with **SageMaker Pipelines (via Jupyter Notebook)** and **SageMaker Feature Store (via Jupyter Notebook)**, and deploy to an inference endpoint with **SageMaker Inference Pipeline (via Jupyter Notebook)**.

![image.png](https://dev-media.amazoncloud.cn/df60f71bc6c241448052f57f0d505f0f_image.png)

### ++Things to Know++

**Related news** – This feature makes it easy for you to aggregate and prepare data with [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) Data Wrangler.
As this feature is an integration with [Amazon AppFlow](https://aws.amazon.com/cn/appflow/?trk=cndc-detail) and the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Data Catalog, you may also want to read the [Amazon AppFlow now supports AWS Glue Data Catalog integration and provides enhanced data preparation](https://aws.amazon.com/about-aws/whats-new/2022/11/amazon-appflow-aws-glue-data-catalog-integration-provides-enhanced-data-preparation/) page.

**Availability** – [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) Data Wrangler support for SaaS applications as data sources is available in all Regions currently supported by [Amazon AppFlow](https://aws.amazon.com/cn/appflow/?trk=cndc-detail).

**Pricing** – There is no additional cost to use SaaS application support in [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) Data Wrangler, but there is a cost for running [Amazon AppFlow](https://aws.amazon.com/cn/appflow/?trk=cndc-detail) to get the data into [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) Data Wrangler.

Visit the [Import Data From Software as a Service (SaaS) Platforms](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-import.html#data-wrangler-import-saas) documentation page to learn more about this feature, and follow the getting started guide to start aggregating and preparing SaaS application data with [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) Data Wrangler.

Happy building!
— [Donnie](https://donnie.id/)

![89d6442457c51c067e7be46d60afd82.png](https://dev-media.amazoncloud.cn/7add2715fc22456497f1768569e0233f_89d6442457c51c067e7be46d60afd82.png)

### Donnie Prakoso

Donnie Prakoso is a software engineer, self-proclaimed barista, and Principal Developer Advocate at AWS. He has more than 17 years of experience in the technology industry, from telecommunications and banking to startups.
He is now focused on helping developers understand a variety of technologies so they can turn their ideas into execution. He loves coffee and discussing any topic from microservices to AI/ML.