Next Generation SageMaker Notebooks – Now with Built-in Data Preparation, Real-Time Collaboration, and Notebook Automation

{"value":"In 2019, we [introduced Amazon SageMaker Studio](https://aws.amazon.com/blogs/aws/amazon-sagemaker-studio-the-first-fully-integrated-development-environment-for-machine-learning/), the first fully integrated development environment (IDE) for data science and machine learning (ML). SageMaker Studio gives you access to fully managed [Jupyter](https://jupyter.org/) Notebooks that integrate with purpose-built tools to perform all ML steps, from preparing data to training and debugging models, tracking experiments, deploying and monitoring models, and managing pipelines.\n\n\nToday, I’m excited to announce **the next generation of [Amazon SageMaker Notebooks](https://aws.amazon.com/sagemaker/notebooks/)** to increase efficiency across the ML development workflow. You can now improve data quality in minutes with the built-in data preparation capability, edit the same notebooks with your teams in real time, and automatically convert notebook code to production-ready jobs.\n\nLet me show you what’s new!\n\n### ++New Notebook Capability for Simplified Data Preparation++\n\nThe new built-in data preparation capability is powered by Amazon SageMaker Data Wrangler and is available in SageMaker Studio notebooks. SageMaker Studio notebooks automatically generate key visualizations on top of [pandas DataFrames](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) to help you understand data distribution and identify data quality issues, like missing values, invalid data, and outliers. You can also select the target column for ML models and generate ML-specific insights such as class imbalance or highly correlated columns. You then receive recommendations for data transformations to resolve the issues. 
You can apply the data transformations right in the UI, and SageMaker Studio notebooks automatically generate the corresponding transformation code in notebook cells, which you can use to replay your data preparation pipeline.\n\n### ++Using the Built-in Data Preparation Capability++\nTo get started, pip install and import ```sagemaker_datawrangler``` along with the ```pandas``` Python package. Then, download the dataset you want to analyze to the notebook working directory, and read the dataset with pandas.\n\n```\\nimport pandas as pd \\nimport sagemaker_datawrangler \\n\\n!aws s3 cp s3://<YOUR_S3_BUCKET>/data.csv . \\n\\ndf = pd.read_csv(\\"data.csv\\")\\n```\n\nNow, when you display the data frame, it automatically shows key data visualizations at the top of each column, surfaces data insights, detects data quality issues, and suggests solutions to improve data quality. When you select a column as the target column for ML predictions, you get target-specific insights and warnings, such as mixed data types in target (for regression use cases) or too few instances per class (for classification use cases).\n\nIn this example, I’m using the [Women’s E-Commerce Clothing Reviews](https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews) dataset that contains customer reviews and ratings for women’s clothing. This dataset was obtained from [Kaggle](https://www.kaggle.com/) and has been modified by Amazon to add synthetic data quality issues.\n\n![image.png](https://dev-media.amazoncloud.cn/d08b2199d6034771bfbfe9b11b1f52d8_image.png)\n\n\nYou can review the suggested data transformations to improve the data quality and apply them right in the UI. For a list of all supported data transformations, have a look at the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-notebook-dataprep-assistant.html). 
Once you apply a data transformation, SageMaker Studio notebooks automatically generate the code to reproduce those data preparation steps in another notebook cell.\n\n\nFor my example, I select ```Rating``` as my target column. The target column insights show a high-priority warning that this column has too few instances per class and a medium-priority warning that the classes are too imbalanced. Let’s follow the suggestions and drop rare target values and drop missing values. I will also follow the suggestions for some of the feature columns and drop missing values in the ```Review Text``` column and drop the ```Division Name``` column.\n\nOnce I apply the transformations, the notebook generates this code for me:\n\n```\\n# Pandas code generated by sagemaker_datawrangler \\noutput_df = df.copy(deep=True) \\n\\n\\n# Code to Drop rare target values for column: Rating to resolve warning: Too few instances per class \\nrare_target_labels_to_drop = ['-100', '100'] \\noutput_df = output_df[~output_df['Rating'].isin(rare_target_labels_to_drop)] \\n\\n\\n# Code to Drop missing for column: Rating to resolve warning: Missing values \\noutput_df = output_df[output_df['Rating'].notnull()]\\n\\n\\n# Code to Drop missing for column: Review Text to resolve warning: Missing values \\noutput_df = output_df[output_df['Review Text'].notnull()]\\n\\n\\n# Code to Drop column for column: Division Name to resolve warning: Missing values \\noutput_df=output_df.drop(columns=['Division Name'])\\n```\n\nI can now review and modify the code if needed or start integrating the data transformations as part of my ML development workflow.\n\n### ++Introducing Shared Spaces for Team-Based Sharing and Real-Time Collaboration++\n\nSageMaker Studio now offers shared spaces that give data science and ML teams a workspace where they can read, edit, and run notebooks together in real time to streamline collaboration and communication during the development process. 
Shared spaces provide a shared [Amazon EFS](https://aws.amazon.com/efs/) directory that you can use to share files within a shared space. All taggable SageMaker resources that you create in a shared space are automatically tagged, which helps you organize your ML resources, such as training jobs, experiments, and models, and filter them down to those relevant to the business problem you work on in the space. This also helps you monitor costs and plan budgets using tools such as [AWS Budgets](https://aws.amazon.com/aws-cost-management/aws-budgets/) and [AWS Cost Explorer](https://aws.amazon.com/aws-cost-management/aws-cost-explorer/).\n\n\nAnd that’s not all. You can now also create multiple SageMaker domains within the same AWS account to scope access and isolate resources for different teams or business units in your organization. Now, let me show you how to create a shared space for users within a SageMaker domain.\n\n### ++Using Shared Spaces++\nYou can use the [SageMaker console](https://console.aws.amazon.com/sagemaker) or the [AWS CLI](https://aws.amazon.com/cli/) to create shared spaces for a SageMaker domain. To get started in the SageMaker console, go to **Domains**, select or create a new domain, and select **Space management** on the **Domain details** page. Then, select **Create** and give the shared space a name.\n\n\n![image.png](https://dev-media.amazoncloud.cn/14f2b271ea7640bf9c44f61514b3e9d5_image.png)\n\n\nUsers in this SageMaker domain can now launch and join the shared space through their SageMaker domain user profiles.\n\n![image.png](https://dev-media.amazoncloud.cn/b3c7f53123064c84a8f9d69227027382_image.png)\n\nIn a shared space, select the new **Collaborators** icon in the left navigation menu. You can now see who else is currently active in this space. The following screenshot shows user **tom** on the left, editing a notebook file. 
On the right, user **antje** sees the edits in real time, together with an annotation showing the name of the user currently editing that notebook cell.\n\n![image.png](https://dev-media.amazoncloud.cn/121dfeddb5a94127b79eb17d3e4c9e46_image.png)\n\n### ++New Notebook Capability to Automatically Convert Notebook Code to Production-Ready Jobs++\n\n\nYou can now select a notebook and automate it as a job that can run in a production environment without the need to manage the underlying infrastructure. When you create a **SageMaker Notebook Job**, SageMaker Studio takes a snapshot of the entire notebook, packages its dependencies in a container, builds the infrastructure, runs the notebook as an automated job on a schedule you define, and deprovisions the infrastructure upon job completion. This notebook capability is now also available in [SageMaker Studio Lab](https://studiolab.sagemaker.aws/), our free ML development environment that provides the compute, storage, and security to learn and experiment with ML.\n\n### ++Using the Notebook Capability to Automate Notebooks++\nTo get started, open a notebook file in SageMaker Studio. Then, right-click your notebook file and select **Create Notebook Job** or select the **Create Notebook Job** icon, as highlighted in the following screenshot.\n\n![image.png](https://dev-media.amazoncloud.cn/2ea26e71d87d47f7a96c630db9ccc9d4_image.png)\n\nDefine a name for the **Notebook Job**, review the input file location, specify the compute type to use, and choose whether to run the job immediately or on a schedule. 
Then, select **Create**.\n\n![image.png](https://dev-media.amazoncloud.cn/e6f1fbeeec234038bf59a918e0728fb5_image.png)\n\nOnce the Notebook Job has been created, you can review all Notebook Job Definitions in the UI.\n\n![image.png](https://dev-media.amazoncloud.cn/ae1880d40c3743dcbdbf82bcd48483c8_image.png)\n\n### ++Now Available++\nThe new Amazon SageMaker Studio notebook capabilities are now available in all [AWS Regions](https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/) where Amazon SageMaker Studio is available except for the AWS China Regions.\n\nAt launch, the built-in data preparation capability powered by SageMaker Data Wrangler is supported for SageMaker Studio notebooks and the following notebook kernel images:\n\n- Python 3 (Data Science) with Python 3.7\n- Python 3 (Data Science 2.0) with Python 3.8\n- Python 3 (Data Science 3.0) with Python 3.10\n- Spark Analytics 1.0 and 2.0\n\n\nFor more information, visit [Amazon SageMaker Notebooks](https://aws.amazon.com/sagemaker/notebooks/).\n\n**[Start building your ML projects with the next generation of Amazon SageMaker Notebooks today!](https://console.aws.amazon.com/sagemaker/home)**\n\n– [Antje](https://twitter.com/anbarth)\n\n![image.png](https://dev-media.amazoncloud.cn/2367ed3d6fe84f77a6fd3f6dd154d2e6_image.png)\n\n### Antje Barth\nAntje Barth is a Principal Developer Advocate for AI and ML at AWS. She is co-author of the O’Reilly book Data Science on AWS. Antje frequently speaks at AI/ML conferences, events, and meetups around the world. 
She also co-founded the Düsseldorf chapter of Women in Big Data.\n\n\n\n"}