Simplify iterative machine learning model development by adding features to existing feature groups in Amazon SageMaker Feature Store

{"value":"Feature engineering is one of the most challenging aspects of the machine learning (ML) lifecycle and a phase where the most amount of time is spent—data scientists and ML engineers spend 60–70% of their time on feature engineering. AWS introduced [Amazon SageMaker Feature Store](https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store.html) during AWS re:Invent 2020, which is a purpose-built, fully managed, centralized store for features and associated metadata. Features are signals extracted from data to train ML models. The advantage of Feature Store is that the feature engineering logic is authored one time, and the features generated are stored on a central platform. The central store of features can be used for training and inference and be reused across different data engineering teams.\n\nFeatures in a feature store are stored in a collection called feature group. A feature group is analogous to a database table schema where columns represent features and rows represent individual records. Feature groups have been immutable since Feature Store was introduced. If we had to add features to an existing feature group, the process was cumbersome—we had to create a new feature group, backfill the new feature group with historical data, and modify downstream systems to use this new feature group. ML development is an iterative process of trial and error where we may identify new features continuously that can improve model performance. It’s evident that not being able to add features to feature groups can lead to a complex ML model development lifecycle.\n\nFeature Store [recently introduced](https://aws.amazon.com/about-aws/whats-new/2022/07/amazon-sagemaker-feature-store-new-features-existing-feature-groups/) the ability to add new features to existing feature groups. A feature group schema evolves over time as a result of new business requirements or because new features have been identified that yield better model performance. Data scientists and ML engineers need to easily add features to an existing feature group. This ability reduces the overhead associated with creating and maintaining multiple feature groups and therefore lends itself to iterative ML model development. Model training and inference can take advantage of new features using the same feature group by making minimal changes.\n\nIn this post, we demonstrate how to add features to a feature group using the newly released [UpdateFeatureGroup API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateFeatureGroup.html).\n\n#### **Overview of solution**\n\nFeature Store acts as a single source of truth for feature engineered data that is used in ML training and inference. When we store features in Feature Store, we store them in feature groups.\n\nWe can enable feature groups for [offline only mode, online only mode, or online and offline modes.](https://aws.amazon.com/blogs/machine-learning/getting-started-with-amazon-sagemaker-feature-store/)\n\nAn online store is a low-latency data store and always has the latest snapshot of the data. An offline store has a historical set of records persisted in [Amazon Simple Storage Service](http://aws.amazon.com/s3) ([Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail)). 
The following diagram illustrates the process of feature creation and ingestion into Feature Store.

![image.png](https://dev-media.amazoncloud.cn/1a17e4a9f31541ea8e6bf7bf15a5e4cf_image.png)

The workflow contains the following steps:

1. Define a feature group and create the feature group in Feature Store.
2. Ingest data into the feature group, which writes to the online store immediately and then to the offline store.
3. Use the offline store data stored in Amazon S3 for training one or more models.
4. Use the offline store for batch inference.
5. Use the online store, which supports low-latency reads, for real-time inference.
6. To add a new feature to the feature group, use the new Amazon SageMaker UpdateFeatureGroup API. This also updates the underlying AWS Glue Data Catalog. After the schema has been updated, we can ingest data into the updated feature group and use the updated offline and online stores for model training and inference.

#### **Dataset**

To demonstrate this new functionality, we use a synthetically generated customer dataset. The dataset has a unique ID for each customer, along with sex, marital status, age range, and how long the customer has been actively purchasing.

![image.png](https://dev-media.amazoncloud.cn/4f83e59d7a3e4dc68f17b9f0300aaba9_image.png)

Let's assume a scenario where a business is trying to predict the propensity of a customer to purchase a certain product, and data scientists have developed a model to predict this outcome. Let's also assume that the data scientists have identified a new signal for the customer that could potentially improve model performance and better predict the outcome. We work through this use case to understand how to update the feature group definition to add the new feature, ingest data into this new feature, and finally explore the online and offline stores to verify the changes.

#### **Prerequisites**

For this walkthrough, you should have the following prerequisites:

- An [AWS account](https://signin.aws.amazon.com/signin?redirect_uri=https%3A%2F%2Fportal.aws.amazon.com%2Fbilling%2Fsignup%2Fresume&client_id=signup).
- A [SageMaker Jupyter notebook instance](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-console.html). Access the code from the [Amazon SageMaker Feature Store Update Feature Group GitHub repository](https://github.com/aws-samples/amazon-sagemaker-feature-store-update-feature-group) and upload it to your notebook instance.
- You can also run the notebook in the [Amazon SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html) environment, which is an IDE for ML development.
  You can clone the GitHub repo [via a terminal](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-tasks-git.html) inside the Studio environment using the following command:

```
git clone https://github.com/aws-samples/amazon-sagemaker-feature-store-update-feature-group.git
```

#### **Add features to a feature group**

In this post, we walk through the [update_feature_group.ipynb](https://github.com/aws-samples/amazon-sagemaker-feature-store-update-feature-group/blob/main/update_feature_group.ipynb) notebook, in which we create a feature group, ingest an initial dataset, update the feature group to add a new feature, and re-ingest data that includes the new feature. At the end, we verify the online and offline stores for the updates. The fully functional notebook and sample data can be found in the [GitHub repository](https://github.com/aws-samples/amazon-sagemaker-feature-store-update-feature-group). Let's explore some of the key parts of the notebook here.

1. We create a feature group to store the feature-engineered customer data using the `FeatureGroup.create` API of the SageMaker SDK.

```
customers_feature_group = FeatureGroup(name=customers_feature_group_name,
                                       sagemaker_session=sagemaker_session)

customers_feature_group.create(s3_uri=f's3://{default_bucket}/{prefix}',
                               record_identifier_name='customer_id',
                               event_time_feature_name='event_time',
                               role_arn=role,
                               enable_online_store=True)
```

2. We create a Pandas DataFrame with the initial CSV data (a sketch of this step follows the ingestion code below). We use the current time as the timestamp for the `event_time` feature. This corresponds to the time when the event occurred, that is, when the record is added or updated in the feature group.
3. We ingest the DataFrame into the feature group using the SageMaker SDK `FeatureGroup.ingest` API. This is a small dataset and can therefore be loaded into a Pandas DataFrame. When working with large amounts of data and millions of rows, there are more scalable mechanisms to ingest data into Feature Store, such as [batch ingestion with Apache Spark](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-ingestion-spark-connector-setup.html).

```
customers_feature_group.ingest(data_frame=customers_df,
                               max_workers=3,
                               wait=True)
```
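
The DataFrame preparation mentioned in step 2 takes only a couple of lines of Pandas. The following is a minimal sketch; the CSV file name is illustrative and may differ from the one in the sample notebook:

```
import time
import pandas as pd

# Load the synthetic customer data (file name assumed for illustration)
customers_df = pd.read_csv('customer_data.csv')

# Use the current time, as Unix epoch seconds, for the event_time feature of every record
customers_df['event_time'] = float(round(time.time()))
```
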
4. We can verify that data has been ingested into the feature group by running Athena queries in the notebook (for example, with the `athena_query` helper sketched earlier) or by running queries on the Athena console.
5. After we verify that the offline feature store has the initial data, we add the new feature `has_kids` to the feature group using the Boto3 `update_feature_group` API.

```
sagemaker_runtime.update_feature_group(
    FeatureGroupName=customers_feature_group_name,
    FeatureAdditions=[
        {"FeatureName": "has_kids", "FeatureType": "Integral"}
    ])
```

The Data Catalog is automatically updated as part of this API call. The API supports adding multiple features at a time by specifying them in the `FeatureAdditions` list.

6. We verify that the feature has been added by checking the updated feature group definition.

```
describe_feature_group_result = sagemaker_runtime.describe_feature_group(
    FeatureGroupName=customers_feature_group_name)
pretty_printer.pprint(describe_feature_group_result)
```

The `LastUpdateStatus` in the `describe_feature_group` API response initially shows the status `InProgress`. After the operation is successful, the status changes to `Successful`. If the operation encounters an error for any reason, the status shows as `Failed`, with the detailed error message in `FailureReason`.

![image.png](https://dev-media.amazoncloud.cn/df9279f67b2540d1b201e11700870d35_image.png)

When the `update_feature_group` API is invoked, the control plane reflects the schema change immediately, but the data plane can take up to 5 minutes to update its feature group schema. We must allow enough time for the update operation to complete before proceeding to data ingestion.

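
Rather than sleeping for a fixed interval, one option is to poll `describe_feature_group` and continue only after the update reports success. The following is a minimal sketch that reuses the `sagemaker_runtime` Boto3 client from the earlier steps:

```
import time

def wait_for_feature_group_update(feature_group_name, poll_seconds=30, timeout_seconds=600):
    # Poll until the last update succeeds, fails, or the timeout is reached
    waited = 0
    while waited < timeout_seconds:
        response = sagemaker_runtime.describe_feature_group(FeatureGroupName=feature_group_name)
        status = response.get('LastUpdateStatus', {}).get('Status')
        if status == 'Successful':
            return
        if status == 'Failed':
            raise RuntimeError(response['LastUpdateStatus'].get('FailureReason', 'Update failed'))
        time.sleep(poll_seconds)
        waited += poll_seconds
    raise TimeoutError(f'Feature group update did not complete within {timeout_seconds} seconds')

wait_for_feature_group_update(customers_feature_group_name)
```
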
7. We prepare data for the `has_kids` feature by generating random 1s and 0s to indicate whether or not a customer has kids.

```
customers_df['has_kids'] = np.random.randint(0, 2, customers_df.shape[0])
```

8. We ingest the DataFrame that has the newly added column into the feature group using the SageMaker SDK `FeatureGroup.ingest` API.

```
customers_feature_group.ingest(data_frame=customers_df,
                               max_workers=3,
                               wait=True)
```

9. Next, we verify the feature record in the online store for a single customer using the Boto3 `get_record` API.

```
get_record_result = featurestore_runtime.get_record(
    FeatureGroupName=customers_feature_group_name,
    RecordIdentifierValueAsString=customer_id)
pretty_printer.pprint(get_record_result)
```

![image.png](https://dev-media.amazoncloud.cn/c68cc392248647c5af973dfeccda6a0b_image.png)

10. Let's query the same customer record on the Athena console to verify the offline store. The data is appended to the offline store to preserve historical writes and updates, so we see two records here: a newer record that has the feature set to 1, and an older record that doesn't have this feature and therefore shows an empty value. The offline store is persisted in batches within 15 minutes, so this step could take some time.

![image.png](https://dev-media.amazoncloud.cn/114e7370a1754aa3a7ffeaf63aebfbc7_image.png)

Now that we have this feature added to our feature group, we can extract the new feature into our training dataset and retrain models. The goal of this post is to highlight the ease of modifying a feature group, ingesting data into the new feature, and then using the updated data in the feature group for model training and inference.

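
As a sketch of how that extraction might look, we can pull the offline store into a DataFrame with the same Athena helper shown earlier and keep only the latest record per customer, so that rows written before the schema update (which have an empty `has_kids` value) are dropped. The `customer_id` and `event_time` names match the feature group definition above; the query and output location are illustrative:

```
# Build a training DataFrame that includes the new has_kids feature
training_query = customers_feature_group.athena_query()
training_query.run(query_string=f'SELECT * FROM "{training_query.table_name}"',
                   output_location=f's3://{default_bucket}/{prefix}/athena-results')
training_query.wait()
training_df = training_query.as_dataframe()

# Keep only the most recent write per customer so that older records
# without has_kids are dropped
training_df = (training_df.sort_values('event_time')
                          .drop_duplicates('customer_id', keep='last'))
```
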
#### **Clean up**

Don't forget to clean up the resources created as part of this post to avoid incurring ongoing charges.

1. Delete the S3 objects in the offline store:

```
s3_config = describe_feature_group_result['OfflineStoreConfig']['S3StorageConfig']
s3_uri = s3_config['ResolvedOutputS3Uri']
full_prefix = '/'.join(s3_uri.split('/')[3:])
bucket = s3.Bucket(default_bucket)
offline_objects = bucket.objects.filter(Prefix=full_prefix)
offline_objects.delete()
```

2. Delete the feature group:

```
customers_feature_group.delete()
```

3. Stop the SageMaker Jupyter notebook instance. For instructions, refer to [Clean Up](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-cleanup.html).

#### **Conclusion**

Feature Store is a fully managed, purpose-built repository to store, share, and manage features for ML models. Being able to add features to existing feature groups simplifies iterative model development and alleviates the challenges of creating and maintaining multiple feature groups.

In this post, we showed you how to add features to existing feature groups via the newly released SageMaker `UpdateFeatureGroup` API. The steps shown in this post are available as a Jupyter notebook in the [GitHub repository](https://github.com/aws-samples/amazon-sagemaker-feature-store-update-feature-group). Give it a try and let us know your feedback in the comments.

#### **Further reading**

If you're interested in exploring the complete scenario mentioned earlier in this post, predicting the propensity of a customer to purchase a certain product, check out the following [notebook](https://github.com/aws-samples/amazon-sagemaker-feature-store-end-to-end-workshop/blob/main/03-module-training-and-batch-scoring/m3_nb4_update_feature_group.ipynb), which modifies the feature group, ingests data, and trains an XGBoost model with the data from the updated offline store. This notebook is part of a [comprehensive workshop](https://github.com/aws-samples/amazon-sagemaker-feature-store-end-to-end-workshop) developed to demonstrate Feature Store functionality.

#### **References**

More information is available at the following resources:

- [Create, Store, and Share Features with Amazon SageMaker Feature Store](https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store.html)
- [Amazon Athena User Guide](https://docs.aws.amazon.com/athena/latest/ug/what-is.html)
- [Get Started with Amazon SageMaker Notebook Instances](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-console.html)
- [UpdateFeatureGroup API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateFeatureGroup.html)
- [SageMaker Boto3 update_feature_group API](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.update_feature_group)
- [Getting started with Amazon SageMaker Feature Store](https://aws.amazon.com/blogs/machine-learning/getting-started-with-amazon-sagemaker-feature-store/)

#### **About the authors**

![image.png](https://dev-media.amazoncloud.cn/7fce3fc89f1c4830981d352fb3410f49_image.png)

**Chaitra Mathur** is a Principal Solutions Architect at AWS. She guides customers and partners in building highly scalable, reliable, secure, and cost-effective solutions on AWS. She is passionate about machine learning and helps customers translate their ML needs into solutions using AWS AI/ML services. She holds five certifications, including the ML Specialty certification. In her spare time, she enjoys reading, yoga, and spending time with her daughters.

![image.png](https://dev-media.amazoncloud.cn/d5cafca62465439ba17da17444cd4b39_image.png)

**Mark Roy** is a Principal Machine Learning Architect for AWS, helping customers design and build AI/ML solutions. Mark's work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Mark holds six AWS certifications, including the ML Specialty certification. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services.

![image.png](https://dev-media.amazoncloud.cn/5967fec4713b4880965b846762bca04a_image.png)

**Charu Sareen** is a Sr. Product Manager for Amazon SageMaker Feature Store. Prior to AWS, she led growth and monetization strategy for SaaS services at VMware. She is a data and machine learning enthusiast and has over a decade of experience spanning product management, data engineering, and advanced analytics. She has a bachelor's degree in Information Technology from the National Institute of Technology, India, and an MBA from the University of Michigan, Ross School of Business.

![image.png](https://dev-media.amazoncloud.cn/b4effb9d50c249d2833bf78794601db2_image.png)

**Frank McQuillan** is a Principal Product Manager for Amazon SageMaker Feature Store. For the last 10 years, he has worked in product management in data, analytics, and AI/ML. Prior to that, he worked in engineering roles in robotics, flight simulation, and online advertising technology. He has a master's degree from the University of Toronto and a bachelor's degree from the University of Waterloo, both in Mechanical Engineering.