Manage AutoML workflows with AWS Step Functions and AutoGluon on Amazon SageMaker

Running machine learning (ML) experiments in the cloud can span many services and components. The ability to structure, automate, and track ML experiments is essential for the rapid development of ML models. With the latest advancements in the field of automated machine learning (AutoML), the area of ML dedicated to automating ML processes, you can build accurate decision-making models without needing deep ML knowledge. In this post, we look at AutoGluon, an open-source AutoML framework that allows you to build accurate ML models with just a few lines of Python.

AWS offers a wide range of services to manage and run ML workflows, allowing you to select a solution based on your skills and application. For example, if you already use [AWS Step Functions](https://aws.amazon.com/step-functions/) to orchestrate the components of distributed applications, you can use the same service to build and automate your ML workflows. Other MLOps tools offered by AWS include [Amazon SageMaker Pipelines](https://aws.amazon.com/sagemaker/pipelines/), which enables you to build ML models in [Amazon SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html) with MLOps capabilities (such as CI/CD compatibility, model monitoring, and model approvals). Open-source tools, such as [Apache Airflow](https://airflow.apache.org/) (available on AWS through [Amazon Managed Workflows for Apache Airflow](https://aws.amazon.com/managed-workflows-for-apache-airflow/)) and [KubeFlow](https://www.kubeflow.org/docs/distributions/aws/), as well as hybrid solutions, are also supported.
For example, you can manage data ingestion and processing with Step Functions while training and deploying your ML models with SageMaker Pipelines.

In this post, we show how even developers without ML expertise can easily build and maintain state-of-the-art ML models using AutoGluon on [Amazon SageMaker](https://aws.amazon.com/sagemaker/) and Step Functions to orchestrate workflow components.

After an overview of the AutoGluon algorithm, we present the workflow definitions along with examples and a [code tutorial](https://github.com/aws-samples/aws-stepfunctions-automl-workflow) that you can apply to your own data.

### **AutoGluon**

AutoGluon is an open-source AutoML framework that accelerates the adoption of ML by training accurate ML models with just a few lines of Python code. Although this post focuses on tabular data, AutoGluon also allows you to train state-of-the-art models for image classification, object detection, and text classification. AutoGluon-Tabular creates and combines different models to find the optimal solution.

The AutoGluon team at AWS released a [paper](https://arxiv.org/abs/2003.06505) that presents the principles that structure the library:

- **Simplicity** – You can create classification and regression models directly from raw data without having to analyze the data or perform feature engineering
- **Robustness** – The overall training process should succeed even if some of the individual models fail
- **Predictable timing** – You can get optimal results within the time that you want to invest for training
- **Fault tolerance** – You can stop the training and resume it at any time, which optimizes costs if the process runs on Spot Instances in the cloud

For more details about the algorithm, refer to the [paper](https://arxiv.org/abs/2003.06505).

After you install the [AutoGluon package](https://auto.gluon.ai/stable/index.html#installation) and its dependencies, training a
model is as easy as writing three lines of code:

```
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset('s3://my-bucket/datasets/my-csv.csv')
predictor = TabularPredictor(label="my-label", path="my-output-folder").fit(train_data)
```

The AutoGluon team proved the strength of the framework by reaching the top 10 of the leaderboard in multiple Kaggle competitions.

### **Solution overview**

We use Step Functions to implement an ML workflow that covers training, evaluation, and deployment. The pipeline design enables fast and configurable experiments by modifying the input parameters that you feed into the pipeline at runtime.

You can configure the pipeline to implement different workflows, such as the following:

- Train a new ML model and store it in the SageMaker model registry, if no deployment is needed at this point
- Deploy a pre-trained ML model, either for online ([SageMaker endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html)) or offline ([SageMaker batch transform](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html)) inference
- Run a complete pipeline to train, evaluate, and deploy an ML model from scratch

The solution consists of a general [state machine](https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-state-machine-structure.html) (see the following diagram) that orchestrates the set of actions to be run based on a set of input parameters.

![image.png](https://dev-media.amazoncloud.cn/82bef8fda488434caf3a122a840c8dec_image.png)

The steps of the state machine are as follows:

1. The first step, ```IsTraining```, decides whether we’re using a pre-trained model or training a model from scratch. If using a pre-trained model, the state machine skips to Step 7.
2. When a new ML model is required, ```TrainSteps``` triggers a second state machine that performs all the necessary actions and returns the result to the current state machine. We go into more detail on the training state machine in the next section.
3. When training is finished, ```PassModelName``` stores the training job name in a specified location of the state machine context to be reused in the following states.
4. If an evaluation phase is selected, ```IsEvaluation``` redirects the state machine towards the evaluation branch. Otherwise, it skips to Step 7.
5. The evaluation phase is implemented using an [AWS Lambda](http://aws.amazon.com/lambda) function invoked by the ```ModelValidation``` step. The Lambda function retrieves the model’s performance on a test set and compares it with a user-configurable threshold specified in the input parameters. The following code is an example of evaluation results:

```
"Payload":{
   "IsValid":true,
   "Scores":{
      "accuracy":0.9187,
      "balanced_accuracy":0.7272,
      "mcc":0.5403,
      "roc_auc":0.9489,
      "f1":0.5714,
      "precision":0.706,
      "recall":0.4799
   }
}
```

6. If the model evaluation at ```EvaluationResults``` is successful, the state machine continues with the eventual deployment steps. If the model performs below a user-defined criterion, the state machine stops and deployment is skipped.
7. If deployment is selected, ```IsDeploy``` starts a third state machine through ```DeploySteps```, which we describe later in this post. If deployment is not needed, the state machine stops here.

A set of input parameter samples is available on the [GitHub repo](https://github.com/aws-samples/aws-stepfunctions-automl-workflow).

### **Training state machine**

The state machine for training a new ML model using AutoGluon comprises two steps, as illustrated in the following diagram. The first step is a SageMaker training job that creates the model.
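In Amazon States Language, a two-step training machine of this shape can be sketched as follows. This is a simplified illustration, not the repo's actual definition: the values in angle brackets are placeholders, and the second state uses the basic ```createModel``` service integration for brevity.

```
{
  "Comment": "Illustrative sketch of a two-step training state machine",
  "StartAt": "TrainModel",
  "States": {
    "TrainModel": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
      "Parameters": {
        "TrainingJobName.$": "$.TrainingJobName",
        "RoleArn": "<execution-role-arn>",
        "AlgorithmSpecification": {
          "TrainingImage": "<autogluon-training-image-uri>",
          "TrainingInputMode": "File"
        },
        "InputDataConfig": [
          {
            "ChannelName": "training",
            "DataSource": {
              "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://<bucket>/datasets/"
              }
            }
          }
        ],
        "OutputDataConfig": { "S3OutputPath": "s3://<bucket>/output/" },
        "ResourceConfig": {
          "InstanceCount": 1,
          "InstanceType": "ml.m5.2xlarge",
          "VolumeSizeInGB": 30
        },
        "StoppingCondition": { "MaxRuntimeInSeconds": 86400 }
      },
      "Next": "SaveModel"
    },
    "SaveModel": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sagemaker:createModel",
      "Parameters": {
        "ModelName.$": "$.TrainingJobName",
        "ExecutionRoleArn": "<execution-role-arn>",
        "PrimaryContainer": {
          "Image": "<autogluon-inference-image-uri>",
          "ModelDataUrl.$": "$.ModelArtifacts.S3ModelArtifacts"
        }
      },
      "End": true
    }
  }
}
```

The ```.sync``` suffix makes Step Functions wait for the training job to complete before moving to the next state, which is what allows the job name and artifacts to be passed along to the following states.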
The second step registers the model in the SageMaker model registry.

![image.png](https://dev-media.amazoncloud.cn/28c9b9754ae54b1bbc64c87cfd8f9842_image.png)

You can run these steps either automatically as part of the main state machine, or as a standalone process.

### **Deployment state machine**

Let’s now look at the state machine dedicated to the deployment phase (see the following diagram). As mentioned earlier, the architecture supports both online and offline deployment. The former consists of deploying a SageMaker endpoint, whereas the latter runs a SageMaker batch transform job.

![image.png](https://dev-media.amazoncloud.cn/032316e43ea24f738625dd462383cc25_image.png)

The implementation steps are as follows:

1. ```ChoiceDeploymentMode``` looks into the input parameters to define which deployment mode is needed and directs the state machine towards the corresponding branch.
2. If an endpoint is chosen, the ```EndpointConfig``` step defines its configuration, while ```CreateEndpoint``` starts the process of allocating the required computing resources. This allocation can take several minutes, so the state machine pauses at ```WaitForEndpoint``` and uses a Lambda function to poll the endpoint status.
3. While the endpoint is still being configured, ```ChoiceEndpointStatus``` loops back to the ```WaitForEndpoint``` state; otherwise, it continues to either ```DeploymentFailed``` or ```DeploymentSucceeded```.
4. If offline deployment is selected, the state machine runs a SageMaker batch transform job, after which the state machine stops.

### **Conclusion**

This post presents an easy-to-use pipeline to orchestrate AutoML workflows and enable fast experiments in the cloud, allowing for accurate ML solutions without requiring advanced ML knowledge.

We provide a general pipeline as well as two modular ones that allow you to perform training and deployment separately if needed.
Moreover, the solution is fully integrated with SageMaker, benefiting from its features and computational resources.

Get started now with this [code tutorial](https://github.com/aws-samples/aws-stepfunctions-automl-workflow) to deploy the resources presented in this post into your AWS account and run your first AutoML experiments.

### **About the Authors**

![image.png](https://dev-media.amazoncloud.cn/d69919856b0340a39c3cc77c348b6eab_image.png)

**Federico Piccinini** is a Deep Learning Architect for the Amazon Machine Learning Solutions Lab. He is passionate about machine learning, explainable AI, and MLOps. He focuses on designing ML pipelines for AWS customers. Outside of work, he enjoys sports and pizza.

![image.png](https://dev-media.amazoncloud.cn/98b9479f4b3e405ab8d2f95814c6e4f5_image.png)

**Paolo Irrera** is a Data Scientist at the Amazon Machine Learning Solutions Lab, where he helps customers address business problems with ML and cloud capabilities. He holds a PhD in Computer Vision from Telecom ParisTech, Paris.