{"value":"AWS estimates that inference (the process of using a trained machine learning [ML] algorithm to make a prediction) makes up [90 percent of the cost of an ML model](https://youtu.be/ZOIkOnW640A?t=5362). Because you pay for what you use on AWS, we estimate that inference also generally accounts for most of the resource usage within an ML lifecycle.\n\nIn this series, we’re following the phases of the [Well-Architected machine learning lifecycle](https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/well-architected-machine-learning-lifecycle.html) (Figure 1) to optimize your artificial intelligence (AI)/ML workloads. In Part 3, our final piece in the series, we show you how to reduce the environmental impact of your ML workload once your model is in production.\n\nIf you missed the first parts of this series, in [Part 1](https://aws.amazon.com/blogs/architecture/optimize-ai-ml-workloads-for-sustainability-part-1-identify-business-goals-validate-ml-use-and-process-data/), we showed you how to examine your workload to help you 1) evaluate the impact of your workload, 2) identify alternatives to training your own model, and 3) optimize data processing. In [Part 2](https://aws.amazon.com/blogs/architecture/optimize-ai-ml-workloads-for-sustainability-part-2-model-development/), we identified ways to reduce the environmental impact of developing, training, and tuning ML models.\n\n![image.png](https://dev-media.amazoncloud.cn/e19de7a504a64929a0c7d51cc6fb4f9f_image.png)\n\nFigure 1. ML lifecycle\n\n### **Deployment**\n##### **Select sustainable AWS Regions**\nAs mentioned in Part 1, select an AWS Region with sustainable energy sources. 
When regulations and legal aspects allow, choose Regions [near Amazon renewable energy projects](https://sustainability.aboutamazon.com/about/around-the-globe?energyType=true) and Regions where the grid has low published carbon intensity to deploy your model.\n\n##### **Align SLAs with sustainability goals**\nDefine SLAs that support your sustainability goals while meeting your business requirements:\n\n- If your users can tolerate some latency, deploy your model on [asynchronous endpoints](https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html) to [reduce resources that are idle between tasks and minimize the impact of load spikes.](https://docs.aws.amazon.com/wellarchitected/latest/sustainability-pillar/optimize-software-and-architecture-for-asynchronous-and-scheduled-jobs.html) Asynchronous endpoints will automatically scale the instance count to zero when there are no requests to process, so you only maintain an inference infrastructure when your endpoint is processing requests.\n- If your workload doesn’t require high availability, [deploy it to a single Availability Zone](https://docs.aws.amazon.com/sagemaker/latest/dg/deployment-best-practices.html#deployment-best-practices-availability-zones) to reduce the cloud resources you consume. Adjusting availability is an example of a [conscious trade off](https://docs.aws.amazon.com/wellarchitected/latest/sustainability-pillar/sustainability-as-a-non-functional-requirement.html) you can make to meet your sustainability targets.\n- When you don’t need real-time inference, use Amazon [SageMaker batch transform. ](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html)Unlike persistent endpoints, clusters are decommissioned when batch transform jobs finish so you don’t continuously maintain an inference infrastructure.\n##### **Use efficient silicon**\nFor CPU-based ML inference, use AWS [Graviton3](https://aws.amazon.com/ec2/graviton/). 
These processors offer the best performance per watt in [Amazon Elastic Compute Cloud (Amazon EC2)](http://aws.amazon.com/ec2). They use up to [60% less energy](https://aws.amazon.com/ec2/graviton/) than comparable EC2 instances. Graviton3 processors deliver up to three times better performance compared to Graviton2 processors for ML workloads, and they [support bfloat16.](https://youtu.be/9NEQbFLtDmg?t=3613)\n\nFor deep learning workloads, the [Amazon EC2 ](https://aws.amazon.com/cn/ec2/?trk=cndc-detail)Inf1 instances (based on custom designed [AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/) chips) deliver 2.3 times higher throughput and 80% lower cost compared to g4dn instances. [Inf1 has 50% higher performance per watt than g4dn,](https://youtu.be/Yv3B_Zey83Y?t=732) which makes it the most sustainable ML accelerator [Amazon EC2 ](https://aws.amazon.com/cn/ec2/?trk=cndc-detail)offers.\n\n##### **Make efficient use of GPU**\nUse [Amazon Elastic Inference ](https://aws.amazon.com/machine-learning/elastic-inference/)to attach just the right amount of GPU-powered inference acceleration to any EC2 or SageMaker instance type or [Amazon Elastic Container Service (Amazon ECS)](http://aws.amazon.com/ecs) task.\n\nWhile training jobs batch process hundreds of data samples in parallel, inference jobs usually process a single input in real time, and thus consume a small amount of GPU compute. 
Elastic Inference allows you to reduce the cost and environmental impact of your inference by using GPU resources more efficiently.\n\n##### **Optimize models for inference**\nImprove the efficiency of your models by compiling them into optimized forms with the following:\n\n- Various open-source libraries (like [Treelite](https://treelite.readthedocs.io/en/latest/) for decision tree ensembles)\n- Third-party tools like [Hugging Face Infinity](https://aws.amazon.com/marketplace/pp/prodview-vprkfzlr3xljo), which allows you to speed up transformer models and run inference not only on GPU but also on CPU.\n- [SageMaker Neo](https://aws.amazon.com/sagemaker/neo/), whose runtime consumes as little as one-tenth the footprint of a deep learning framework and which optimizes models to perform up to 25 times faster with no loss in accuracy ([example with XGBoost](https://aws.amazon.com/blogs/machine-learning/unlock-performance-gains-with-xgboost-amazon-sagemaker-neo-and-serverless-artillery/)).\n\nDeploying more efficient models means you need fewer resources for inference.\n\n##### **Deploy multiple models behind a single endpoint**\nSageMaker provides three methods to deploy multiple models to a single endpoint to improve endpoint utilization:\n\n1. Host [multiple models in one container behind one endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html). Multi-model endpoints are served using a single container. This can help you [cut up to 90 percent](https://aws.amazon.com/blogs/machine-learning/save-on-inference-costs-by-using-amazon-sagemaker-multi-model-endpoints/) of your inference costs and carbon emissions.\n2. Host [multiple models that use different containers behind one endpoint.](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-container-endpoints.html)\n3. 
Host a linear sequence of containers in an [inference pipeline behind a single endpoint.](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html)\n\nSharing endpoint resources is more sustainable and less expensive than deploying a single model behind one endpoint.\n\n##### **Right-size your inference environment**\nRight-size your endpoints by using metrics from [Amazon CloudWatch](http://aws.amazon.com/cloudwatch) or by using the [Amazon SageMaker Inference Recommender.](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender.html) This tool can run load testing jobs and recommend the proper instance type to host your model. When you use the appropriate instance type, you limit the carbon emission associated with over-provisioning.\n\nIf your workload has intermittent or unpredictable traffic, [configure autoscaling inference endpoints in SageMaker](https://aws.amazon.com/blogs/machine-learning/configuring-autoscaling-inference-endpoints-in-amazon-sagemaker/) to optimize your endpoints. Autoscaling monitors your endpoints and dynamically adjusts their capacity to maintain steady and predictable performance using as few resources as possible. You can also try [Serverless Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html) (in preview), which automatically launches compute resources and scales them in and out depending on traffic, which eliminates idle resources.\n\n##### **Consider inference at the edge**\nWhen working on Internet of Things (IoT) use cases, evaluate if ML inference at the edge can reduce the carbon footprint of your workload. To do this, consider factors like the compute capacity of your devices, their energy consumption, or the emissions related to data transfer to the cloud. 
When [deploying ML models to edge devices](https://aws.amazon.com/blogs/machine-learning/build-machine-learning-at-the-edge-applications-using-amazon-sagemaker-edge-manager-and-aws-iot-greengrass-v2/), consider using [SageMaker Edge Manager](https://aws.amazon.com/sagemaker/edge-manager/), which integrates with SageMaker Neo and [AWS IoT Greengrass](https://aws.amazon.com/greengrass/) (Figure 2).\n\n![image.png](https://dev-media.amazoncloud.cn/9b3682cc1e60487ba629865f10538c9b_image.png)\n\nFigure 2. Run inference at the edge with SageMaker Edge\n\nDevice manufacturing represents [32-57 percent of the global Information Communication Technology carbon footprint](https://www.cell.com/patterns/fulltext/S2666-3899(21)00188-4). If your ML model is optimized, it requires fewer compute resources. You can then perform inference on [lower specification machines](https://aws.amazon.com/blogs/architecture/optimizing-your-iot-devices-for-environmental-sustainability/), which minimizes the environmental impact of the device manufacturing and uses less energy.\n\nThe following techniques compress the size of models for deployment, which speeds up inference and saves energy without significant loss of accuracy:\n\n- [Pruning](https://aws.amazon.com/blogs/machine-learning/pruning-machine-learning-models-with-amazon-sagemaker-debugger-and-amazon-sagemaker-experiments/) removes weights (learnable parameters) that don’t contribute much to the model.\n- [Quantization](https://aws.amazon.com/blogs/machine-learning/reduce-ml-inference-costs-on-amazon-sagemaker-with-hardware-and-software-acceleration/) represents numbers with low-bit integers without incurring significant loss in accuracy. 
Specifically, you can reduce resource usage by replacing the parameters in an inference model with half-precision (16 bit), bfloat16 (16 bit, but the same dynamic range as 32 bit), or 8-bit integers instead of the usual single-precision floating-point (32 bit) values.\n##### **Archive or delete unnecessary artifacts**\nCompress and reduce the volume of logs you keep during the inference phase. By default, CloudWatch retains logs indefinitely. By [setting limited retention time](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Working-with-log-groups-and-streams.html#SettingLogRetention) for your inference logs, you’ll avoid the carbon footprint of unnecessary log storage. Also delete unused versions of your models and [custom container images](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html) from your repositories.\n\n### **Monitoring**\n##### **Retrain only when necessary**\nMonitor your ML model in production and only retrain if it’s required. Because of model drift, robustness, or new ground truth data being available, models usually need to be retrained. 
Instead of retraining arbitrarily, monitor your ML model in production, automate your [model drift detection](https://aws.amazon.com/sagemaker/model-monitor/), and only retrain when your model’s predictive performance has fallen below [defined KPIs.](https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/mlper-01.html)\n\nConsider [SageMaker Pipelines](http://aws.amazon.com/sagemaker/pipelines/), [AWS Step Functions Data Science SDK for Amazon SageMaker](https://aws.amazon.com/about-aws/whats-new/2019/11/introducing-aws-step-functions-data-science-sdk-amazon-sagemaker/), or third-party tools to automate your retraining pipelines.\n\n### **Measure results and improve**\nTo monitor and quantify improvements during the inference phase, track the following metrics:\n\n- [Resources provisioned for your endpoints](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariant.html) (```InstanceType``` and ```AcceleratorType```)\n- [Efficient use of these resources](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html#cloudwatch-metrics-jobs) (```CPUUtilization```, ```GPUUtilization```, ```GPUMemoryUtilization```, ```MemoryUtilization```, and ```DiskUtilization```) in the [CloudWatch Console](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html)\n\nFor storage:\n\n- The total size of the [data captured](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-data-capture.html) by [Amazon SageMaker Model Monitor](https://aws.amazon.com/sagemaker/model-monitor/) using [Amazon S3 Storage Lens](https://aws.amazon.com/s3/storage-analytics-insights/)\n- The [size of your CloudWatch log groups](https://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/API_LogGroup.html#CWL-Type-LogGroup-storedBytes)\n### **Conclusion**\nAI/ML workloads can be energy intensive, but as [called out by the UN](https://www.sparkblue.org/event/CODES) and mentioned in the last [IPCC 
report](https://www.ipcc.ch/report/ar6/wg3/), AI can contribute to mitigation of climate change and the achievement of several [Sustainable Development Goals](https://sdgs.un.org/goals). As technology builders, it’s our [responsibility](https://docs.aws.amazon.com/wellarchitected/latest/sustainability-pillar/the-shared-responsibility-model.html) to make sustainable use of AI and ML.\n\nIn this blog post series, we presented best practices you can use to make [sustainability-conscious architectural decisions](https://docs.aws.amazon.com/wellarchitected/latest/sustainability-pillar/sustainability-pillar.html) and reduce the environmental impact for your AI/ML workloads.\n\n### **Other posts in this series**\n- [Optimize AI/ML workloads for sustainability: Part 1, identify business goals, validate ML use, and process data](https://aws.amazon.com/blogs/architecture/optimize-ai-ml-workloads-for-sustainability-part-1-identify-business-goals-validate-ml-use-and-process-data/)\n- [Optimize AI/ML workloads for sustainability: Part 2, model development](https://aws.amazon.com/blogs/architecture/optimize-ai-ml-workloads-for-sustainability-part-2-model-development/)\n### **About the Well-Architected Framework**\nThese practices are part of the [Sustainability Pillar of the AWS Well-Architected Framework](https://docs.aws.amazon.com/wellarchitected/latest/sustainability-pillar/sustainability-pillar.html). AWS Well-Architected is a set of guiding design principles developed by AWS to help organizations build secure, high-performing, resilient, and efficient infrastructure for a variety of applications and workloads. Use the [AWS Well-Architected Tool](https://aws.amazon.com/well-architected-tool/) to review your workloads periodically to address important design considerations and ensure that they follow the best practices and guidance of the AWS Well-Architected Framework. 
For follow up questions or comments, join our growing community on [AWS re:Post.](https://repost.aws/)\n##### **Benoit de Chateauvieux**\n![image.png](https://dev-media.amazoncloud.cn/38105f6a219946719892b9c914ff88ce_image.png)\n\nBenoit de Chateauvieux is a Startup Solutions Architect at AWS, based in Montreal, Canada. As a former CTO, he enjoys helping startups build great and sustainable products using the cloud. Outside of work, you’ll find Benoit in canoe-camping expeditions, paddling across Canadian rivers.\n##### **Eddie Pick**\n![image.png](https://dev-media.amazoncloud.cn/0493b67e91874d038ecbcf568ecbece6_image.png)\n\nEddie Pick is a Senior Startup Solutions Architect at AWS based in Montréal, Canada. As an ex co-founder and former CTO, his goal is to help startups build great products faster on the Cloud, particularly using machine learning.\n##### **Dan Ferguson**\n![image.png](https://dev-media.amazoncloud.cn/31e124387330489092b4d804e9f7f7eb_image.png)\n\nDan Ferguson is a Solutions Architect at AWS, based in New York, USA. As a machine learning services expert, Dan works to support customers on their journey to integrating ML workflows efficiently, effectively, and sustainably.\n##### **Brendan Sisson**\n![image.png](https://dev-media.amazoncloud.cn/2ba779bde30a416a8c1c58043477d3d1_image.png)\n\nBrendan Sisson is a Principal Sustainability Solutions Architect at AWS based in London, UK. As a contributor to the Sustainability Pillar of the AWS Well-Architected Framework, he supports customers on how they can optimize their workloads running in the AWS Cloud and how they can use the AWS Cloud to help solve their wider sustainability challenges.","render":"<p>AWS estimates that inference (the process of using a trained machine learning [ML] algorithm to make a prediction) makes up <a href=\\"https://youtu.be/ZOIkOnW640A?t=5362\\" target=\\"_blank\\">90 percent of the cost of an ML model</a>. 
Because you pay for what you use on AWS, we estimate that inference also generally accounts for most of the resource usage within an ML lifecycle.</p>\\n<p>In this series, we’re following the phases of the <a href=\\"https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/well-architected-machine-learning-lifecycle.html\\" target=\\"_blank\\">Well-Architected machine learning lifecycle</a> (Figure 1) to optimize your artificial intelligence (AI)/ML workloads. In Part 3, our final piece in the series, we show you how to reduce the environmental impact of your ML workload once your model is in production.</p>\\n<p>If you missed the first parts of this series, in <a href=\\"https://aws.amazon.com/blogs/architecture/optimize-ai-ml-workloads-for-sustainability-part-1-identify-business-goals-validate-ml-use-and-process-data/\\" target=\\"_blank\\">Part 1</a>, we showed you how to examine your workload to help you 1) evaluate the impact of your workload, 2) identify alternatives to training your own model, and 3) optimize data processing. In <a href=\\"https://aws.amazon.com/blogs/architecture/optimize-ai-ml-workloads-for-sustainability-part-2-model-development/\\" target=\\"_blank\\">Part 2</a>, we identified ways to reduce the environmental impact of developing, training, and tuning ML models.</p>\\n<p><img src=\\"https://dev-media.amazoncloud.cn/e19de7a504a64929a0c7d51cc6fb4f9f_image.png\\" alt=\\"image.png\\" /></p>\n<p>Figure 1. ML lifecycle</p>\n<h3><a id=\\"Deployment_10\\"></a><strong>Deployment</strong></h3>\\n<h5><a id=\\"Select_sustainable_AWS_Regions_11\\"></a><strong>Select sustainable AWS Regions</strong></h5>\\n<p>As mentioned in Part 1, select an AWS Region with sustainable energy sources. 
When regulations and legal aspects allow, choose Regions <a href=\\"https://sustainability.aboutamazon.com/about/around-the-globe?energyType=true\\" target=\\"_blank\\">near Amazon renewable energy projects</a> and Regions where the grid has low published carbon intensity to deploy your model.</p>\\n<h5><a id=\\"Align_SLAs_with_sustainability_goals_14\\"></a><strong>Align SLAs with sustainability goals</strong></h5>\\n<p>Define SLAs that support your sustainability goals while meeting your business requirements:</p>\n<ul>\\n<li>If your users can tolerate some latency, deploy your model on <a href=\\"https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html\\" target=\\"_blank\\">asynchronous endpoints</a> to <a href=\\"https://docs.aws.amazon.com/wellarchitected/latest/sustainability-pillar/optimize-software-and-architecture-for-asynchronous-and-scheduled-jobs.html\\" target=\\"_blank\\">reduce resources that are idle between tasks and minimize the impact of load spikes.</a> Asynchronous endpoints will automatically scale the instance count to zero when there are no requests to process, so you only maintain an inference infrastructure when your endpoint is processing requests.</li>\\n<li>If your workload doesn’t require high availability, <a href=\\"https://docs.aws.amazon.com/sagemaker/latest/dg/deployment-best-practices.html#deployment-best-practices-availability-zones\\" target=\\"_blank\\">deploy it to a single Availability Zone</a> to reduce the cloud resources you consume. Adjusting availability is an example of a <a href=\\"https://docs.aws.amazon.com/wellarchitected/latest/sustainability-pillar/sustainability-as-a-non-functional-requirement.html\\" target=\\"_blank\\">conscious trade off</a> you can make to meet your sustainability targets.</li>\\n<li>When you don’t need real-time inference, use Amazon <a href=\\"https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html\\" target=\\"_blank\\">SageMaker batch transform. 
</a>Unlike persistent endpoints, clusters are decommissioned when batch transform jobs finish so you don’t continuously maintain an inference infrastructure.</li>\\n</ul>\n<h5><a id=\\"Use_efficient_silicon_20\\"></a><strong>Use efficient silicon</strong></h5>\\n<p>For CPU-based ML inference, use AWS <a href=\\"https://aws.amazon.com/ec2/graviton/\\" target=\\"_blank\\">Graviton3</a>. These processors offer the best performance per watt in <a href=\\"http://aws.amazon.com/ec2\\" target=\\"_blank\\">Amazon Elastic Compute Cloud (Amazon EC2)</a>. They use up to <a href=\\"https://aws.amazon.com/ec2/graviton/\\" target=\\"_blank\\">60% less energy</a> than comparable EC2 instances. Graviton3 processors deliver up to three times better performance compared to Graviton2 processors for ML workloads, and they <a href=\\"https://youtu.be/9NEQbFLtDmg?t=3613\\" target=\\"_blank\\">support bfloat16.</a></p>\\n<p>For deep learning workloads, the Amazon EC2 Inf1 instances (based on custom designed <a href=\\"https://aws.amazon.com/machine-learning/inferentia/\\" target=\\"_blank\\">AWS Inferentia</a> chips) deliver 2.3 times higher throughput and 80% lower cost compared to g4dn instances. 
<a href=\\"https://youtu.be/Yv3B_Zey83Y?t=732\\" target=\\"_blank\\">Inf1 has 50% higher performance per watt than g4dn,</a> which makes it the most sustainable ML accelerator Amazon EC2 offers.</p>\\n<h5><a id=\\"Make_efficient_use_of_GPU_25\\"></a><strong>Make efficient use of GPU</strong></h5>\\n<p>Use <a href=\\"https://aws.amazon.com/machine-learning/elastic-inference/\\" target=\\"_blank\\">Amazon Elastic Inference </a>to attach just the right amount of GPU-powered inference acceleration to any EC2 or SageMaker instance type or <a href=\\"http://aws.amazon.com/ecs\\" target=\\"_blank\\">Amazon Elastic Container Service (Amazon ECS)</a> task.</p>\\n<p>While training jobs batch process hundreds of data samples in parallel, inference jobs usually process a single input in real time, and thus consume a small amount of GPU compute. Elastic Inference allows you to reduce the cost and environmental impact of your inference by using GPU resources more efficiently.</p>\n<h5><a id=\\"Optimize_models_for_inference_30\\"></a><strong>Optimize models for inference</strong></h5>\\n<p>Improve the efficiency of your models by compiling them into optimized forms with the following:</p>\n<ul>\\n<li>Various open-source libraries (like <a href=\\"https://treelite.readthedocs.io/en/latest/\\" target=\\"_blank\\">Treelite</a> for decision tree ensembles)</li>\\n<li>Third-party tools like <a href=\\"https://aws.amazon.com/marketplace/pp/prodview-vprkfzlr3xljo\\" target=\\"_blank\\">Hugging Face Infinity</a>, which allows you to speed up transformer models and run inference not only on GPU but also on CPU.</li>\\n<li><a href=\\"https://aws.amazon.com/sagemaker/neo/\\" target=\\"_blank\\">SageMaker Neo</a>, whose runtime consumes as little as one-tenth the footprint of a deep learning framework and which optimizes models to perform up to 25 times faster with no loss in accuracy (<a 
href=\\"https://aws.amazon.com/blogs/machine-learning/unlock-performance-gains-with-xgboost-amazon-sagemaker-neo-and-serverless-artillery/\\" target=\\"_blank\\">example with XGBoost</a>).</li>\\n</ul>\n<p>Deploying more efficient models means you need fewer resources for inference.</p>\n<h5><a id=\\"Deploy_multiple_models_behind_a_single_endpoint_39\\"></a><strong>Deploy multiple models behind a single endpoint</strong></h5>\\n<p>SageMaker provides three methods to deploy multiple models to a single endpoint to improve endpoint utilization:</p>\n<ol>\\n<li>Host <a href=\\"https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html\\" target=\\"_blank\\">multiple models in one container behind one endpoint</a>. Multi-model endpoints are served using a single container. This can help you <a href=\\"https://aws.amazon.com/blogs/machine-learning/save-on-inference-costs-by-using-amazon-sagemaker-multi-model-endpoints/\\" target=\\"_blank\\">cut up to 90 percent</a> of your inference costs and carbon emissions.</li>\\n<li>Host <a href=\\"https://docs.aws.amazon.com/sagemaker/latest/dg/multi-container-endpoints.html\\" target=\\"_blank\\">multiple models that use different containers behind one endpoint.</a></li>\\n<li>Host a linear sequence of containers in an <a href=\\"https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html\\" target=\\"_blank\\">inference pipeline behind a single endpoint.</a></li>\\n</ol>\n<p>Sharing endpoint resources is more sustainable and less expensive than deploying a single model behind one endpoint.</p>\n<h5><a id=\\"Rightsize_your_inference_environment_48\\"></a><strong>Right-size your inference environment</strong></h5>\\n<p>Right-size your endpoints by using metrics from <a href=\\"http://aws.amazon.com/cloudwatch\\" target=\\"_blank\\">Amazon CloudWatch</a> or by using the <a href=\\"https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender.html\\" target=\\"_blank\\">Amazon SageMaker Inference 
Recommender.</a> This tool can run load testing jobs and recommend the proper instance type to host your model. When you use the appropriate instance type, you limit the carbon emission associated with over-provisioning.</p>\\n<p>If your workload has intermittent or unpredictable traffic, <a href=\\"https://aws.amazon.com/blogs/machine-learning/configuring-autoscaling-inference-endpoints-in-amazon-sagemaker/\\" target=\\"_blank\\">configure autoscaling inference endpoints in SageMaker</a> to optimize your endpoints. Autoscaling monitors your endpoints and dynamically adjusts their capacity to maintain steady and predictable performance using as few resources as possible. You can also try <a href=\\"https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html\\" target=\\"_blank\\">Serverless Inference</a> (in preview), which automatically launches compute resources and scales them in and out depending on traffic, which eliminates idle resources.</p>\\n<h5><a id=\\"Consider_inference_at_the_edge_53\\"></a><strong>Consider inference at the edge</strong></h5>\\n<p>When working on Internet of Things (IoT) use cases, evaluate if ML inference at the edge can reduce the carbon footprint of your workload. To do this, consider factors like the compute capacity of your devices, their energy consumption, or the emissions related to data transfer to the cloud. 
When <a href=\\"https://aws.amazon.com/blogs/machine-learning/build-machine-learning-at-the-edge-applications-using-amazon-sagemaker-edge-manager-and-aws-iot-greengrass-v2/\\" target=\\"_blank\\">deploying ML models to edge devices</a>, consider using <a href=\\"https://aws.amazon.com/sagemaker/edge-manager/\\" target=\\"_blank\\">SageMaker Edge Manager</a>, which integrates with SageMaker Neo and <a href=\\"https://aws.amazon.com/greengrass/\\" target=\\"_blank\\">AWS IoT Greengrass</a> (Figure 2).</p>\\n<p><img src=\\"https://dev-media.amazoncloud.cn/9b3682cc1e60487ba629865f10538c9b_image.png\\" alt=\\"image.png\\" /></p>\n<p>Figure 2. Run inference at the edge with SageMaker Edge</p>\n<p>Device manufacturing represents <a href=\\"https://www.cell.com/patterns/fulltext/S2666-3899(21)00188-4\\" target=\\"_blank\\">32-57 percent of the global Information Communication Technology carbon footprint</a>. If your ML model is optimized, it requires fewer compute resources. You can then perform inference on <a href=\\"https://aws.amazon.com/blogs/architecture/optimizing-your-iot-devices-for-environmental-sustainability/\\" target=\\"_blank\\">lower specification machines</a>, which minimizes the environmental impact of the device manufacturing and uses less energy.</p>\\n<p>The following techniques compress the size of models for deployment, which speeds up inference and saves energy without significant loss of accuracy:</p>\n<ul>\\n<li><a href=\\"https://aws.amazon.com/blogs/machine-learning/pruning-machine-learning-models-with-amazon-sagemaker-debugger-and-amazon-sagemaker-experiments/\\" target=\\"_blank\\">Pruning</a> removes weights (learnable parameters) that don’t contribute much to the model.</li>\\n<li><a href=\\"https://aws.amazon.com/blogs/machine-learning/reduce-ml-inference-costs-on-amazon-sagemaker-with-hardware-and-software-acceleration/\\" target=\\"_blank\\">Quantization</a> represents numbers with low-bit integers without incurring significant loss in 
accuracy. Specifically, you can reduce resource usage by replacing the parameters in an inference model with half-precision (16 bit), bfloat16 (16 bit, but the same dynamic range as 32 bit), or 8-bit integers instead of the usual single-precision floating-point (32 bit) values.</li>\\n</ul>\n<h5><a id=\\"Archive_or_delete_unnecessary_artifacts_66\\"></a><strong>Archive or delete unnecessary artifacts</strong></h5>\\n<p>Compress and reduce the volume of logs you keep during the inference phase. By default, CloudWatch retains logs indefinitely. By <a href=\\"https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Working-with-log-groups-and-streams.html#SettingLogRetention\\" target=\\"_blank\\">setting limited retention time</a> for your inference logs, you’ll avoid the carbon footprint of unnecessary log storage. Also delete unused versions of your models and <a href=\\"https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html\\" target=\\"_blank\\">custom container images</a> from your repositories.</p>\\n<h3><a id=\\"Monitoring_69\\"></a><strong>Monitoring</strong></h3>\\n<h5><a id=\\"Retrain_only_when_necessary_70\\"></a><strong>Retrain only when necessary</strong></h5>\\n<p>Monitor your ML model in production and only retrain if it’s required. Because of model drift, robustness, or new ground truth data being available, models usually need to be retrained. 
Instead of retraining arbitrarily, monitor your ML model in production, automate your <a href=\\"https://aws.amazon.com/sagemaker/model-monitor/\\" target=\\"_blank\\">model drift detection</a>, and only retrain when your model’s predictive performance has fallen below <a href=\\"https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/mlper-01.html\\" target=\\"_blank\\">defined KPIs.</a></p>\\n<p>Consider <a href=\\"http://aws.amazon.com/sagemaker/pipelines/\\" target=\\"_blank\\">SageMaker Pipelines</a>, <a href=\\"https://aws.amazon.com/about-aws/whats-new/2019/11/introducing-aws-step-functions-data-science-sdk-amazon-sagemaker/\\" target=\\"_blank\\">AWS Step Functions Data Science SDK for Amazon SageMaker</a>, or third-party tools to automate your retraining pipelines.</p>\\n<h3><a id=\\"Measure_results_and_improve_75\\"></a><strong>Measure results and improve</strong></h3>\\n<p>To monitor and quantify improvements during the inference phase, track the following metrics:</p>\n<ul>\\n<li><a href=\\"https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariant.html\\" target=\\"_blank\\">Resources provisioned for your endpoints</a> (<code>InstanceType</code> and <code>AcceleratorType</code>)</li>\\n<li><a href=\\"https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html#cloudwatch-metrics-jobs\\" target=\\"_blank\\">Efficient use of these resources</a> (<code>CPUUtilization</code>, <code>GPUUtilization</code>, <code>GPUMemoryUtilization</code>, <code>MemoryUtilization</code>, and <code>DiskUtilization</code>) in the <a href=\\"https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html\\" target=\\"_blank\\">CloudWatch Console</a></li>\\n</ul>\n<p>For storage:</p>\n<ul>\\n<li>The total size of the <a href=\\"https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-data-capture.html\\" target=\\"_blank\\">data captured</a> by <a href=\\"https://aws.amazon.com/sagemaker/model-monitor/\\" 
target=\\"_blank\\">Amazon SageMaker Model Monitor</a> using <a href=\\"https://aws.amazon.com/s3/storage-analytics-insights/\\" target=\\"_blank\\">Amazon S3 Storage Lens</a></li>\\n<li>The <a href=\\"https://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/API_LogGroup.html#CWL-Type-LogGroup-storedBytes\\" target=\\"_blank\\">size of your CloudWatch log groups</a></li>\\n</ul>\n<h3><a id=\\"Conclusion_85\\"></a><strong>Conclusion</strong></h3>\\n<p>AI/ML workloads can be energy intensive, but as <a href=\\"https://www.sparkblue.org/event/CODES\\" target=\\"_blank\\">called out by the UN</a> and mentioned in the latest <a href=\\"https://www.ipcc.ch/report/ar6/wg3/\\" target=\\"_blank\\">IPCC report</a>, AI can contribute to the mitigation of climate change and the achievement of several <a href=\\"https://sdgs.un.org/goals\\" target=\\"_blank\\">Sustainable Development Goals</a>. As technology builders, it’s our <a href=\\"https://docs.aws.amazon.com/wellarchitected/latest/sustainability-pillar/the-shared-responsibility-model.html\\" target=\\"_blank\\">responsibility</a> to make sustainable use of AI and ML.</p>\\n<p>In this blog post series, we presented best practices you can use to make <a href=\\"https://docs.aws.amazon.com/wellarchitected/latest/sustainability-pillar/sustainability-pillar.html\\" target=\\"_blank\\">sustainability-conscious architectural decisions</a> and reduce the environmental impact of your AI/ML workloads.</p>\\n<h3><a id=\\"Other_posts_in_this_series_90\\"></a><strong>Other posts in this series</strong></h3>\\n<ul>\\n<li><a href=\\"https://aws.amazon.com/blogs/architecture/optimize-ai-ml-workloads-for-sustainability-part-1-identify-business-goals-validate-ml-use-and-process-data/\\" target=\\"_blank\\">Optimize AI/ML workloads for sustainability: Part 1, identify business goals, validate ML use, and process data</a></li>\\n<li><a 
href=\\"https://aws.amazon.com/blogs/architecture/optimize-ai-ml-workloads-for-sustainability-part-2-model-development/\\" target=\\"_blank\\">Optimize AI/ML workloads for sustainability: Part 2, model development</a></li>\\n</ul>\n<h3><a id=\\"About_the_WellArchitected_Framework_93\\"></a><strong>About the Well-Architected Framework</strong></h3>\\n<p>These practices are part of the <a href=\\"https://docs.aws.amazon.com/wellarchitected/latest/sustainability-pillar/sustainability-pillar.html\\" target=\\"_blank\\">Sustainability Pillar of the AWS Well-Architected Framework</a>. AWS Well-Architected is a set of guiding design principles developed by AWS to help organizations build secure, high-performing, resilient, and efficient infrastructure for a variety of applications and workloads. Use the <a href=\\"https://aws.amazon.com/well-architected-tool/\\" target=\\"_blank\\">AWS Well-Architected Tool</a> to review your workloads periodically to address important design considerations and ensure that they follow the best practices and guidance of the AWS Well-Architected Framework. For follow-up questions or comments, join our growing community on <a href=\\"https://repost.aws/\\" target=\\"_blank\\">AWS re:Post.</a></p>\\n<h5><a id=\\"Benoit_de_Chateauvieux_95\\"></a><strong>Benoit de Chateauvieux</strong></h5>\\n<p><img src=\\"https://dev-media.amazoncloud.cn/38105f6a219946719892b9c914ff88ce_image.png\\" alt=\\"image.png\\" /></p>\n<p>Benoit de Chateauvieux is a Startup Solutions Architect at AWS, based in Montreal, Canada. As a former CTO, he enjoys helping startups build great and sustainable products using the cloud. 
Outside of work, you’ll find Benoit on canoe-camping expeditions, paddling across Canadian rivers.</p>\n<h5><a id=\\"Eddie_Pick_99\\"></a><strong>Eddie Pick</strong></h5>\\n<p><img src=\\"https://dev-media.amazoncloud.cn/0493b67e91874d038ecbcf568ecbece6_image.png\\" alt=\\"image.png\\" /></p>\n<p>Eddie Pick is a Senior Startup Solutions Architect at AWS, based in Montréal, Canada. As a former co-founder and CTO, his goal is to help startups build great products faster on the cloud, particularly using machine learning.</p>\n<h5><a id=\\"Dan_Ferguson_103\\"></a><strong>Dan Ferguson</strong></h5>\\n<p><img src=\\"https://dev-media.amazoncloud.cn/31e124387330489092b4d804e9f7f7eb_image.png\\" alt=\\"image.png\\" /></p>\n<p>Dan Ferguson is a Solutions Architect at AWS, based in New York, USA. As a machine learning services expert, Dan works to support customers on their journey to integrating ML workflows efficiently, effectively, and sustainably.</p>\n<h5><a id=\\"Brendan_Sisson_107\\"></a><strong>Brendan Sisson</strong></h5>\\n<p><img src=\\"https://dev-media.amazoncloud.cn/2ba779bde30a416a8c1c58043477d3d1_image.png\\" alt=\\"image.png\\" /></p>\n<p>Brendan Sisson is a Principal Sustainability Solutions Architect at AWS, based in London, UK. As a contributor to the Sustainability Pillar of the AWS Well-Architected Framework, he supports customers on how they can optimize their workloads running in the AWS Cloud and how they can use the AWS Cloud to help solve their wider sustainability challenges.</p>\n"}