Amazon SageMaker adds new inference capabilities to help reduce foundation model deployment costs and latency

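Today, we are announcing new inference capabilities in [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) that can help you optimize deployment costs and reduce latency. With the new inference capabilities, you can deploy one or more foundation models (FMs) on the same SageMaker endpoint and control how many accelerators and how much memory are reserved for each FM. This helps improve resource utilization, reduces model deployment costs by 50% on average, and lets you scale endpoints together with your use cases. You can define a separate scaling policy for each FM to adapt to model usage patterns and further optimize infrastructure costs. In addition, SageMaker actively monitors the instances that are processing inference requests and intelligently routes requests based on which instances are available, reducing inference latency by 20% on average.

##### Key components

The new inference capabilities build on SageMaker real-time inference endpoints. As before, you create the SageMaker endpoint with an endpoint configuration that defines the instance type and initial instance count for the endpoint. The model itself is configured in a new construct, an inference component. Here, you specify the number of accelerators and the amount of memory to allocate to each copy of a model, together with the model artifacts, the container image, and the number of model copies to deploy.

![Screenshot 2023-12-26 202649.png](https://dev-media.amazoncloud.cn/2b28e510ba5e4a00a316b98060f6d09f_%E5%B1%8F%E5%B9%95%E6%88%AA%E5%9B%BE%202023-12-26%20202649.png "Screenshot 2023-12-26 202649.png")

Let me show you how this works.

##### New inference capabilities in action

You can start using the new inference capabilities from [SageMaker Studio](https://aws.amazon.com/cn/sagemaker/studio/?trk=cndc-detail), the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/?trk=cndc-detail), [Amazon SDKs](https://aws.amazon.com/cn/developer/tools/?trk=cndc-detail), and the [Amazon Command Line Interface (Amazon CLI)](https://aws.amazon.com/cn/cli/?trk=cndc-detail). They are also supported by Amazon CloudFormation. For this demo, I use the [Amazon SDK for Python (Boto3)](https://aws.amazon.com/cn/sdk-for-python/?trk=cndc-detail) to deploy a copy of the [Dolly v2 7B](https://huggingface.co/databricks/dolly-v2-7b?trk=cndc-detail) model and a copy of the [FLAN-T5 XXL](https://huggingface.co/google/flan-t5-xxl?trk=cndc-detail) model from the [Hugging Face](https://huggingface.co/models?trk=cndc-detail) model hub on a SageMaker real-time endpoint using the new inference capabilities.

**Create a SageMaker endpoint configuration**

```Python
import boto3
import sagemaker

role = sagemaker.get_execution_role()
sm_client = boto3.client(service_name="sagemaker")

# Placeholder names (not given in the original post); choose your own.
endpoint_config_name = "my-endpoint-config"
endpoint_name = "my-endpoint"
variant_name = "AllTraffic"

# The endpoint configuration defines only the instances and the routing
# strategy; the models are attached later as inference components.
sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ExecutionRoleArn=role,
    ProductionVariants=[{
        "VariantName": variant_name,
        "InstanceType": "ml.g5.12xlarge",
        "InitialInstanceCount": 1,
        "RoutingConfig": {
            "RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"
        }
    }]
)
```

**Create the SageMaker endpoint**

```Python
sm_client.create_endpoint(
    EndpointName=endpoint_name,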
    EndpointConfigName=endpoint_config_name,
)
```

Before you can create an inference component, you need to create a SageMaker-compatible model and specify the container image to use. For both models, I use the [Hugging Face LLM Inference Container](https://huggingface.co/blog/sagemaker-huggingface-llm?trk=cndc-detail) for [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail). These deep learning containers (DLCs) include the components, libraries, and drivers needed to host large models on SageMaker.

**Prepare the Dolly v2 model**

```Python
from sagemaker.huggingface import get_huggingface_llm_image_uri

# Retrieve the container image URI
hf_inference_dlc = get_huggingface_llm_image_uri(
    "huggingface",
    version="0.9.3"
)

# Configure model container
dolly7b = {
    'Image': hf_inference_dlc,
    'Environment': {
        'HF_MODEL_ID': 'databricks/dolly-v2-7b',
        'HF_TASK': 'text-generation',
    }
}

# Create SageMaker Model
sm_client.create_model(
    ModelName="dolly-v2-7b",
    ExecutionRoleArn=role,
    Containers=[dolly7b]
)
```

**Prepare the FLAN-T5 XXL model**

```Python
# Configure model container
flant5xxlmodel = {
    'Image': hf_inference_dlc,
    'Environment': {
        'HF_MODEL_ID': 'google/flan-t5-xxl',
        'HF_TASK': 'text-generation',
    }
}

# Create SageMaker Model
sm_client.create_model(
    ModelName="flan-t5-xxl",
    ExecutionRoleArn=role,
    Containers=[flant5xxlmodel]
)
```

Now you are ready to create the inference components.

**Create an inference component for each model**

Specify one inference component for each model that you want to deploy on the endpoint. An inference component lets you specify the SageMaker-compatible model and the compute and memory resources to allocate to it. For CPU workloads, define the number of cores to allocate. For accelerator workloads, define the number of accelerators. `RuntimeConfig` defines the number of model copies to deploy.

```Python
# Inference component for Dolly v2 7B
sm_client.create_inference_component(
    InferenceComponentName="IC-dolly-v2-7b",
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": "dolly-v2-7b",
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 2,
            "NumberOfCpuCoresRequired": 2,
            "MinMemoryRequiredInMb": 1024
        }
    },
    RuntimeConfig={"CopyCount": 1},
)

# Inference component for FLAN-T5 XXL
sm_client.create_inference_component(
    InferenceComponentName="IC-flan-t5-xxl",
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": "flan-t5-xxl",
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 2,
            "NumberOfCpuCoresRequired": 1,
            "MinMemoryRequiredInMb": 1024
        }
    },
    RuntimeConfig={"CopyCount": 1},
)
```

Once the inference components have been deployed successfully, you can invoke the models.

**Run inference**

To invoke a model on the endpoint, specify the corresponding inference component.

```Python
import json

sm_runtime_client = boto3.client(service_name="sagemaker-runtime")
payload = {"inputs": "Why is California a great place to live?"}

# InferenceComponentName selects which model on the shared endpoint
# handles the request.
response_dolly = sm_runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName="IC-dolly-v2-7b",
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps(payload),
)

response_flant5 = sm_runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName="IC-flan-t5-xxl",
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps(payload),
)

result_dolly = json.loads(response_dolly['Body'].read().decode())
result_flant5 = json.loads(response_flant5['Body'].read().decode())
```

Next, you can define a separate scaling policy for each model by registering a scaling target and applying a scaling policy to the inference component. See the [SageMaker Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-deploy-models.html?trk=cndc-detail) for detailed instructions.

The new inference capabilities provide per-model CloudWatch metrics and CloudWatch Logs, and they can be used with any SageMaker-compatible container image on SageMaker CPU-based and GPU-based compute instances. If the container image supports it, you can also use response streaming.

##### Now available

The new [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) inference capabilities are available today in the following Amazon Web Services Regions: US East (Ohio, N. Virginia), US West (Oregon), Asia Pacific (Jakarta, Mumbai, Seoul, Singapore, Sydney, Tokyo), Canada (Central), Europe (Frankfurt, Ireland, London, Stockholm), Middle East (UAE), and South America (São Paulo). For pricing details, visit [Amazon SageMaker Pricing](https://aws.amazon.com/cn/sagemaker/pricing/?trk=cndc-detail). To learn more, visit [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail).

##### Get started

Log in to the [Amazon Management Console](https://console.aws.amazon.com/sagemaker/home?trk=cndc-detail) and deploy your FMs using the new SageMaker inference capabilities today!

Source: https://aws.amazon.com/blogs/aws/amazon-sagemaker-adds-new-inference-capabilities-to-help-reduce-foundation-model-deployment-costs-and-latency?trk=cndc-detail/

[Antje](https://www.linkedin.cn/incareer/in/antje-barth/?trk=cndc-detail)