Accelerating LLM Training on Kubernetes with Mountpoint for Amazon S3

Storage
Containers
Amazon Simple Storage Service (S3)
Amazon EC2
Amazon Elastic Kubernetes Service (EKS)
### **Abstract**

This article shows how to use the Mountpoint for [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) Container Storage Interface (CSI) driver to mount an Amazon S3 bucket into a Kubernetes container, so that the LLM training script in the container reads its training data directly from the bucket through the mounted directory. The walkthrough fine-tunes LLaMA 2 on an [Amazon EC2](https://aws.amazon.com/cn/ec2/?trk=cndc-detail) g5.2xlarge instance, so readers can reproduce it with modest GPU resources and become familiar with the process and configuration for fine-tuning a large model with the Mountpoint for Amazon S3 CSI driver in a Kubernetes environment. A performance comparison of the acceleration achieved with the Mountpoint for S3 CSI driver is outside the scope of this article.

### **Background**

On November 27, 2023, at re:Invent, Amazon Web Services announced the general availability of the Mountpoint for Amazon S3 CSI driver. This means customers can now mount Amazon S3 buckets into containers via CSI in [Amazon Elastic Kubernetes Service](https://aws.amazon.com/cn/eks/?trk=cndc-detail) (EKS) environments. With the new Amazon S3 CSI driver, your Kubernetes applications access S3 objects through a file-system interface and achieve high aggregate throughput without any application changes. Built on Mountpoint for Amazon S3, the CSI driver presents an S3 bucket as a volume accessible to containers in Amazon EKS and in self-managed Kubernetes clusters. Distributed [machine learning](https://aws.amazon.com/cn/machine-learning/?trk=cndc-detail) training jobs in those clusters can therefore read data from Amazon S3 at high throughput to shorten training time, and working with S3 objects becomes considerably simpler.

### **Mountpoint for S3 CSI driver architecture**

![image.png](https://dev-media.amazoncloud.cn/25febb4799c245329ede03969464ae1e_image.png "image.png")

### **Hands-on: fine-tuning LLaMA 2 with the Mountpoint for Amazon S3 CSI driver**

#### 1. Installing and configuring the Mountpoint for Amazon S3 CSI driver
First, install the Mountpoint for Amazon S3 CSI driver. Create the IAM permissions policy AmazonS3CSIDriverPolicy for the driver:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "MountpointFullBucketAccess",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::DOC-EXAMPLE-BUCKET1"]
    },
    {
      "Sid": "MountpointFullObjectAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:DeleteObject"
      ],
      "Resource": ["arn:aws:s3:::DOC-EXAMPLE-BUCKET1/*"]
    }
  ]
}
```

Create a role for the Mountpoint for Amazon S3 CSI driver (note that `--attach-policy-arn` takes the full ARN of the policy created above, not its name):

```bash
CLUSTER_NAME=my-cluster
REGION=region-code
ROLE_NAME=AmazonEKS_S3_CSI_DriverRole
# Replace 111122223333 with your AWS account ID
POLICY_ARN=arn:aws:iam::111122223333:policy/AmazonS3CSIDriverPolicy

eksctl create iamserviceaccount \
    --name s3-csi-driver-sa \
    --namespace kube-system \
    --cluster $CLUSTER_NAME \
    --attach-policy-arn $POLICY_ARN \
    --approve \
    --role-name $ROLE_NAME \
    --region $REGION
```

Install the driver with Kustomize on the already-created Amazon EKS cluster:

```bash
kubectl apply -k "github.com/awslabs/mountpoint-s3-csi-driver/deploy/kubernetes/overlays/stable/"
```

Once installation completes, verify that the driver is running:

```bash
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-mountpoint-s3-csi-driver
```

Mountpoint for S3 is now installed. When a container accesses the Mountpoint for S3 mount directory, the CSI driver translates the training script's POSIX requests into S3 requests; for example, when a sample file is read, the driver issues an S3 GetObject call to download the object into memory.
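Conceptually, a POSIX read at a given offset corresponds to a ranged S3 GetObject (the actual request strategy, prefetching, and part sizes are internal to Mountpoint). A minimal sketch of that offset-to-Range-header mapping; the helper name is ours, not part of Mountpoint:

```python
def range_header(offset: int, length: int) -> str:
    """HTTP Range header for a read(offset, length) served by a ranged
    S3 GetObject; byte ranges are inclusive on both ends."""
    return f"bytes={offset}-{offset + length - 1}"

# A 1 MiB read starting at the beginning of the object becomes one ranged GET:
print(range_header(0, 1024 * 1024))  # bytes=0-1048575
```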
#### 2. Statically provisioning the S3 bucket

On Amazon EKS, the Mountpoint for S3 CSI driver supports only static provisioning. That is, the bucket must be created in advance; it cannot be created dynamically through a PVC the way a file system such as Amazon Elastic File System can.

##### 2.1 Create the PV

First create the PV, configuring the S3 bucket to read and its region in the PV YAML file:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: s3-pv
spec:
  capacity:
    storage: 1200Gi # ignored, required
  accessModes:
    - ReadWriteMany # supported options: ReadWriteMany / ReadOnlyMany
  mountOptions:
    - allow-delete
    - region us-west-2
  csi:
    driver: s3.csi.aws.com # required
    volumeHandle: s3-csi-driver-volume
    volumeAttributes:
      bucketName: YOUR-BUCKET-NAME
```

##### 2.2 Create the PVC

Configure a PVC bound to the PV s3-pv created in 2.1:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: s3-claim
spec:
  accessModes:
    - ReadWriteMany # supported options: ReadWriteMany / ReadOnlyMany
  storageClassName: "" # required for static provisioning
  resources:
    requests:
      storage: 1200Gi # ignored, required
  volumeName: s3-pv
```

#### 3. Prepare the training data and upload it to the S3 bucket

This article uses the [Hugging Face timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco?trk=cndc-detail) dataset, about 16 MB in size, as the training sample. Upload the dataset to the S3 bucket:

```bash
aws s3 cp examples/scripts/datasets/openassistant-guanaco-train.arrow s3://YOUR-BUCKET-NAME
```
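Once the bucket is mounted, the uploaded dataset is just a file under the mount path from the pod's point of view, which is why the training code needs no S3-specific changes. A minimal sketch of that container-side view; a temporary directory stands in for the /mount-s3 CSI mount, and 16 zero bytes stand in for the dataset, so the sketch runs anywhere:

```python
import os
import tempfile

# Stand-in for the CSI mount path the pod sees (e.g. /mount-s3).
MOUNT_PATH = tempfile.mkdtemp()

# Stand-in for the object uploaded in step 3; Mountpoint would expose it
# as a regular file under the mount path.
sample = os.path.join(MOUNT_PATH, "openassistant-guanaco-train.arrow")
with open(sample, "wb") as f:
    f.write(b"\x00" * 16)

# From the training script's point of view this is an ordinary local read;
# Mountpoint translates it into S3 GetObject requests behind the scenes.
with open(sample, "rb") as f:
    data = f.read()
print(len(data))  # 16
```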
#### 4. Prepare the training container

This article fine-tunes LLaMA 2 with PEFT.

##### 4.1 Prepare the Dockerfile

```dockerfile
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.0.1-gpu-py310-cu118-ubuntu20.04-ec2 as base

WORKDIR /llama2
RUN pip install trl
RUN git clone https://github.com/lvwerra/trl

WORKDIR /llama2/trl/
RUN pip install -r requirements.txt
# Point the example script at the dataset on the S3 mount
RUN sed -i 's|dataset = load_dataset(script_args.dataset_name, split="train")|dataset = load_dataset("arrow", data_files="/mount-s3/openassistant-guanaco-train.arrow",split="train")|' examples/scripts/sft.py
```

##### 4.2 Log in to the DLC image registry

```bash
aws ecr get-login-password --region us-east-1 | sudo docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
```

##### 4.3 Build the image from the prepared Dockerfile

```bash
sudo docker build -t public.ecr.aws/h6r2a7o6/emr:llama2_finetune-v4 .
```

##### 4.4 Log in to your own image registry

```bash
aws ecr-public get-login-password --region us-east-1 | sudo docker login --username AWS --password-stdin public.ecr.aws/h6r2a7o6
```

##### 4.5 Push the Docker image

```bash
sudo docker push public.ecr.aws/h6r2a7o6/emr:llama2_finetune-v4
```

##### 4.6 Prepare the pod YAML file llama2_finetuning.yaml

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llama2-pod
spec:
  containers:
    - name: app
      image: public.ecr.aws/h6r2a7o6/emr:llama2_finetune-v4
      command: ["/bin/sh"]
      args: ["-c", "python3 examples/scripts/sft.py --model_name meta-llama/Llama-2-7b-hf --dataset_name timdettmers/openassistant-guanaco --load_in_4bit --use_peft --batch_size 1 --gradient_accumulation_steps 2"]
      volumeMounts:
        - name: persistent-storage
          mountPath: /mount-s3
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: s3-claim
```

##### 4.7 Start the pod and check that it is running

```bash
kubectl apply -f llama2_finetuning.yaml; kubectl get pods
```

##### 4.8 Follow training progress through the container logs

```bash
kubectl logs -f llama2-pod
```

```
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]
/opt/conda/lib/python3.10/site-packages/trl/trainer/ppo_config.py:141: UserWarning: The `optimize_cuda_cache` arguement will be deprecated soon, please use `optimize_device_cache` instead.
  warnings.warn(
Detected kernel version 5.4.253, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:472: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
Loading checkpoint shards: 100%|██████████| 2/2 [00:07<00:00,  3.94s/it]
/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py:374: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 15650.39it/s]
Extracting data files: 100%|██████████| 1/1 [00:01<00:00,  1.06s/it]
Generating train split: 9846 examples [00:00, 19719.37 examples/s]
Map: 100%|██████████| 9846/9846 [00:02<00:00, 4094.10 examples/s]
Detected kernel version 5.4.253, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
  0%|          | 0/14769 [00:00<?, ?it/s]You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
{'loss': 2.1358, 'learning_rate': 1.4099045297582776e-05, 'epoch': 0.0}
{'loss': 1.9417, 'learning_rate': 1.409809059516555e-05, 'epoch': 0.0}
{'loss': 1.2295, 'learning_rate': 1.4097135892748325e-05, 'epoch': 0.0}
{'loss': 2.2918, 'learning_rate': 1.4096181190331099e-05, 'epoch': 0.0}
```

The logs above show the LLaMA 2 training job starting and running normally.

In general, when fine-tuning LLaMA 2 on your own data, the dataset ranges from tens of MB to a few GB, so data access is rarely the bottleneck of model training. But the Amazon S3 CSI driver lets the training script read data in an S3 bucket the same way it reads files in a local file system, which greatly simplifies S3 access: the code needs no S3-specific changes at all, reducing the complexity of the training code and improving its portability.

### **Summary**

This article showed how to mount an S3 bucket into a Kubernetes container with the Mountpoint for Amazon S3 CSI driver, so that an LLM training script inside the container can access S3 objects through a file-system interface with high throughput and without application changes. The hands-on section walked through installing and configuring the Mountpoint for Amazon S3 CSI driver, statically provisioning the S3 bucket, preparing and uploading the training data, and preparing the training container. With these steps, readers can quickly set up the infrastructure for training large models in an Amazon EKS environment and speed up data access to improve training efficiency.

### **References**

- [Installing the Mountpoint for Amazon S3 CSI driver](https://github.com/awslabs/mountpoint-s3-csi-driver/blob/main/docs/install.md#deploy-driver)
- [Official Hugging Face blog on LLaMA 2](https://huggingface.co/blog/llama2)