### **Abstract**
This article shows how to use the Mountpoint for [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) Container Storage Interface (CSI) driver to mount an [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) bucket into a Kubernetes container, so that the LLM training script inside the container can train directly on data in the S3 bucket by reading from the Mountpoint for S3 mount directory.
The LLaMA 2 fine-tuning in this article is run on an [EC2](https://aws.amazon.com/cn/ec2/?trk=cndc-detail) g5.2xlarge instance, so readers can reproduce the walkthrough with modest GPU resources and become familiar with the process and configuration for fine-tuning a large model with the Mountpoint for [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) CSI driver in a Kubernetes environment. A comparison of the performance acceleration achievable through the Mountpoint for S3 CSI driver is out of scope for this article.
### **Background**
On November 27, 2023, at re:Invent, Amazon Web Services announced the general availability of the Mountpoint for [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) CSI driver, which means customers can now mount [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) buckets into containers through CSI in [Amazon Elastic Kubernetes Service](https://aws.amazon.com/cn/eks/?trk=cndc-detail) (EKS). With Mountpoint exposed through the new [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) CSI driver, your Kubernetes applications can access S3 objects through a file system interface and achieve high aggregate throughput without any application changes.
Built on Mountpoint for [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail), the CSI driver presents an S3 bucket as a volume that containers in [Amazon EKS](https://aws.amazon.com/cn/eks/?trk=cndc-detail) and self-managed Kubernetes clusters can access. Distributed [machine learning](https://aws.amazon.com/cn/machine-learning/?trk=cndc-detail) training jobs in these clusters can therefore read data from [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) at high throughput and shorten training time, and it also greatly reduces the effort customers spend on working with S3 objects.
### **Mountpoint for S3 CSI driver architecture**
![image.png](https://dev-media.amazoncloud.cn/25febb4799c245329ede03969464ae1e_image.png "image.png")
### **Hands-on: fine-tuning LLaMA 2 with the Mountpoint for Amazon S3 CSI driver**
#### 1. Install and configure the Mountpoint for Amazon S3 CSI driver
First, install the Mountpoint for [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) CSI driver. Begin by configuring the IAM permission policy AmazonS3CSIDriverPolicy for the driver, as follows:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "MountpointFullBucketAccess",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::DOC-EXAMPLE-BUCKET1"
      ]
    },
    {
      "Sid": "MountpointFullObjectAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::DOC-EXAMPLE-BUCKET1/*"
      ]
    }
  ]
}
```
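One way to create this policy is with the AWS CLI; a minimal example, assuming the JSON above is saved as `s3-csi-policy.json` (the file name is illustrative):
```bash
aws iam create-policy \
    --policy-name AmazonS3CSIDriverPolicy \
    --policy-document file://s3-csi-policy.json
```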
Create an IAM role for the Mountpoint for [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) CSI driver:
```bash
CLUSTER_NAME=my-cluster
REGION=region-code
ROLE_NAME=AmazonEKS_S3_CSI_DriverRole
# ARN of the AmazonS3CSIDriverPolicy created above (replace with your account ID)
POLICY_ARN=arn:aws:iam::111122223333:policy/AmazonS3CSIDriverPolicy
eksctl create iamserviceaccount \
    --name s3-csi-driver-sa \
    --namespace kube-system \
    --cluster $CLUSTER_NAME \
    --attach-policy-arn $POLICY_ARN \
    --approve \
    --role-name $ROLE_NAME \
    --region $REGION
```
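To confirm that the service account was created and annotated with the IAM role, you can check it with eksctl or kubectl, for example:
```bash
eksctl get iamserviceaccount --cluster $CLUSTER_NAME --namespace kube-system
kubectl describe sa s3-csi-driver-sa -n kube-system
```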
Install the driver with Kustomize on the existing [Amazon EKS](https://aws.amazon.com/cn/eks/?trk=cndc-detail) cluster:
```bash
kubectl apply -k "github.com/awslabs/mountpoint-s3-csi-driver/deploy/kubernetes/overlays/stable/"
```
Once the installation completes, verify that the driver is running:
```bash
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-mountpoint-s3-csi-driver
```
Mountpoint for S3 is now installed. When a container accesses the Mountpoint for S3 mount directory, the Mountpoint for S3 CSI driver translates the POSIX calls issued by the training script into S3 requests; for example, when the script reads a sample file, Mountpoint issues S3 GET requests to download the object into memory.
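To illustrate, the patched training script in section 4.1 reads the dataset from the mount path exactly as if it were a local file; a minimal sketch, assuming the Pod mounts the bucket at `/mount-s3` as configured later:
```python
from datasets import load_dataset

# Reads /mount-s3/openassistant-guanaco-train.arrow like a local file;
# Mountpoint for S3 turns these POSIX reads into S3 GET requests behind the scenes.
dataset = load_dataset(
    "arrow",
    data_files="/mount-s3/openassistant-guanaco-train.arrow",
    split="train",
)
print(dataset)
```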
#### 2. Statically provision the S3 bucket
On [Amazon EKS](https://aws.amazon.com/cn/eks/?trk=cndc-detail), the Mountpoint for S3 CSI driver only supports static provisioning, which means the bucket must be created in advance; it cannot be created dynamically through a PVC the way a file system such as Elastic File System can.
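If the bucket does not exist yet, create it first; a minimal example with the AWS CLI (bucket name and region are placeholders):
```bash
aws s3 mb s3://YOUR-BUCKET-NAME --region us-west-2
```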
##### 2.1 Create the PV
First, create the PV. Set the S3 bucket to read and its region in the PV manifest:
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: s3-pv
spec:
  capacity:
    storage: 1200Gi # ignored, required
  accessModes:
    - ReadWriteMany # supported options: ReadWriteMany / ReadOnlyMany
  mountOptions:
    - allow-delete
    - region us-west-2
  csi:
    driver: s3.csi.aws.com # required
    volumeHandle: s3-csi-driver-volume
    volumeAttributes:
      bucketName: YOUR-BUCKET-NAME
```
##### 2.2 Create the PVC
Configure a PVC that binds to the PV s3-pv created in 2.1:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: s3-claim
spec:
  accessModes:
    - ReadWriteMany # supported options: ReadWriteMany / ReadOnlyMany
  storageClassName: "" # required for static provisioning
  resources:
    requests:
      storage: 1200Gi # ignored, required
  volumeName: s3-pv
```
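Apply both manifests and confirm the claim binds to the volume; assuming they are saved as `s3-pv.yaml` and `s3-pvc.yaml` (file names are illustrative):
```bash
kubectl apply -f s3-pv.yaml -f s3-pvc.yaml
kubectl get pv s3-pv
kubectl get pvc s3-claim
```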
#### 3. Prepare the training data and upload it to the S3 bucket
This article uses [Huggingface timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco?trk=cndc-detail) as the training dataset, about 16 MB in size. Upload the dataset to the S3 bucket.
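If you need to produce the Arrow file yourself before uploading, one possible approach is to download the dataset from the Hugging Face Hub and export it with `save_to_disk`; this is only a sketch, and the exact shard file name may vary with the `datasets` version:
```python
from datasets import load_dataset

# Pull the training split from the Hugging Face Hub
ds = load_dataset("timdettmers/openassistant-guanaco", split="train")

# save_to_disk writes Arrow shard files (e.g. data-00000-of-00001.arrow) under
# the output directory; rename/upload the shard as openassistant-guanaco-train.arrow
ds.save_to_disk("openassistant-guanaco-train")
```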
```bash
aws s3 cp examples/scripts/datasets/openassistant-guanaco-train.arrow s3://YOUR-BUCKET-NAME
```
#### 4. Prepare the training container
This article fine-tunes LLaMA 2 with PEFT (parameter-efficient fine-tuning).
##### 4.1 Prepare the Dockerfile
```dockerfile
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.0.1-gpu-py310-cu118-ubuntu20.04-ec2 as base
WORKDIR /llama2
RUN pip install trl
RUN git clone https://github.com/lvwerra/trl
WORKDIR /llama2/trl/
RUN pip install -r requirements.txt
# Patch sft.py so it loads the dataset from the S3 mount path instead of the Hugging Face Hub
RUN sed -i 's|dataset = load_dataset(script_args.dataset_name, split="train")|dataset = load_dataset("arrow", data_files="/mount-s3/openassistant-guanaco-train.arrow",split="train")|' examples/scripts/sft.py
```
##### 4.2 Log in to the DLC image registry
```bash
aws ecr get-login-password --region us-east-1 | sudo docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
```
##### 4.3 Build the image from the Dockerfile
```bash
sudo docker build -t public.ecr.aws/h6r2a7o6/emr:llama2_finetune-v4 .
```
##### 4.4 Log in to your own image registry
```bash
aws ecr-public get-login-password --region us-east-1 | sudo docker login --username AWS --password-stdin public.ecr.aws/h6r2a7o6
```
##### 4.5 Push the Docker image
```bash
sudo docker push public.ecr.aws/h6r2a7o6/emr:llama2_finetune-v4
```
##### 4.6 Prepare the Pod manifest llama2_finetuning.yaml
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llama2-pod
spec:
  containers:
    - name: app
      image: public.ecr.aws/h6r2a7o6/emr:llama2_finetune-v4
      command: ["/bin/sh"]
      args: ["-c", "python3 examples/scripts/sft.py --model_name meta-llama/Llama-2-7b-hf --dataset_name timdettmers/openassistant-guanaco --load_in_4bit --use_peft --batch_size 1 --gradient_accumulation_steps 2"]
      volumeMounts:
        - name: persistent-storage
          mountPath: /mount-s3
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: s3-claim
```
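The meta-llama/Llama-2-7b-hf weights are gated on the Hugging Face Hub, so the Pod usually needs a Hub token to download the model. One possible way, sketched below with an illustrative Secret name, is to store the token in a Kubernetes Secret and expose it through an environment variable that huggingface_hub recognizes:
```yaml
# Hypothetical addition under spec.containers[0] in llama2_finetuning.yaml;
# assumes the Secret was created beforehand, e.g.:
#   kubectl create secret generic hf-token --from-literal=token=YOUR_HF_TOKEN
env:
  - name: HUGGING_FACE_HUB_TOKEN # read by huggingface_hub / transformers
    valueFrom:
      secretKeyRef:
        name: hf-token
        key: token
```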
##### 4.7 Start the Pod and check that it is running
```bash
kubectl apply -f llama2_finetuning.yaml; kubectl get pods
```
##### 4.8 Monitor training progress through the container logs
```
kubectl logs -f llama2-pod
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]
/opt/conda/lib/python3.10/site-packages/trl/trainer/ppo_config.py:141: UserWarning: The `optimize_cuda_cache` arguement will be deprecated soon, please use `optimize_device_cache` instead.
warnings.warn(
Detected kernel version 5.4.253, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:472: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
warnings.warn(
Loading checkpoint shards: 100%|██████████| 2/2 [00:07<00:00, 3.94s/it]
/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py:374: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
warnings.warn(
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 15650.39it/s]
Extracting data files: 100%|██████████| 1/1 [00:01<00:00, 1.06s/it]
Generating train split: 9846 examples [00:00, 19719.37 examples/s]
Map: 100%|██████████| 9846/9846 [00:02<00:00, 4094.10 examples/s]
Detected kernel version 5.4.253, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
0%| | 0/14769 [00:00<?, ?it/s]You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
{'loss': 2.1358, 'learning_rate': 1.4099045297582776e-05, 'epoch': 0.0}
{'loss': 1.9417, 'learning_rate': 1.409809059516555e-05, 'epoch': 0.0}
{'loss': 1.2295, 'learning_rate': 1.4097135892748325e-05, 'epoch': 0.0}
{'loss': 2.2918, 'learning_rate': 1.4096181190331099e-05, 'epoch': 0.0}
```
The logs above show that the LLaMA 2 training job has started and is running normally.
When fine-tuning LLaMA 2 on your own data, the dataset typically ranges from tens of MB to a few GB, so data access is rarely the bottleneck of training. What the [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) CSI driver provides is the ability for the training script to read data in the S3 bucket as if it were reading files on a local file system, which greatly simplifies S3 access: the code needs no S3-specific changes, reducing the complexity of the training code and improving its portability.
### **Summary**
This article showed how to use the Mountpoint for [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) CSI driver to mount an S3 bucket into a Kubernetes container, so that the LLM training script in the container can access S3 objects through a file system interface at high throughput without modifying the application. The hands-on section walked through installing and configuring the Mountpoint for [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) CSI driver, statically provisioning the S3 bucket, preparing and uploading the training data, and building the training container. With these steps, readers can quickly stand up the infrastructure for training large models in an [Amazon EKS](https://aws.amazon.com/cn/eks/?trk=cndc-detail) environment and speed up data access to improve training efficiency.
### **References**
- [Mountpoint for Amazon S3 CSI driver installation](https://github.com/awslabs/mountpoint-s3-csi-driver/blob/main/docs/install.md#deploy-driver?trk=cndc-detail)
- [Official LLaMA 2 blog on Hugging Face](https://huggingface.co/blog/llama2?trk=cndc-detail)