Near-linear scaling of gigantic-model training on Amazon Web Services

分布式

海外精选

亚马逊云科技

海外精选的内容汇集了全球优质的亚马逊云科技相关技术内容。同时，内容中提到的“AWS” 是 “Amazon Web Services” 的缩写，在此网站不作为商标展示。

{"value":"State-of-the-art language models have billions of parameters. Training these models within a manageable time requires distributing the workload across a large computing cluster. Ideally, training time would decrease linearly as the cluster size scales up. However, linear scaling is difficult to achieve because the communication required to coordinate the work of the cluster nodes eats into the gains from parallelization.\n\nRecently, we put some effort into ++[optimizing the communication efficiency](https://www.amazon.science/blog/making-deepspeed-zero-run-efficiently-on-more-affordable-hardware)++ of Microsoft’s DeepSpeed distributed-training library, dramatically improving performance for up to 64 GPUs. However, when we scale from tens of GPUs to hundreds, in the public cloud environment, communication overhead again begins to overwhelm efficiency gains.\n\nIn a paper that we'll present in 2023 at the International Conference on Very Large Data Bases (++[VLDB](https://vldb.org/2023/)++), we propose a method to make model training scale efficiently on hundreds of GPUs in the cloud. We call this method MiCS, because it **mi**nimizes **c**ommunication **s**cale to bring down communication overhead.\n\nSpecifically, where existing distributed-training frameworks such as DeepSpeed and FairScale divide a model state across all GPUs, MiCS makes multiple replicas of the model state and partitions each replica within a subset of GPUs. Depending on the model size, a replica may fit on a single computing node — a single machine with high-speed connections between its GPUs — or on multiple nodes.\n\nThus, in MiCS, frequent communication operations, like parameter gathering, are restricted to a subset of GPUs. In this way, when we scale a cluster up — by adding new replicas across new nodes — the communication latency of frequent communication operations remains fixed, rather than growing with the size of the cluster.\n\nWe also reduce the data volume transmitted between nodes in the event that a copy of the model state won’t fit in a single node. Lastly, MiCS includes a gradient synchronization schedule that amortizes expensive gradient synchronization among all workers.\n\nOur experimental results show significant improvement in throughput and scaling efficiency on different-sized BERT models evaluated on clusters consisting of p3dn.24xlarge instances. MiCS is able to achieve near-linear scalability (denoted by the rectangular frames in the figure below) and provides up to 2.82-fold throughput compared to the second and third states of the three-stage zero-redundancy optimizer, or ZeRO, the communication management method built into DeepSpeed-v0.5.6 .\n\nWe have also compared MiCS with our earlier optimizations of ZeRO’s third stage (see figure below), demonstrating improvements even at the lower GPU counts that we investigated previously. We report all these findings in greater detail in a preprint paper on the arXiv.\n\n![下载.jpg](https://dev-media.amazoncloud.cn/ef8dc44a52d54edbb15b8bf8a839dd86_%E4%B8%8B%E8%BD%BD.jpg)\n\nA comparison of MiCS and our earlier optimizations of DeepSpeed Zero’s third stage.\n\n++[Amazon Web Services P4d](https://aws.amazon.com/ec2/instance-types/p4/)++ provides up to 400Gbps networking bandwidth for high-performance computing. Unfortunately, the distributed system may not be able to fully utilize 400Gbps efficiently because of communication overhead — especially latency, which increases when adding more GPUs to the cluster.\n\nWe have deployed MiCS to train proprietary models with up to 175 billion parameters on p4d.24xlarge (40GB A100) and p4de.24xlarge (80GB A100) instances. When training a 175-billion-parameter model with a sequence length of 2,048 on 16 p4de.24xlarge instances, we are able to achieve 169-teraflops (54.2% of the theoretical peak) performance on each GPU. When we train a 100-billion-parameter model on 64 p4d.24xlarge instances (512 A100 GPUs), MiCS maintains over 170 teraflops per GPU (54.5% of the theoretical peak).\n\nWhen the size of the cluster is scaled from 128 GPUs to 512 GPUs, MiCS achieves 99.4% of the linear-scaling efficiency (as measured by the “weak scaling” metric). In contrast, DeepSpeed ZeRO’s third stage achieves only 72% weak-scaling efficiency and saturates at 62 teraflops per GPU (19.9% of the theoretical peak).\n\n#### **Scale-aware model partitioning**\n\nBy default, DeepSpeed partitions model states across all devices, a strategy that lowers the memory consumption on each GPU in the cluster but incurs large communication overhead in training. More importantly, the overhead scales up with the size of the cluster, which causes the scalability to drop significantly at large scale.\n\nInstead of partitioning model states to all GPUs, MiCS divides GPUs in the cluster into multiple groups and partitions model states within each group. We call these groups partition groups. Each group holds a complete replica of model states. The following figure gives an example of partition groups consisting of two consecutive GPUs. Those GPUs holding the same part of the model state form another kind of group, a replication group.\n\n![下载 1.jpg](https://dev-media.amazoncloud.cn/4f7d7fedf325453fb52c6ef06cd2829e_%E4%B8%8B%E8%BD%BD%20%281%29.jpg)\n\nThe relationship between partition groups and replication groups in MiCS.\n\nPartitioning model states within each partition group restricts the most frequent communications, parameter gathering and gradient synchronization, within a fixed number of GPUs. This strategy effectively controls the communication overhead and does not let it grow with the size of the cluster.\n\n#### **Hierarchical communication strategy**\n\nWhen the memory requirement for a single replica of the model state is larger than the total amount of GPU memory in a single node, we need to store the replica on GPUs spanning multiple nodes. In that case, we have to rely on less-efficient internode communication.\n\nThe volume of transmitted data and the latency in a collective communication are determined by the message size and the number of participants. Particularly, the communication volume is proportional to (p - 1)/p, where p denotes the number of participants, and if the participants use the standard ring-shaped communication pattern, the latency has a linear dependency on the number of participants.\n\nThe message size cannot be reduced without compromising data integrity, but we can reduce the number of participants in internode communications. This lowers the communication volume factor to (p - k)/p and latency by p/(p/k + k) times, where k is the number of GPUs on a single node.\n\nConsider the simple example below, involving two nodes with two GPUs each. The standard ring-shaped communication pattern would aggregate data across nodes (left) by passing messages from each GPU to the next, so a single internode communication involves four GPUs.\n\n![下载 2.jpg](https://dev-media.amazoncloud.cn/d44ed3b05fe14a4b9cab684011b7fb8f_%E4%B8%8B%E8%BD%BD%20%282%29.jpg)\n\nMiCS reduces the number of GPUs that participate in any given internode communication.\n\nMiCS, by contrast, executes these internode operations in parallel, so each internode communication involves only two GPUs (right), which exchange only half the information that we want to communicate. Each node then aggregates the internode data locally to assemble the full message. In this case, the communication volume factor is reduced from ¾ ((4-1)/4) to ½ ((4-2/4).\n\n#### **Two-hop gradient synchronization**\n\nSynchronizing gradients among all workers is an expensive operation, required to keep workers working on the same model states. During the training of large neural nets, batch size is typically limited by GPU memory. Gradient accumulation is a technique that splits a batch of samples into several microbatches that will be run sequentially in multiple microsteps.\n\nWith MiCS, we can accumulate gradients inside each partition group in multiple microbatches until the last microbatch is processed. That is, for each microstep, we can accumulate the full set of gradients for each model replica inside a subset of GPUs (i.e., a partition group). Then, after the last microbatch is handled, each GPU synchronizes gradients with the other GPUs representing the same part of the model state.\n\nThis allows us to amortize the synchronization overhead across replication groups to multiple microsteps. The following figure gives an example of two-hop gradient synchronization for training with four microsteps.\n\n![下载 3.jpg](https://dev-media.amazoncloud.cn/d72ae7dfd86a412097497966967df368_%E4%B8%8B%E8%BD%BD%20%283%29.jpg)\n\nTwo-hop gradient synchronization.\n\nBecause of these three techniques, MiCS shows great scalability on large clusters and delivers excellent training throughput performance, and it enables us to achieve a new state-of-the-art performance on Amazon Web Services p4de.24xlarge machines.\n\nWe are working to open-source MiCS for public use, in the belief that it will greatly reduce the time and cost of large-model training on the Amazon EC2 platform. Please refer to our ++[preprint](https://arxiv.org/abs/2205.00119)++ for a more detailed explanation of our system and analysis of its performance.\n\nAcknowledgements: ++[Yida Wang](https://www.amazon.science/author/yida-wang)++, ++[Justin Chiu](https://www.amazon.science/author/justin-chiu)++, Roshan Makhijani, RJ, ++[Stephen Rawls](https://www.amazon.science/author/stephen-rawls)++, ++[Xin Jin](https://www.amazon.science/research-awards/recipients/xin-jin)++\n\nABOUT THE AUTHOR\n\n#### **Zhen Zhang**\nZhen Zhang is a PhD student in computer science at Johns Hopkins University. He was an applied-science intern at Amazon when the work was done.\n\n#### **[Shuai Zheng](https://www.amazon.science/author/shuai-zheng)**\nShuai Zheng is an applied scientist with Amazon Web Services.","render":"State-of-the-art language models have billions of parameters. Training these models within a manageable time requires distributing the workload across a large computing cluster. Ideally, training time would decrease linearly as the cluster size scales up. However, linear scaling is difficult to achieve because the communication required to coordinate the work of the cluster nodes eats into the gains from parallelization.\nRecently, we put some effort into <ins><a href=\"https://www.amazon.science/blog/making-deepspeed-zero-run-efficiently-on-more-affordable-hardware\" target=\"_blank\">optimizing the communication efficiency</a></ins> of Microsoft’s DeepSpeed distributed-training library, dramatically improving performance for up to 64 GPUs. However, when we scale from tens of GPUs to hundreds, in the public cloud environment, communication overhead again begins to overwhelm efficiency gains.\nIn a paper that we’ll present in 2023 at the International Conference on Very Large Data Bases (<ins><a href=\"https://vldb.org/2023/\" target=\"_blank\">VLDB</a></ins>), we propose a method to make model training scale efficiently on hundreds of GPUs in the cloud. We call this method MiCS, because it minimizes communication scale to bring down communication overhead.\nSpecifically, where existing distributed-training frameworks such as DeepSpeed and FairScale divide a model state across all GPUs, MiCS makes multiple replicas of the model state and partitions each replica within a subset of GPUs. Depending on the model size, a replica may fit on a single computing node — a single machine with high-speed connections between its GPUs — or on multiple nodes.\nThus, in MiCS, frequent communication operations, like parameter gathering, are restricted to a subset of GPUs. In this way, when we scale a cluster up — by adding new replicas across new nodes — the communication latency of frequent communication operations remains fixed, rather than growing with the size of the cluster.\nWe also reduce the data volume transmitted between nodes in the event that a copy of the model state won’t fit in a single node. Lastly, MiCS includes a gradient synchronization schedule that amortizes expensive gradient synchronization among all workers.\nOur experimental results show significant improvement in throughput and scaling efficiency on different-sized BERT models evaluated on clusters consisting of p3dn.24xlarge instances. MiCS is able to achieve near-linear scalability (denoted by the rectangular frames in the figure below) and provides up to 2.82-fold throughput compared to the second and third states of the three-stage zero-redundancy optimizer, or ZeRO, the communication management method built into DeepSpeed-v0.5.6 .\nWe have also compared MiCS with our earlier optimizations of ZeRO’s third stage (see figure below), demonstrating improvements even at the lower GPU counts that we investigated previously. We report all these findings in greater detail in a preprint paper on the arXiv.\n<img src=\"https://dev-media.amazoncloud.cn/ef8dc44a52d54edbb15b8bf8a839dd86_%E4%B8%8B%E8%BD%BD.jpg\" alt=\"下载.jpg\" />\nA comparison of MiCS and our earlier optimizations of DeepSpeed Zero’s third stage.\n<ins><a href=\"https://aws.amazon.com/ec2/instance-types/p4/\" target=\"_blank\">Amazon Web Services P4d</a></ins> provides up to 400Gbps networking bandwidth for high-performance computing. Unfortunately, the distributed system may not be able to fully utilize 400Gbps efficiently because of communication overhead — especially latency, which increases when adding more GPUs to the cluster.\nWe have deployed MiCS to train proprietary models with up to 175 billion parameters on p4d.24xlarge (40GB A100) and p4de.24xlarge (80GB A100) instances. When training a 175-billion-parameter model with a sequence length of 2,048 on 16 p4de.24xlarge instances, we are able to achieve 169-teraflops (54.2% of the theoretical peak) performance on each GPU. When we train a 100-billion-parameter model on 64 p4d.24xlarge instances (512 A100 GPUs), MiCS maintains over 170 teraflops per GPU (54.5% of the theoretical peak).\nWhen the size of the cluster is scaled from 128 GPUs to 512 GPUs, MiCS achieves 99.4% of the linear-scaling efficiency (as measured by the “weak scaling” metric). In contrast, DeepSpeed ZeRO’s third stage achieves only 72% weak-scaling efficiency and saturates at 62 teraflops per GPU (19.9% of the theoretical peak).\n<h4><a id=\"Scaleaware_model_partitioning_26\"></a>Scale-aware model partitioning</h4>\nBy default, DeepSpeed partitions model states across all devices, a strategy that lowers the memory consumption on each GPU in the cluster but incurs large communication overhead in training. More importantly, the overhead scales up with the size of the cluster, which causes the scalability to drop significantly at large scale.\nInstead of partitioning model states to all GPUs, MiCS divides GPUs in the cluster into multiple groups and partitions model states within each group. We call these groups partition groups. Each group holds a complete replica of model states. The following figure gives an example of partition groups consisting of two consecutive GPUs. Those GPUs holding the same part of the model state form another kind of group, a replication group.\n<img src=\"https://dev-media.amazoncloud.cn/4f7d7fedf325453fb52c6ef06cd2829e_%E4%B8%8B%E8%BD%BD%20%281%29.jpg\" alt=\"下载 1.jpg\" />\nThe relationship between partition groups and replication groups in MiCS.\nPartitioning model states within each partition group restricts the most frequent communications, parameter gathering and gradient synchronization, within a fixed number of GPUs. This strategy effectively controls the communication overhead and does not let it grow with the size of the cluster.\n<h4><a id=\"Hierarchical_communication_strategy_38\"></a>Hierarchical communication strategy</h4>\nWhen the memory requirement for a single replica of the model state is larger than the total amount of GPU memory in a single node, we need to store the replica on GPUs spanning multiple nodes. In that case, we have to rely on less-efficient internode communication.\nThe volume of transmitted data and the latency in a collective communication are determined by the message size and the number of participants. Particularly, the communication volume is proportional to (p - 1)/p, where p denotes the number of participants, and if the participants use the standard ring-shaped communication pattern, the latency has a linear dependency on the number of participants.\nThe message size cannot be reduced without compromising data integrity, but we can reduce the number of participants in internode communications. This lowers the communication volume factor to (p - k)/p and latency by p/(p/k + k) times, where k is the number of GPUs on a single node.\nConsider the simple example below, involving two nodes with two GPUs each. The standard ring-shaped communication pattern would aggregate data across nodes (left) by passing messages from each GPU to the next, so a single internode communication involves four GPUs.\n<img src=\"https://dev-media.amazoncloud.cn/d44ed3b05fe14a4b9cab684011b7fb8f_%E4%B8%8B%E8%BD%BD%20%282%29.jpg\" alt=\"下载 2.jpg\" />\nMiCS reduces the number of GPUs that participate in any given internode communication.\nMiCS, by contrast, executes these internode operations in parallel, so each internode communication involves only two GPUs (right), which exchange only half the information that we want to communicate. Each node then aggregates the internode data locally to assemble the full message. In this case, the communication volume factor is reduced from ¾ ((4-1)/4) to ½ ((4-2/4).\n<h4><a id=\"Twohop_gradient_synchronization_54\"></a>Two-hop gradient synchronization</h4>\nSynchronizing gradients among all workers is an expensive operation, required to keep workers working on the same model states. During the training of large neural nets, batch size is typically limited by GPU memory. Gradient accumulation is a technique that splits a batch of samples into several microbatches that will be run sequentially in multiple microsteps.\nWith MiCS, we can accumulate gradients inside each partition group in multiple microbatches until the last microbatch is processed. That is, for each microstep, we can accumulate the full set of gradients for each model replica inside a subset of GPUs (i.e., a partition group). Then, after the last microbatch is handled, each GPU synchronizes gradients with the other GPUs representing the same part of the model state.\nThis allows us to amortize the synchronization overhead across replication groups to multiple microsteps. The following figure gives an example of two-hop gradient synchronization for training with four microsteps.\n<img src=\"https://dev-media.amazoncloud.cn/d72ae7dfd86a412097497966967df368_%E4%B8%8B%E8%BD%BD%20%283%29.jpg\" alt=\"下载 3.jpg\" />\nTwo-hop gradient synchronization.\nBecause of these three techniques, MiCS shows great scalability on large clusters and delivers excellent training throughput performance, and it enables us to achieve a new state-of-the-art performance on Amazon Web Services p4de.24xlarge machines.\nWe are working to open-source MiCS for public use, in the belief that it will greatly reduce the time and cost of large-model training on the Amazon EC2 platform. Please refer to our <ins><a href=\"https://arxiv.org/abs/2205.00119\" target=\"_blank\">preprint</a></ins> for a more detailed explanation of our system and analysis of its performance.\nAcknowledgements: <ins><a href=\"https://www.amazon.science/author/yida-wang\" target=\"_blank\">Yida Wang</a></ins>, <ins><a href=\"https://www.amazon.science/author/justin-chiu\" target=\"_blank\">Justin Chiu</a></ins>, Roshan Makhijani, RJ, <ins><a href=\"https://www.amazon.science/author/stephen-rawls\" target=\"_blank\">Stephen Rawls</a></ins>, <ins><a href=\"https://www.amazon.science/research-awards/recipients/xin-jin\" target=\"_blank\">Xin Jin</a></ins>\nABOUT THE AUTHOR\n<h4><a id=\"Zhen_Zhang_74\"></a>Zhen Zhang</h4>\nZhen Zhang is a PhD student in computer science at Johns Hopkins University. He was an applied-science intern at Amazon when the work was done.\n<h4><a id=\"Shuai_Zhenghttpswwwamazonscienceauthorshuaizheng_77\"></a><a href=\"https://www.amazon.science/author/shuai-zheng\" target=\"_blank\">Shuai Zheng</a></h4>\nShuai Zheng is an applied scientist with Amazon Web Services.\n"}

亚马逊云科技解决方案基于行业客户应用场景及技术领域的解决方案

联系亚马逊云科技专家