Making DeepSpeed ZeRO run efficiently on more-affordable hardware

海外精选
海外精选的内容汇集了全球优质的亚马逊云科技相关技术内容。同时,内容中提到的“AWS” 是 “Amazon Web Services” 的缩写,在此网站不作为商标展示。
0
0
{"value":"Most modern ++[natural-language-processing](https://www.amazon.science/tag/nlp)++ applications are built on top of pretrained language models, which encode the probabilities of word sequences for entire languages. Over time, these models have trended larger and larger, into the regime of billions or even trillions of parameters.\n\nTraining these models within a reasonable amount of time requires very large computing clusters, whose large communication volume can block computation, resulting in low or inefficient GPU utilization. Communication between the GPUs needs to be carefully managed to avoid becoming a performance bottleneck.\n\n![image.png](https://dev-media.amazoncloud.cn/4225a73a790140d5adea32b34ee668da_image.png)\n\n\n##### **Related content**\n[The science behind Amazon SageMaker’s distributed-training engines](https://www.amazon.science/latest-news/the-science-of-amazon-sagemakers-distributed-training-engines)\n\nMicrosoft’s DeepSpeed distributed-training library introduced one such management technique, called the Zero Redundancy Optimizer (ZeRO). ZeRO works by partitioning the state of a machine learning model across distributed workers and fetching the necessary model state from other workers as needed during training. ZeRO has several “stages,” each of which allows the training of larger models through reducing memory requirements, typically at the cost of additional communication volume.\n\nWhile Microsoft researchers were able to achieve ideal scaling performance with this technique, they reported experiments only on a specialized hypercluster that uses expensive high-speed InfiniBand networking (specifically, an Nvidia DGX system).\n\nTo reduce costs for customers in need of high-performance computing, Amazon Web Services (Amazon Web Services) uses an ++[Elastic Fabric Adapter](https://docs.aws.amazon.com/batch/latest/userguide/efa.html)++ (EFA) network instead of InfiniBand. The EFA available on instances of Amazon Web Services’s p4d.24xlarge computational infrastructure has less communication bandwidth than InfiniBand on the Nvidia DGX hypercluster, so we would expect some performance dropoff for bandwidth-intensive tasks. When we tried to reproduce Microsoft’s results, however, we found that the relative dropoff in ZeRO’s third stage was twice that of the dropoff in the second stage.\n\nWe profiled the training process to look for bottlenecks and observed that in ZeRO Stage 3, communication dominated training time. We have made a series of optimizations to ZeRO Stage 3 in order to close the performance gap relative to results obtained on InfiniBand-equipped DGX clusters. Below is a table showing the overall performance improvement conferred by our optimizations, measured when training a RoBERTa language model on Amazon Web Services p4d.24xlarge instances.\n\n![e4512e3c820b81706d6d10a7ccc7db0.jpg](https://dev-media.amazoncloud.cn/a61c6d17ed484e9cb9050b5aeeca3b52_e4512e3c820b81706d6d10a7ccc7db0.jpg)\n\nIn January, we ++[merged our optimizations](https://github.com/microsoft/DeepSpeed/pull/1453)++ into the DeepSpeed code repository for public use.\n\n#### **Optimizations**\n\n![下载.gif](https://dev-media.amazoncloud.cn/cd6304221cfa44fab194b516b9931621_%E4%B8%8B%E8%BD%BD.gif)\n\n\n##### **Related content**\n\n[Accelerating parallel training of neural nets](https://www.amazon.science/blog/accelerating-parallel-training-of-neural-nets)\n\nOur optimizations can roughly be categorized as (1) improving overlap between communication and computation, (2) improving bandwidth utilization, and (3) improving memory efficiency\n\n#### **Synchronization/Parallelism**\n\n**Finer-grained synchronization between communication and computation streams.**\n\nIn lower-bandwidth or large clusters where communication times dominate, it is critical to mask communication costs by overlapping computation with communication. Through profiling, we found that this overlapping was limited by ZeRO’s overly coarse synchronization.\n\nThis resulted in a suboptimal level of overlapping for two distributed-computing operations: allgather, which aggregates data (in this case, model parameters) from all workers across the network, and reduce-scatter, which reduces data (in this case, summing gradients) across workers. These two operations were causing poor GPU utilization because communication was constantly blocking computation operations. In response, we made significant changes to the parameter gathering and gradient reduce-scatter paths to reduce or remove synchronization while maintaining correctness.\n\nAfter these changes, we were able to achieve much better overlapping and thus much fewer and smaller computation bubbles.\n\n##### **Precomputation/caching of Python fetching and partitioning decisions**\n\n![下载 1.gif](https://dev-media.amazoncloud.cn/22f9dfa9a4bc41b09f64aae514e7ea20_%E4%B8%8B%E8%BD%BD%20%281%29.gif)\n\n\n##### **Related content**\n\n**[More efficient and reliable retrieval of distributed data](https://www.amazon.science/blog/more-efficient-and-reliable-retrieval-of-distributed-data)**\n\nDuring training, many complex decisions need to be made, relating to which parameters should be fetched, which parameters will be used next, which parameters may be reused soon and should be kept, and which can be released. These operations were slow enough to frequently prevent the Python process from keeping GPUs fed with work, creating large computation bubbles.\n\nWe optimized this by precomputing or caching as many decisions as possible, speeding up their computation to the point that it has become a nonfactor for training throughput.\n\n#### **Communication/bandwidth use**\n\n##### **Batching allgather/reduce-scatter calls**\n\nWe found that batching the collective operations — allgather and reduce-scatter — uses bandwidth more efficiently and amortizes the fixed costs of running the computational kernels that execute the operations. To implement collective batching, we flatten tensor data into a single, contiguous buffer to be sent in a single transaction. Each collective requires a special interleaving scheme to ensure that each worker receives the correct data.\n\n![image.png](https://dev-media.amazoncloud.cn/673bdb312a0043a396f61e0ad9cab897_image.png)\n\nAllgather interleaving scheme.\n\n![image.png](https://dev-media.amazoncloud.cn/5d7120831d14468a828a71bf753488a5_image.png)\n\nReduce-scatter interleaving scheme.\n\n#### **Memory**\n\nOur implementation of ZeRO, like the Microsoft implementation, uses the Compute Unified Device Architecture (CUDA), Nvidia’s parallel-computing platform. CUDA memory allocations are both synchronous and slow (ignoring the stream-ordered alternatives cudaMallocAsync and cudaMemcpyAsync, which are not yet used in PyTorch), so PyTorch uses a caching allocator to avoid the large costs of constantly reallocating memory. If there are no cached or free blocks for an allocation request, the allocator will flush its cache. This is disastrous for a few reasons:\n\n- Before the flush can begin, several cudaEventSynchronize calls are necessary to allow computation on held memory to complete. This and the subsequent cudaFree calls can take multiple seconds.\n- Different workers are not guaranteed to flush their caches simultaneously. This means that for any collective, if even a single worker is currently flushing its cache, the other N-1 workers sit blocked waiting for that worker to join. As cluster size increases, so does the probability that at least one worker is flushing its cache for any given collective.\n- After the cache flush, subsequent allocations require cudaMalloc calls, which as mentioned earlier are both synchronous and slow.\n\nFor these reasons, memory efficiency is critical for performance.\n\n**Memory-efficient batched PyTorch collectives**\n\nAlthough our use of batched collectives significantly reduced kernel launch overhead and improved bandwidth utilization, it also increased memory consumption because of its flattening of batched tensors into an additional buffer.\n\n![下载 2.gif](https://dev-media.amazoncloud.cn/5836db6ac5424f619ed99d553e525307_%E4%B8%8B%E8%BD%BD%20%282%29.gif)\n\n##### **Related content**\n\n[How to train large graph neural networks efficiently](https://www.amazon.science/blog/how-to-train-large-graph-neural-networks-efficiently)\n\nTo avoid redundant flatten operations in PyTorch collectives, we used the *_base variants of the collective operations, which accept pre-flattened tensors, avoiding the need to internally allocate additional flattened buffers. In future work, we plan to use group-based batching operations from the Nvidia Collective Communications Library (NCCL) to eliminate all flattening operations.\n\n##### **More aggressive initialization-time defragmentation of parameter partitions**\n\nEven with more than 10GB of free GPU memory, we continued to see evidence of allocator cache flushes, suggesting memory fragmentation. In order to reduce this, we made initialization-time defragmentation changes to move all persisted tensors into a single contiguous buffer.\n\n#### **Miscellaneous**\n\nIn addition to the optimizations described above, we also\n\n- optimized gradient normalization by reducing host-device data movement and synchronization and pulling math operations out of a for-loop into a single kernel launch with parallelized computation; and\n\n- removed tensor operations (.norm()) that were being added to debug messages via string formatting. (These were causing copies from host to device, which meant data movement and host-device synchronization.)\n\nBy making DeepSpeed ZeRO Stage 3 performant on widely available public cloud offerings, we hope to further democratize the training of large language models.\n\n**Acknowledgments**: Zhen Zhang, ++[Stephen Rawls](https://www.amazon.science/author/stephen-rawls)++, ++[Yida Wang](https://www.amazon.science/author/yida-wang)++\n\nABOUT THE AUTHOR\n\n#### **[Justin Chiu](https://www.amazon.science/author/justin-chiu)**\nJustin Chiu is a senior software engineer in the Alexa AI organization.\n\n#### **[Shuai Zheng](https://www.amazon.science/author/shuai-zheng)**\nShuai Zheng is an applied scientist with Amazon Web Services.\n","render":"<p>Most modern <ins><a href=\"https://www.amazon.science/tag/nlp\" target=\"_blank\">natural-language-processing</a></ins> applications are built on top of pretrained language models, which encode the probabilities of word sequences for entire languages. Over time, these models have trended larger and larger, into the regime of billions or even trillions of parameters.</p>\n<p>Training these models within a reasonable amount of time requires very large computing clusters, whose large communication volume can block computation, resulting in low or inefficient GPU utilization. Communication between the GPUs needs to be carefully managed to avoid becoming a performance bottleneck.</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/4225a73a790140d5adea32b34ee668da_image.png\" alt=\"image.png\" /></p>\n<h5><a id=\"Related_content_7\"></a><strong>Related content</strong></h5>\n<p><a href=\"https://www.amazon.science/latest-news/the-science-of-amazon-sagemakers-distributed-training-engines\" target=\"_blank\">The science behind Amazon SageMaker’s distributed-training engines</a></p>\n<p>Microsoft’s DeepSpeed distributed-training library introduced one such management technique, called the Zero Redundancy Optimizer (ZeRO). ZeRO works by partitioning the state of a machine learning model across distributed workers and fetching the necessary model state from other workers as needed during training. ZeRO has several “stages,” each of which allows the training of larger models through reducing memory requirements, typically at the cost of additional communication volume.</p>\n<p>While Microsoft researchers were able to achieve ideal scaling performance with this technique, they reported experiments only on a specialized hypercluster that uses expensive high-speed InfiniBand networking (specifically, an Nvidia DGX system).</p>\n<p>To reduce costs for customers in need of high-performance computing, Amazon Web Services (Amazon Web Services) uses an <ins><a href=\"https://docs.aws.amazon.com/batch/latest/userguide/efa.html\" target=\"_blank\">Elastic Fabric Adapter</a></ins> (EFA) network instead of InfiniBand. The EFA available on instances of Amazon Web Services’s p4d.24xlarge computational infrastructure has less communication bandwidth than InfiniBand on the Nvidia DGX hypercluster, so we would expect some performance dropoff for bandwidth-intensive tasks. When we tried to reproduce Microsoft’s results, however, we found that the relative dropoff in ZeRO’s third stage was twice that of the dropoff in the second stage.</p>\n<p>We profiled the training process to look for bottlenecks and observed that in ZeRO Stage 3, communication dominated training time. We have made a series of optimizations to ZeRO Stage 3 in order to close the performance gap relative to results obtained on InfiniBand-equipped DGX clusters. Below is a table showing the overall performance improvement conferred by our optimizations, measured when training a RoBERTa language model on Amazon Web Services p4d.24xlarge instances.</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/a61c6d17ed484e9cb9050b5aeeca3b52_e4512e3c820b81706d6d10a7ccc7db0.jpg\" alt=\"e4512e3c820b81706d6d10a7ccc7db0.jpg\" /></p>\n<p>In January, we <ins><a href=\"https://github.com/microsoft/DeepSpeed/pull/1453\" target=\"_blank\">merged our optimizations</a></ins> into the DeepSpeed code repository for public use.</p>\n<h4><a id=\"Optimizations_22\"></a><strong>Optimizations</strong></h4>\n<p><img src=\"https://dev-media.amazoncloud.cn/cd6304221cfa44fab194b516b9931621_%E4%B8%8B%E8%BD%BD.gif\" alt=\"下载.gif\" /></p>\n<h5><a id=\"Related_content_27\"></a><strong>Related content</strong></h5>\n<p><a href=\"https://www.amazon.science/blog/accelerating-parallel-training-of-neural-nets\" target=\"_blank\">Accelerating parallel training of neural nets</a></p>\n<p>Our optimizations can roughly be categorized as (1) improving overlap between communication and computation, (2) improving bandwidth utilization, and (3) improving memory efficiency</p>\n<h4><a id=\"SynchronizationParallelism_33\"></a><strong>Synchronization/Parallelism</strong></h4>\n<p><strong>Finer-grained synchronization between communication and computation streams.</strong></p>\n<p>In lower-bandwidth or large clusters where communication times dominate, it is critical to mask communication costs by overlapping computation with communication. Through profiling, we found that this overlapping was limited by ZeRO’s overly coarse synchronization.</p>\n<p>This resulted in a suboptimal level of overlapping for two distributed-computing operations: allgather, which aggregates data (in this case, model parameters) from all workers across the network, and reduce-scatter, which reduces data (in this case, summing gradients) across workers. These two operations were causing poor GPU utilization because communication was constantly blocking computation operations. In response, we made significant changes to the parameter gathering and gradient reduce-scatter paths to reduce or remove synchronization while maintaining correctness.</p>\n<p>After these changes, we were able to achieve much better overlapping and thus much fewer and smaller computation bubbles.</p>\n<h5><a id=\"Precomputationcaching_of_Python_fetching_and_partitioning_decisions_43\"></a><strong>Precomputation/caching of Python fetching and partitioning decisions</strong></h5>\n<p><img src=\"https://dev-media.amazoncloud.cn/22f9dfa9a4bc41b09f64aae514e7ea20_%E4%B8%8B%E8%BD%BD%20%281%29.gif\" alt=\"下载 1.gif\" /></p>\n<h5><a id=\"Related_content_48\"></a><strong>Related content</strong></h5>\n<p><strong><a href=\"https://www.amazon.science/blog/more-efficient-and-reliable-retrieval-of-distributed-data\" target=\"_blank\">More efficient and reliable retrieval of distributed data</a></strong></p>\n<p>During training, many complex decisions need to be made, relating to which parameters should be fetched, which parameters will be used next, which parameters may be reused soon and should be kept, and which can be released. These operations were slow enough to frequently prevent the Python process from keeping GPUs fed with work, creating large computation bubbles.</p>\n<p>We optimized this by precomputing or caching as many decisions as possible, speeding up their computation to the point that it has become a nonfactor for training throughput.</p>\n<h4><a id=\"Communicationbandwidth_use_56\"></a><strong>Communication/bandwidth use</strong></h4>\n<h5><a id=\"Batching_allgatherreducescatter_calls_58\"></a><strong>Batching allgather/reduce-scatter calls</strong></h5>\n<p>We found that batching the collective operations — allgather and reduce-scatter — uses bandwidth more efficiently and amortizes the fixed costs of running the computational kernels that execute the operations. To implement collective batching, we flatten tensor data into a single, contiguous buffer to be sent in a single transaction. Each collective requires a special interleaving scheme to ensure that each worker receives the correct data.</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/673bdb312a0043a396f61e0ad9cab897_image.png\" alt=\"image.png\" /></p>\n<p>Allgather interleaving scheme.</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/5d7120831d14468a828a71bf753488a5_image.png\" alt=\"image.png\" /></p>\n<p>Reduce-scatter interleaving scheme.</p>\n<h4><a id=\"Memory_70\"></a><strong>Memory</strong></h4>\n<p>Our implementation of ZeRO, like the Microsoft implementation, uses the Compute Unified Device Architecture (CUDA), Nvidia’s parallel-computing platform. CUDA memory allocations are both synchronous and slow (ignoring the stream-ordered alternatives cudaMallocAsync and cudaMemcpyAsync, which are not yet used in PyTorch), so PyTorch uses a caching allocator to avoid the large costs of constantly reallocating memory. If there are no cached or free blocks for an allocation request, the allocator will flush its cache. This is disastrous for a few reasons:</p>\n<ul>\n<li>Before the flush can begin, several cudaEventSynchronize calls are necessary to allow computation on held memory to complete. This and the subsequent cudaFree calls can take multiple seconds.</li>\n<li>Different workers are not guaranteed to flush their caches simultaneously. This means that for any collective, if even a single worker is currently flushing its cache, the other N-1 workers sit blocked waiting for that worker to join. As cluster size increases, so does the probability that at least one worker is flushing its cache for any given collective.</li>\n<li>After the cache flush, subsequent allocations require cudaMalloc calls, which as mentioned earlier are both synchronous and slow.</li>\n</ul>\n<p>For these reasons, memory efficiency is critical for performance.</p>\n<p><strong>Memory-efficient batched PyTorch collectives</strong></p>\n<p>Although our use of batched collectives significantly reduced kernel launch overhead and improved bandwidth utilization, it also increased memory consumption because of its flattening of batched tensors into an additional buffer.</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/5836db6ac5424f619ed99d553e525307_%E4%B8%8B%E8%BD%BD%20%282%29.gif\" alt=\"下载 2.gif\" /></p>\n<h5><a id=\"Related_content_86\"></a><strong>Related content</strong></h5>\n<p><a href=\"https://www.amazon.science/blog/how-to-train-large-graph-neural-networks-efficiently\" target=\"_blank\">How to train large graph neural networks efficiently</a></p>\n<p>To avoid redundant flatten operations in PyTorch collectives, we used the *_base variants of the collective operations, which accept pre-flattened tensors, avoiding the need to internally allocate additional flattened buffers. In future work, we plan to use group-based batching operations from the Nvidia Collective Communications Library (NCCL) to eliminate all flattening operations.</p>\n<h5><a id=\"More_aggressive_initializationtime_defragmentation_of_parameter_partitions_92\"></a><strong>More aggressive initialization-time defragmentation of parameter partitions</strong></h5>\n<p>Even with more than 10GB of free GPU memory, we continued to see evidence of allocator cache flushes, suggesting memory fragmentation. In order to reduce this, we made initialization-time defragmentation changes to move all persisted tensors into a single contiguous buffer.</p>\n<h4><a id=\"Miscellaneous_96\"></a><strong>Miscellaneous</strong></h4>\n<p>In addition to the optimizations described above, we also</p>\n<ul>\n<li>\n<p>optimized gradient normalization by reducing host-device data movement and synchronization and pulling math operations out of a for-loop into a single kernel launch with parallelized computation; and</p>\n</li>\n<li>\n<p>removed tensor operations (.norm()) that were being added to debug messages via string formatting. (These were causing copies from host to device, which meant data movement and host-device synchronization.)</p>\n</li>\n</ul>\n<p>By making DeepSpeed ZeRO Stage 3 performant on widely available public cloud offerings, we hope to further democratize the training of large language models.</p>\n<p><strong>Acknowledgments</strong>: Zhen Zhang, <ins><a href=\"https://www.amazon.science/author/stephen-rawls\" target=\"_blank\">Stephen Rawls</a></ins>, <ins><a href=\"https://www.amazon.science/author/yida-wang\" target=\"_blank\">Yida Wang</a></ins></p>\n<p>ABOUT THE AUTHOR</p>\n<h4><a id=\"Justin_Chiuhttpswwwamazonscienceauthorjustinchiu_110\"></a><strong><a href=\"https://www.amazon.science/author/justin-chiu\" target=\"_blank\">Justin Chiu</a></strong></h4>\n<p>Justin Chiu is a senior software engineer in the Alexa AI organization.</p>\n<h4><a id=\"Shuai_Zhenghttpswwwamazonscienceauthorshuaizheng_113\"></a><strong><a href=\"https://www.amazon.science/author/shuai-zheng\" target=\"_blank\">Shuai Zheng</a></strong></h4>\n<p>Shuai Zheng is an applied scientist with Amazon Web Services.</p>\n"}
目录
亚马逊云科技解决方案 基于行业客户应用场景及技术领域的解决方案
联系亚马逊云科技专家
亚马逊云科技解决方案
基于行业客户应用场景及技术领域的解决方案
联系专家
0
目录
关闭