Compressing BART models for resource-constrained operation

{"value":"Machines hoping to converse convincingly with humans have several natural-language-processing ([NLP](https://www.amazon.science/tag/nlp)) skills to master, including text summarization, information extraction, and question answering. It’s these skills that enable virtual personal assistants to search for dinner recipes online or answer random questions.\n\nRecent advances in sequence-to-sequence pretrained language models such as BART ([bidirectional autoregressive Transformers](https://aclanthology.org/2020.acl-main.703.pdf)) have led to great performance on many NLP tasks. These models’ successes, however, come at the cost of significant computation and memory resources; a typical BART model might have hundreds of millions of parameters.\n\nFor resource-constrained devices — such as cell phones or smart home appliances — this can make BART totally unusable. [At ACL 2022](https://www.amazon.science/conferences-and-events/acl-2022), scientists from Amazon Web Services’ AI Labs presented a paper that addresses this problem by using a combination of distillation and quantization to compress a BART model to less than 1/16th its size, with little drop in performance.\n\n\n#### **A two-pronged approach**\n\n\nQuantization, which maps high-precision values to a limited menu of lower-precision values, and distillation, in which a smaller, more efficient student model is trained to emulate a larger, more powerful teacher, are common techniques for reducing neural networks’ memory footprints.\n\nIn the new paper, the Amazon researchers combine distillation and distillation-aware quantization to reduce the model’s footprint.\n\nThe researchers begin by fine-tuning a BART model — the teacher model — on a particular NLP task, such as question answering or text summarization. They then copy the weights from select layers of the trained teacher model over to a student model. This is the distillation process, which reduces the model footprint.\n\nThe next step is distillation-aware quantization. The student model is quantized, producing a low-precision model — although the full-precision student model is also kept on hand, as it’s necessary for the next step in the process.\n\nThe quantized student model then processes the dataset that was used to train the teacher model, and its outputs are assessed according to two metrics: the standard task-based loss, which measures how far the outputs diverge from ground truth, and a distillation loss, which measures the difference between the quantized-and-distilled student model and the teacher model.\n\n![下载.jpg](https://dev-media.amazoncloud.cn/0319db4565864a26bde1c7d1dab3382b_%E4%B8%8B%E8%BD%BD.jpg)\n\nAn example of distillation-aware quantization, in which both task and distillation losses are computed for a quantized student model, but back-propagation updates the parameters of a differentiable full-precision model. The student models distill the teacher model by copying over only a subset of its network layers.\nBASED ON A FIGURE FROM \"[TERNARYBERT: DISTILLATION-AWARE ULTRA-LOW BIT BERT](https://aclanthology.org/2020.emnlp-main.37.pdf)\"\n\nBoth of these losses are then used to update the parameters of, not the quantized student model, but the full-precision student model. This is because the standard algorithm for updating a neural network depends on gradient descent, which requires differentiable — that is, continuously variable — model parameters. The parameters of a quantized model have discrete values, so they’re not differentiable.\n\nOnce the full-precision student model has been updated — to minimize not only its error on the training set but also its difference from the teacher model — it is quantized once again, to reduce its memory footprint\n\n\n#### **Experiments**\n\n\nThe researchers compared the efficiency of their distilled and quantized BART model against three different benchmarks, on the tasks of both text summarization and long-form question answering. They also investigated how distillation-aware quantization would work on a more complex model like mBART, a multilingual model designed to translate phrases between languages — in this case, between English and Romanian.\n\nIn their initial analysis, they found that combining distillation and quantization provided better compression than quantization alone and resulted in no performance drop for the long-form-question task and minimal dropoff for the summarization task. They also found that it is possible to compress the model to nearly 1/28th its original size. However, at this compression rate, the model has variable performance; the desired degree of compression should be assessed on a task-to-task basis.\n\nFor the mBART task, the team found that the distillation-aware approach was effective at compressing the model’s footprint when using eight-bit quantization, but its performance began to drop more significantly when the number of quantization bits was lowered to two. The researchers believe that this performance drop was caused by accumulated distillation and quantization errors, which may be more acute for the complex problem of machine translation.\n\nIn future work, the researchers hope to further explore the multilingual mBART model and to evaluate additional compression techniques, including head pruning and sequence-level distillation. As the current study focused on memory footprints, they also plan to expand their study to investigate latency effects.\n\nABOUT THE AUTHOR\n\n#### **[Sarah Wells](https://www.amazon.science/author/sarah-wells)**\n\n\nSarah Wells is a science journalist whose work has appeared in Popular Mechanics, Motherboard, Symmetry Magazine, and PBS Space Time, among others","render":"Machines hoping to converse convincingly with humans have several natural-language-processing (<a href=\"https://www.amazon.science/tag/nlp\" target=\"_blank\">NLP</a>) skills to master, including text summarization, information extraction, and question answering. It’s these skills that enable virtual personal assistants to search for dinner recipes online or answer random questions.\nRecent advances in sequence-to-sequence pretrained language models such as BART (<a href=\"https://aclanthology.org/2020.acl-main.703.pdf\" target=\"_blank\">bidirectional autoregressive Transformers</a>) have led to great performance on many NLP tasks. These models’ successes, however, come at the cost of significant computation and memory resources; a typical BART model might have hundreds of millions of parameters.\nFor resource-constrained devices — such as cell phones or smart home appliances — this can make BART totally unusable. <a href=\"https://www.amazon.science/conferences-and-events/acl-2022\" target=\"_blank\">At ACL 2022</a>, scientists from Amazon Web Services’ AI Labs presented a paper that addresses this problem by using a combination of distillation and quantization to compress a BART model to less than 1/16th its size, with little drop in performance.\n<h4><a id=\"A_twopronged_approach_7\"></a>A two-pronged approach</h4>\nQuantization, which maps high-precision values to a limited menu of lower-precision values, and distillation, in which a smaller, more efficient student model is trained to emulate a larger, more powerful teacher, are common techniques for reducing neural networks’ memory footprints.\nIn the new paper, the Amazon researchers combine distillation and distillation-aware quantization to reduce the model’s footprint.\nThe researchers begin by fine-tuning a BART model — the teacher model — on a particular NLP task, such as question answering or text summarization. They then copy the weights from select layers of the trained teacher model over to a student model. This is the distillation process, which reduces the model footprint.\nThe next step is distillation-aware quantization. The student model is quantized, producing a low-precision model — although the full-precision student model is also kept on hand, as it’s necessary for the next step in the process.\nThe quantized student model then processes the dataset that was used to train the teacher model, and its outputs are assessed according to two metrics: the standard task-based loss, which measures how far the outputs diverge from ground truth, and a distillation loss, which measures the difference between the quantized-and-distilled student model and the teacher model.\n<img src=\"https://dev-media.amazoncloud.cn/0319db4565864a26bde1c7d1dab3382b_%E4%B8%8B%E8%BD%BD.jpg\" alt=\"下载.jpg\" />\nAn example of distillation-aware quantization, in which both task and distillation losses are computed for a quantized student model, but back-propagation updates the parameters of a differentiable full-precision model. The student models distill the teacher model by copying over only a subset of its network layers. \nBASED ON A FIGURE FROM “<a href=\"https://aclanthology.org/2020.emnlp-main.37.pdf\" target=\"_blank\">TERNARYBERT: DISTILLATION-AWARE ULTRA-LOW BIT BERT</a>”\nBoth of these losses are then used to update the parameters of, not the quantized student model, but the full-precision student model. This is because the standard algorithm for updating a neural network depends on gradient descent, which requires differentiable — that is, continuously variable — model parameters. The parameters of a quantized model have discrete values, so they’re not differentiable.\nOnce the full-precision student model has been updated — to minimize not only its error on the training set but also its difference from the teacher model — it is quantized once again, to reduce its memory footprint\n<h4><a id=\"Experiments_30\"></a>Experiments</h4>\nThe researchers compared the efficiency of their distilled and quantized BART model against three different benchmarks, on the tasks of both text summarization and long-form question answering. They also investigated how distillation-aware quantization would work on a more complex model like mBART, a multilingual model designed to translate phrases between languages — in this case, between English and Romanian.\nIn their initial analysis, they found that combining distillation and quantization provided better compression than quantization alone and resulted in no performance drop for the long-form-question task and minimal dropoff for the summarization task. They also found that it is possible to compress the model to nearly 1/28th its original size. However, at this compression rate, the model has variable performance; the desired degree of compression should be assessed on a task-to-task basis.\nFor the mBART task, the team found that the distillation-aware approach was effective at compressing the model’s footprint when using eight-bit quantization, but its performance began to drop more significantly when the number of quantization bits was lowered to two. The researchers believe that this performance drop was caused by accumulated distillation and quantization errors, which may be more acute for the complex problem of machine translation.\nIn future work, the researchers hope to further explore the multilingual mBART model and to evaluate additional compression techniques, including head pruning and sequence-level distillation. As the current study focused on memory footprints, they also plan to expand their study to investigate latency effects.\nABOUT THE AUTHOR\n<h4><a id=\"Sarah_Wellshttpswwwamazonscienceauthorsarahwells_43\"></a><a href=\"https://www.amazon.science/author/sarah-wells\" target=\"_blank\">Sarah Wells</a></h4>\nSarah Wells is a science journalist whose work has appeared in Popular Mechanics, Motherboard, Symmetry Magazine, and PBS Space Time, among others\n"}

亚马逊云科技解决方案基于行业客户应用场景及技术领域的解决方案

联系亚马逊云科技专家