使用 Amazon SageMaker 构建文本摘要应用

{"value":"##### **背景介绍**\n\n\n文本摘要，就是对给定的单个或者多个文档进行梗概，即在保证能够反映原文档的重要内容的情况下，尽可能地保持简明扼要。质量良好的文摘能够在信息检索过程中发挥重要的作用，比如利用文摘代替原文档参与索引，可以有效缩短检索的时间，同时也能减少检索结果中的冗余信息，提高用户体验。随着信息爆炸时代的到来，自动文摘逐渐成为自然语言处理领域的一项重要的研究课题。\n\n文本摘要的需求来自多个我们真实的客户案例，对于大量的长文本对于新闻领域，金融领域，法律领域是司空见惯的。而在人力成本越来越高的今天，雇佣大量的专业人员进行信息精炼或者内容审核无疑要投入大量的资金。而自动文本摘要就显得意义非凡，具体来说，通过大量数据训练的深度学习模型可以在几百毫秒内产生长度可控的文本摘要，这大大地提升了摘要生成效率，节约了大量人力成本。\n\n对于目前的技术，可以根据摘要产生的方式大体可以分为两类：1）抽取式文本摘要：找到一个文档中最重要的几个句子并对其进行拼接；2）生成式文本摘要：直接建模为序列到序列的生成问题，根据源文本直接递归生成摘要。对于抽取式摘要，其具备效率高，解释性强的优势，但是抽取得到的文本在语义连续性上相较生成式摘要有所不足，故这里我们主要展示生成式摘要。\n\n[Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/index.html)是亚马逊云计算（Amazon Web Service）的一项完全托管的[机器学习](https://aws.amazon.com/cn/machine-learning/?trk=cndc-detail)平台服务，算法工程师和数据科学家可以基于此平台快速构建、训练和部署[机器学习](https://aws.amazon.com/cn/machine-learning/?trk=cndc-detail) (ML) 模型，而无需关注底层资源的管理和运维工作。它作为一个工具集，提供了用于[机器学习](https://aws.amazon.com/cn/machine-learning/?trk=cndc-detail)的端到端的所有组件，包括数据标记、数据处理、算法设计、模型训练、训练调试、超参调优、模型部署、模型监控等，使得[机器学习](https://aws.amazon.com/cn/machine-learning/?trk=cndc-detail)变得更为简单和轻松；同时，它依托于 Amazon 强大的底层资源，提供了高性能 CPU、GPU、弹性推理加速卡等丰富的计算资源和充足的算力，使得模型研发和部署更为轻松和高效。同时，本文还基于 [Huggingface](https://huggingface.co/)，Huggingface 是 NLP 著名的开源社区，并且与 Amazon SagaMaker 高度适配，可以在 Amazon SagaMaker 上以几行代码轻松实现 NLP 模型训练和部署。\n\n#### **解决方案概览**\n\n在此示例中，我们将使用 [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) 执行以下操作：\n\n- 环境准备\n- 下载数据集并将其进行数据预处理\n- 使用本地机器训练\n- 使用 [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) BYOS 进行模型训练\n- 托管部署及推理测试\n\n#### **环境准备**\n\n我们首先要创建一个 [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) Notebook，笔记本实例类型最好选择 ml.p3.2xlarge，因为本例中用到了本地机器训练的部分用来测试我们的代码，卷大小建议改成10GB或以上，因为运行该项目需要下载一些额外的数据。\n\n![image.png](https://dev-media.amazoncloud.cn/f279d4c7dec249afbf17c029a7f62c62_image.png)\n\n笔记本启动后，打开页面上的终端，执行以下命令下载代码。\n\n```\\ncd ~/SageMaker\\ngit clone https://github.com/HaoranLv/nlp_transformer.git\\n```\n\n#### **下载数据集并将其进行数据预处理**\n\n这里给出若干开源的中英文数据集：\n\n1.公开数据集 (英文)\n\n- XSUM，227k BBC articles\n- CNN/Dailymail，93k articles from the CNN, 220k articles from the Daily Mail\n- NEWSROOM，3M article-summary pairs written by authors and editors in the newsrooms of 38 major publications\n- Multi-News，56k pairs of news articles and their human-written summaries from the **[http://](http://sitenewser.com/)[com](http://sitenewser.com/)**\n- Gigaword，4M examples extracted from news articles，the task is to generate theheadline from the first sentence\n- arXiv, PubMed，two long documentdatasets of scientific publications from **[http://org](https://link.zhihu.com/?target=http%3A//arXiv.org)**(113k) andPubMed (215k). The task is to generate the abstract fromthe paper body.\n- BIGPATENT，3 millionU.S. patents along with human summaries under nine patent classification categories\n\n2.公开数据集 (中文)\n\n- 哈工大的新浪微博短文本摘要 [LCSTS](https://link.zhihu.com/?target=http%3A//icrc.hitsz.edu.cn/Article/show/139.html)\n- 教育新闻自动摘要语料 [chinese_abstractive_corpus](https://link.zhihu.com/?target=https%3A//github.com/wonderfulsuccess/chinese_abstractive_corpus)\n- NLPCC 2017 task3 [Single Document Summarization](https://link.zhihu.com/?target=http%3A//tcci.ccf.org.cn/conference/2017/taskdata.php)\n- 娱乐新闻等 [“神策杯”2018高校算法大师赛](https://link.zhihu.com/?target=https%3A//www.dcjingsai.com/common/cmpt/%25E2%2580%259C%25E7%25A5%259E%25E7%25AD%2596%25E6%259D%25AF%25E2%2580%259D2018%25E9%25AB%2598%25E6%25A0%25A1%25E7%25AE%2597%25E6%25B3%2595%25E5%25A4%25A7%25E5%25B8%2588%25E8%25B5%259B_%25E7%25AB%259E%25E8%25B5%259B%25E4%25BF%25A1%25E6%2581%25AF.html)\n\n本文以 Multi-News 为例，数据分为两列，headlines 代表摘要，text 代表全文。由于文本数据集较小，故直接官网下载原始 csv 文件上传到 SageMaker Notebook 即可。如下是部分数据集样例。\n\n![image.png](https://dev-media.amazoncloud.cn/0d6722fce59048ab859b960d4f0e539e_image.png)\n\n找到 hp_data.ipynb 运行代码。\n\n首先加载数据集\n\n```\\ndf=pd.read_csv（./data/hp/summary/news_summary.csv'）\\n```\n\n而后进行数据清洗\n\n```\\nclass Settings:\\n\\n TRAIN_DATA = \\"./data/hp/summary/news_summary_total.csv\\"\\n Columns = ['headlines', 'text']\\n encoding = 'latin-1'\\n columns_dict = {\\"headlines\\": \\"headlines\\", \\"text\\": \\"text\\"}\\n df_column_list = ['text', 'headlines']\\n SUMMARIZE_KEY = \\"\\"\\n SOURCE_TEXT_KEY = 'text'\\n TEST_SIZE = 0.2\\n BATCH_SIZE = 16\\n source_max_token_len = 128\\n target_max_token_len = 50\\n train_df_len = 82332\\n test_df_len = 20583\\n \\nclass Preprocess:\\n def __init__(self):\\n self.settings = Settings\\n\\n def clean_text(self, text):\\n text = text.lower()\\n text = re.sub('\\\\[.*?\\\\]', '', text)\\n text = re.sub('https?://\\\\S+|www\\\\.\\\\S+', '', text)\\n text = re.sub('<.*?>+', '', text)\\n text = re.sub('[%s]' % re.escape(string.punctuation), '', text)\\n text = re.sub('\\\\n', '', text)\\n text = re.sub('\\\\w*\\\\d\\\\w*', '', text)\\n return text\\n\\n def preprocess_data(self, data_path):\\n df = pd.read_csv(data_path, encoding=self.settings.encoding, usecols=self.settings.Columns)\\n # simpleT5 expects dataframe to have 2 columns: \\"source_text\\" and \\"target_text\\"\\n df = df.rename(columns=self.settings.columns_dict)\\n df = df[self.settings.df_column_list]\\n # T5 model expects a task related prefix: since it is a summarization task, we will add a prefix \\"summarize: \\"\\n df[self.settings.SOURCE_TEXT_KEY] = df[self.settings.SOURCE_TEXT_KEY]\\n\\n return df\\nsettings=Settings\\npreprocess=Preprocess()\\ndf = preprocess.preprocess_data(settings.TRAIN_DATA)\\n```\n\n随后完成训练集和测试集的划分并分别保存：\n\n```\\ndf.to_csv('./data/hp/summary/news_summary_cleaned.csv',index=False)\\ndf2=pd.read_csv('./data/hp/summary/news_summary_cleaned.csv')\\norder=['text','headlines']\\ndf3=df2[order]\\ntrain_df, test_df = train_test_split(df3, test_size=0.2,random_state=100)\\ntrain_df.to_csv('./data/hp/summary/news_summary_cleaned_train.csv',index=False)\\ntest_df.to_csv('./data/hp/summary/news_summary_cleaned_test.csv',index=False)\\n```\n\n#### **使用本地机器训练**\n\n在完成了上述的数据处理过程后，就可以进行模型训练了，下面的命令运行后即开始模型训练，代码会自动 Huggingface hub 中加载 google/pegasus-large 作为预训练模型，而后使用我们处理后的数据集进行模型训练。\n\n```\\n!python -u examples/pytorch/summarization/run_summarization.py \\\\\\n--model_name_or_path google/pegasus-large \\\\\\n--do_train \\\\\\n--do_eval \\\\\\n--per_device_train_batch_size=2 \\\\\\n--per_device_eval_batch_size=1 \\\\\\n--save_strategy epoch \\\\\\n--evaluation_strategy epoch \\\\\\n--overwrite_output_dir \\\\\\n--predict_with_generate \\\\\\n--train_file './data/hp/summary/news_summary_cleaned_train.csv' \\\\\\n--validation_file './data/hp/summary/news_summary_cleaned_test.csv' \\\\\\n--text_column 'text' \\\\\\n--summary_column 'headlines' \\\\\\n--output_dir='./models/local_train/pegasus-hp' \\\\\\n--num_train_epochs=1.0 \\\\\\n--eval_steps=500 \\\\\\n--save_total_limit=3 \\\\\\n--source_prefix \\"summarize: \\" > train_pegasus.log\\n```\n\n训练完成后，会提示日志信息如下。\n\n![image.png](https://dev-media.amazoncloud.cn/b0ddd3a90222421e86bbd7d60644b078_image.png)\n\n并且会对验证集的数据进行客观指标评估，这里使用 Rouge 进行评估。\n\n![image.png](https://dev-media.amazoncloud.cn/b9047e668efc426d85b945dde62062f9_image.png)\n\n模型结果文件及相应的日志等信息会自动保存在./models/local_train/pegasus-hp/checkpoint-500\n\n![image.png](https://dev-media.amazoncloud.cn/4e75f4e2911d44c798c1159b7bfae2f2_image.png)\n\n我们可以直接用这个产生的模型文件进行本地推理。注意这里的模型文件地址的指定为你刚刚训练产生的。\n\n```\\nimport pandas as pd\\ndf=pd.read_csv('./data/hp/summary/news_summary_cleaned_small_test.csv')\\nprint('原文:',df.loc[0,'text'])\\nprint('真实标签:',df.loc[0,'headlines'])\\nfrom transformers import pipeline\\nsummarizer=pipeline(\\"summarization\\",model=\\"./models/local_train/Pegasus-hp/checkpoint-500\\")\\nprint('模型预测:',summarizer(df.loc[0,'text'], max_length=50)[0]['summary_text'])\\n```\n\n输出如下：\n\n```\\n原文: Germany on Wednesday accused Vietnam of kidnapping a former Vietnamese oil executive Trinh Xuan Thanh, who allegedly sought asylum in Berlin, and taking him home to face accusations of corruption. Germany expelled a Vietnamese intelligence officer over the suspected kidnapping and demanded that Vietnam allow Thanh to return to Germany. However, Vietnam said Thanh had returned home by himself.\\n真实标签: Germany accuses Vietnam of kidnapping asylum seeker \\n模型预测: Germany accuses Vietnam of kidnapping ex-oil exec, taking him home\\n```\n\n到这里，就完成了一个模型的本地训练和推理过程。\n\n#### **使用 [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) BYOS 进行模型训练**\n\n在上文的范例中，我们使用本地环境一步步的训练了一个较小的模型，验证了我们的代码。现在，我们需要把代码进行整理，在 [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) 上，进行可扩展至分布式的托管训练任务。\n\n首先，我们要将上文的训练代码整理至一个 python 脚本，然后使用 SageMaker 上预配置的 Huggingface 容器，我们提供了很多灵活的使用方式来使用该容器，具体可以参考 [Hugging Face Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html#hugging-face-estimator)。\n\n由于 SageMaker 预置的 Huggingface 容器已经具备推理逻辑, 故这里只需要将上一步中的训练脚本引入容器即可, 具体流程如下:\n\n启动一个 Jupyter Notebook，选择 python3 作为解释器完成如下工作：\n\n权限配置\n\n```\\nimport sagemaker\\nimport os\\nsess = sagemaker.Session()\\nrole = sagemaker.get_execution_role()\\n\\nprint(f\\"sagemaker role arn: {role}\\")\\nprint(f\\"sagemaker bucket: {sess.default_bucket()}\\")\\nprint(f\\"sagemaker session region: {sess.boto_region_name}\\")\\n```\n\n数据上传到 S3\n\n```\\n# dataset used\\ndataset_name = ' news_summary'\\n# s3 key prefix for the data\\ns3_prefix = 'datasets/news_summary'\\nWORK_DIRECTORY = './data/'\\ndata_location = sess.upload_data(WORK_DIRECTORY, key_prefix=s3_prefix)\\ndata_location\\n```\n\n定义超参数并初始化 estimator。\n\n```\\nfrom sagemaker.huggingface import HuggingFace\\n\\n# hyperparameters which are passed to the training job\\nhyperparameters={'text_column':'text',\\n 'summary_column':'headlines',\\n 'train_file':'/opt/ml/input/data/train/news_summary_cleaned_train.csv',\\n 'validation_file':'/opt/ml/input/data/test/ news_summary_cleaned_test.csv',\\n 'output_dir':'/opt/ml/model',\\n 'do_train':True,\\n 'do_eval':True,\\n 'max_source_length': 128,\\n 'max_target_length': 128,\\n 'model_name_or_path': 't5-large',\\n 'learning_rate': 3e-4,\\n 'num_train_epochs': 1,\\n 'per_device_train_batch_size': 2,#16\\n 'gradient_accumulation_steps':2, \\n 'save_strategy':'epoch',\\n 'evaluation_strategy':'epoch',\\n 'save_total_limit':1,\\n }\\ndistribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}\\n# create the Estimator\\nhuggingface_estimator = HuggingFace(\\n entry_point='run_paraphrase.py',\\n source_dir='./scripts',\\n instance_type='ml.p3.2xlarge',#'ml.p3dn.24xlarge'\\n instance_count=1,\\n role=role,\\n max_run=24*60*60,\\n transformers_version='4.6',\\n pytorch_version='1.7',\\n py_version='py36',\\n volume_size=128,\\n hyperparameters = hyperparameters,\\n# distribution=distribution\\n)\\n```\n\n启动模型训练。\n\n```\\nhuggingface_estimator.fit(\\n {'train': data_location+'/news_summary_cleaned_train.csv',\\n 'test': data_location+'/news_summary_cleaned_test.csv',}\\n)\\n\\n```\n\n训练启动后，我们可以在 [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) 控制台看到这个训练任务，点进详情可以看到训练的日志输出，以及监控机器的 GPU、CPU、内存等的使用率等情况，以确认程序可以正常工作。训练完成后也可以在 CloudWatch 中查看训练日志。\n\n![image.png](https://dev-media.amazoncloud.cn/ed0f3ceb3d6942419c833f807fd1aac9_image.png)\n\n![image.png](https://dev-media.amazoncloud.cn/db18572cf28e402bbf0397f4a03980c7_image.png)\n\n#### **托管部署及推理测试**\n\n完成训练后，我们可以轻松的将上面的模型部署成一个实时可在生产环境中调用的端口。\n\n```\\nfrom sagemaker.huggingface.model import HuggingFaceModel\\n\\n# create Hugging Face Model Class\\nhuggingface_model = HuggingFaceModel(\\n# env= {'HF_TASK':'text-generation'},\\n model_data=\\"s3://sagemaker-us-west-2-847380964353/huggingface-pytorch-training-2022-04-19-05-56-07-474/output/model.tar.gz\\", # path to your trained SageMaker model\\n role=role, # IAM role with permissions to create an endpoint\\n transformers_version=\\"4.6\\", # Transformers version used\\n pytorch_version=\\"1.7\\", # PyTorch version used\\n py_version='py36', # Python version used\\n \\n)\\npredictor = huggingface_model.deploy(\\n initial_instance_count=1,\\n instance_type=\\"ml.g4dn.xlarge\\"\\n)\\n```\n\n模型调用\n\n```\\nfrom sagemaker.huggingface.model import HuggingFacePredictor\\npredictor=HuggingFacePredictor(endpoint_name='huggingface-pytorch-inference-2022-04-19-06-41-55-309')\\n\\nimport time\\ns=time.time()\\ndf=pd.read_csv('./data/hp/summary/news_summary_cleaned_small_test.csv')\\nprint('原文:',df.loc[0,'text'])\\nprint('真实标签:',df.loc[0,'headlines'])\\nout=predictor.predict({\\n 'inputs': df.loc[0,'text'],\\n \\"parameters\\": {\\"max_length\\": 256},\\n })\\ne=time.time()\\nprint('模型预测:' out)\\n```\n\n输出如下：\n\n```\\n原文: Germany on Wednesday accused Vietnam of kidnapping a former Vietnamese oil executive Trinh Xuan Thanh, who allegedly sought asylum in Berlin, and taking him home to face accusations of corruption. Germany expelled a Vietnamese intelligence officer over the suspected kidnapping and demanded that Vietnam allow Thanh to return to Germany. However, Vietnam said Thanh had returned home by himself.\\n真实标签: Germany accuses Vietnam of kidnapping asylum seeker \\n模型预测: Germany accuses Vietnam of kidnapping ex-oil exec, taking him home\\n```\n\n[Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail)\n\n以上就是使用 [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) 构建文本摘要应用的全部过程，可以看到通过 [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) 可以非常便利地结合 Huggingface 进行 NLP 模型的搭建，训练，部署的全流程。整个过程仅需要准备训练脚本以及数据即可通过若干命令启动训练和部署，同时，我们后续还会推出，使用 Amaozn SageMaker 进行更多 NLP 相关任务的实现方式，敬请关注。\n\n#### **参考资料**\n\n- Amazon Sagemaker: https://docs.aws.amazon.com/sagemaker/index.html\n- Huggingface：[https://huggingface.co/](https://huggingface.co/)\n- Code Link：[https://github.com/HaoranLv/nlp_transformer](https://github.com/HaoranLv/nlp_transformer)\n\n#### **本篇作者**\n\n![image.png](https://dev-media.amazoncloud.cn/958956cade3a4ec0bfc986b0655856d6_image.png)\n\n#### **吕浩然**\n\n亚马逊云科技应用科学家，长期从事计算机视觉，自然语言处理等领域的研究和开发工作。支持数据实验室项目，在时序预测，目标检测，OCR，自然语言生成等方向有丰富的算法开发以及落地实践经验。\n","render":"<h5><a id=\\"_0\\"></a>背景介绍</h5>\\n文本摘要，就是对给定的单个或者多个文档进行梗概，即在保证能够反映原文档的重要内容的情况下，尽可能地保持简明扼要。质量良好的文摘能够在信息检索过程中发挥重要的作用，比如利用文摘代替原文档参与索引，可以有效缩短检索的时间，同时也能减少检索结果中的冗余信息，提高用户体验。随着信息爆炸时代的到来，自动文摘逐渐成为自然语言处理领域的一项重要的研究课题。\n文本摘要的需求来自多个我们真实的客户案例，对于大量的长文本对于新闻领域，金融领域，法律领域是司空见惯的。而在人力成本越来越高的今天，雇佣大量的专业人员进行信息精炼或者内容审核无疑要投入大量的资金。而自动文本摘要就显得意义非凡，具体来说，通过大量数据训练的深度学习模型可以在几百毫秒内产生长度可控的文本摘要，这大大地提升了摘要生成效率，节约了大量人力成本。\n对于目前的技术，可以根据摘要产生的方式大体可以分为两类：1）抽取式文本摘要：找到一个文档中最重要的几个句子并对其进行拼接；2）生成式文本摘要：直接建模为序列到序列的生成问题，根据源文本直接递归生成摘要。对于抽取式摘要，其具备效率高，解释性强的优势，但是抽取得到的文本在语义连续性上相较生成式摘要有所不足，故这里我们主要展示生成式摘要。\n<a href=\\"https://docs.aws.amazon.com/sagemaker/index.html\\" target=\\"_blank\\">Amazon SageMaker</a>是亚马逊云计算（Amazon Web Service）的一项完全托管的[机器学习](https://aws.amazon.com/cn/machine-learning/?trk=cndc-detail)平台服务，算法工程师和数据科学家可以基于此平台快速构建、训练和部署[机器学习](https://aws.amazon.com/cn/machine-learning/?trk=cndc-detail) (ML) 模型，而无需关注底层资源的管理和运维工作。它作为一个工具集，提供了用于[机器学习](https://aws.amazon.com/cn/machine-learning/?trk=cndc-detail)的端到端的所有组件，包括数据标记、数据处理、算法设计、模型训练、训练调试、超参调优、模型部署、模型监控等，使得[机器学习](https://aws.amazon.com/cn/machine-learning/?trk=cndc-detail)变得更为简单和轻松；同时，它依托于 Amazon 强大的底层资源，提供了高性能 CPU、GPU、弹性推理加速卡等丰富的计算资源和充足的算力，使得模型研发和部署更为轻松和高效。同时，本文还基于 <a href=\\"https://huggingface.co/\\" target=\\"_blank\\">Huggingface</a>，Huggingface 是 NLP 著名的开源社区，并且与 Amazon SagaMaker 高度适配，可以在 Amazon SagaMaker 上以几行代码轻松实现 NLP 模型训练和部署。\\n<h4><a id=\\"_11\\"></a>解决方案概览</h4>\\n在此示例中，我们将使用 Amazon SageMaker 执行以下操作：\n<ul>\\n<li>环境准备</li>\n<li>下载数据集并将其进行数据预处理</li>\n<li>使用本地机器训练</li>\n<li>使用 Amazon SageMaker BYOS 进行模型训练</li>\n<li>托管部署及推理测试</li>\n</ul>\\n<h4><a id=\\"_21\\"></a>环境准备</h4>\\n我们首先要创建一个 Amazon SageMaker Notebook，笔记本实例类型最好选择 ml.p3.2xlarge，因为本例中用到了本地机器训练的部分用来测试我们的代码，卷大小建议改成10GB或以上，因为运行该项目需要下载一些额外的数据。\n<img src=\\"https://dev-media.amazoncloud.cn/f279d4c7dec249afbf17c029a7f62c62_image.png\\" alt=\\"image.png\\" />\n笔记本启动后，打开页面上的终端，执行以下命令下载代码。\n<pre><code class=\\"lang-\\">cd ~/SageMaker\\ngit clone https://github.com/HaoranLv/nlp_transformer.git\\n</code></pre>\\n<h4><a id=\\"_34\\"></a>下载数据集并将其进行数据预处理</h4>\\n这里给出若干开源的中英文数据集：\n1.公开数据集 (英文)\n<ul>\\n<li>XSUM，227k BBC articles</li>\n<li>CNN/Dailymail，93k articles from the CNN, 220k articles from the Daily Mail</li>\n<li>NEWSROOM，3M article-summary pairs written by authors and editors in the newsrooms of 38 major publications</li>\n<li>Multi-News，56k pairs of news articles and their human-written summaries from the <a href=\\"http://sitenewser.com/\\" target=\\"_blank\\">http://</a><a href=\\"http://sitenewser.com/\\" target=\\"_blank\\">com</a></li>\n<li>Gigaword，4M examples extracted from news articles，the task is to generate theheadline from the first sentence</li>\n<li>arXiv, PubMed，two long documentdatasets of scientific publications from <a href=\\"https://link.zhihu.com/?target=http%3A//arXiv.org\\" target=\\"_blank\\">http://org</a>(113k) andPubMed (215k). The task is to generate the abstract fromthe paper body.</li>\n<li>BIGPATENT，3 millionU.S. patents along with human summaries under nine patent classification categories</li>\n</ul>\\n2.公开数据集 (中文)\n<ul>\\n<li>哈工大的新浪微博短文本摘要 <a href=\\"https://link.zhihu.com/?target=http%3A//icrc.hitsz.edu.cn/Article/show/139.html\\" target=\\"_blank\\">LCSTS</a></li>\\n<li>教育新闻自动摘要语料 <a href=\\"https://link.zhihu.com/?target=https%3A//github.com/wonderfulsuccess/chinese_abstractive_corpus\\" target=\\"_blank\\">chinese_abstractive_corpus</a></li>\\n<li>NLPCC 2017 task3 <a href=\\"https://link.zhihu.com/?target=http%3A//tcci.ccf.org.cn/conference/2017/taskdata.php\\" target=\\"_blank\\">Single Document Summarization</a></li>\\n<li>娱乐新闻等 <a href=\\"https://link.zhihu.com/?target=https%3A//www.dcjingsai.com/common/cmpt/%25E2%2580%259C%25E7%25A5%259E%25E7%25AD%2596%25E6%259D%25AF%25E2%2580%259D2018%25E9%25AB%2598%25E6%25A0%25A1%25E7%25AE%2597%25E6%25B3%2595%25E5%25A4%25A7%25E5%25B8%2588%25E8%25B5%259B_%25E7%25AB%259E%25E8%25B5%259B%25E4%25BF%25A1%25E6%2581%25AF.html\\" target=\\"_blank\\">“神策杯”2018高校算法大师赛</a></li>\\n</ul>\n本文以 Multi-News 为例，数据分为两列，headlines 代表摘要，text 代表全文。由于文本数据集较小，故直接官网下载原始 csv 文件上传到 SageMaker Notebook 即可。如下是部分数据集样例。\n<img src=\\"https://dev-media.amazoncloud.cn/0d6722fce59048ab859b960d4f0e539e_image.png\\" alt=\\"image.png\\" />\n找到 hp_data.ipynb 运行代码。\n首先加载数据集\n<pre><code class=\\"lang-\\">df=pd.read_csv（./data/hp/summary/news_summary.csv'）\\n</code></pre>\\n而后进行数据清洗\n<pre><code class=\\"lang-\\">class Settings:\\n\\n TRAIN_DATA = "./data/hp/summary/news_summary_total.csv"\\n Columns = ['headlines', 'text']\\n encoding = 'latin-1'\\n columns_dict = {"headlines": "headlines", "text": "text"}\\n df_column_list = ['text', 'headlines']\\n SUMMARIZE_KEY = ""\\n SOURCE_TEXT_KEY = 'text'\\n TEST_SIZE = 0.2\\n BATCH_SIZE = 16\\n source_max_token_len = 128\\n target_max_token_len = 50\\n train_df_len = 82332\\n test_df_len = 20583\\n \\nclass Preprocess:\\n def __init__(self):\\n self.settings = Settings\\n\\n def clean_text(self, text):\\n text = text.lower()\\n text = re.sub('\\\\[.*?\\\\]', '', text)\\n text = re.sub('https?://\\\\S+|www\\\\.\\\\S+', '', text)\\n text = re.sub('<.*?>+', '', text)\\n text = re.sub('[%s]' % re.escape(string.punctuation), '', text)\\n text = re.sub('\\\\n', '', text)\\n text = re.sub('\\\\w*\\\\d\\\\w*', '', text)\\n return text\\n\\n def preprocess_data(self, data_path):\\n df = pd.read_csv(data_path, encoding=self.settings.encoding, usecols=self.settings.Columns)\\n # simpleT5 expects dataframe to have 2 columns: "source_text" and "target_text"\\n df = df.rename(columns=self.settings.columns_dict)\\n df = df[self.settings.df_column_list]\\n # T5 model expects a task related prefix: since it is a summarization task, we will add a prefix "summarize: "\\n df[self.settings.SOURCE_TEXT_KEY] = df[self.settings.SOURCE_TEXT_KEY]\\n\\n return df\\nsettings=Settings\\npreprocess=Preprocess()\\ndf = preprocess.preprocess_data(settings.TRAIN_DATA)\\n</code></pre>\\n随后完成训练集和测试集的划分并分别保存：\n<pre><code class=\\"lang-\\">df.to_csv('./data/hp/summary/news_summary_cleaned.csv',index=False)\\ndf2=pd.read_csv('./data/hp/summary/news_summary_cleaned.csv')\\norder=['text','headlines']\\ndf3=df2[order]\\ntrain_df, test_df = train_test_split(df3, test_size=0.2,random_state=100)\\ntrain_df.to_csv('./data/hp/summary/news_summary_cleaned_train.csv',index=False)\\ntest_df.to_csv('./data/hp/summary/news_summary_cleaned_test.csv',index=False)\\n</code></pre>\\n<h4><a id=\\"_126\\"></a>使用本地机器训练</h4>\\n在完成了上述的数据处理过程后，就可以进行模型训练了，下面的命令运行后即开始模型训练，代码会自动 Huggingface hub 中加载 google/pegasus-large 作为预训练模型，而后使用我们处理后的数据集进行模型训练。\n<pre><code class=\\"lang-\\">!python -u examples/pytorch/summarization/run_summarization.py \\\\\\n--model_name_or_path google/pegasus-large \\\\\\n--do_train \\\\\\n--do_eval \\\\\\n--per_device_train_batch_size=2 \\\\\\n--per_device_eval_batch_size=1 \\\\\\n--save_strategy epoch \\\\\\n--evaluation_strategy epoch \\\\\\n--overwrite_output_dir \\\\\\n--predict_with_generate \\\\\\n--train_file './data/hp/summary/news_summary_cleaned_train.csv' \\\\\\n--validation_file './data/hp/summary/news_summary_cleaned_test.csv' \\\\\\n--text_column 'text' \\\\\\n--summary_column 'headlines' \\\\\\n--output_dir='./models/local_train/pegasus-hp' \\\\\\n--num_train_epochs=1.0 \\\\\\n--eval_steps=500 \\\\\\n--save_total_limit=3 \\\\\\n--source_prefix "summarize: " > train_pegasus.log\\n</code></pre>\\n训练完成后，会提示日志信息如下。\n<img src=\\"https://dev-media.amazoncloud.cn/b0ddd3a90222421e86bbd7d60644b078_image.png\\" alt=\\"image.png\\" />\n并且会对验证集的数据进行客观指标评估，这里使用 Rouge 进行评估。\n<img src=\\"https://dev-media.amazoncloud.cn/b9047e668efc426d85b945dde62062f9_image.png\\" alt=\\"image.png\\" />\n模型结果文件及相应的日志等信息会自动保存在./models/local_train/pegasus-hp/checkpoint-500\n<img src=\\"https://dev-media.amazoncloud.cn/4e75f4e2911d44c798c1159b7bfae2f2_image.png\\" alt=\\"image.png\\" />\n我们可以直接用这个产生的模型文件进行本地推理。注意这里的模型文件地址的指定为你刚刚训练产生的。\n<pre><code class=\\"lang-\\">import pandas as pd\\ndf=pd.read_csv('./data/hp/summary/news_summary_cleaned_small_test.csv')\\nprint('原文:',df.loc[0,'text'])\\nprint('真实标签:',df.loc[0,'headlines'])\\nfrom transformers import pipeline\\nsummarizer=pipeline("summarization",model="./models/local_train/Pegasus-hp/checkpoint-500")\\nprint('模型预测:',summarizer(df.loc[0,'text'], max_length=50)[0]['summary_text'])\\n</code></pre>\\n输出如下：\n<pre><code class=\\"lang-\\">原文: Germany on Wednesday accused Vietnam of kidnapping a former Vietnamese oil executive Trinh Xuan Thanh, who allegedly sought asylum in Berlin, and taking him home to face accusations of corruption. Germany expelled a Vietnamese intelligence officer over the suspected kidnapping and demanded that Vietnam allow Thanh to return to Germany. However, Vietnam said Thanh had returned home by himself.\\n真实标签: Germany accuses Vietnam of kidnapping asylum seeker \\n模型预测: Germany accuses Vietnam of kidnapping ex-oil exec, taking him home\\n</code></pre>\\n到这里，就完成了一个模型的本地训练和推理过程。\n<h4><a id=\\"_Amazon_SageMaker_BYOS__186\\"></a>使用 Amazon SageMaker BYOS 进行模型训练</h4>\\n在上文的范例中，我们使用本地环境一步步的训练了一个较小的模型，验证了我们的代码。现在，我们需要把代码进行整理，在 Amazon SageMaker 上，进行可扩展至分布式的托管训练任务。\n首先，我们要将上文的训练代码整理至一个 python 脚本，然后使用 SageMaker 上预配置的 Huggingface 容器，我们提供了很多灵活的使用方式来使用该容器，具体可以参考 <a href=\\"https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html#hugging-face-estimator\\" target=\\"_blank\\">Hugging Face Estimator</a>。\\n由于 SageMaker 预置的 Huggingface 容器已经具备推理逻辑, 故这里只需要将上一步中的训练脚本引入容器即可, 具体流程如下:\n启动一个 Jupyter Notebook，选择 python3 作为解释器完成如下工作：\n权限配置\n<pre><code class=\\"lang-\\">import sagemaker\\nimport os\\nsess = sagemaker.Session()\\nrole = sagemaker.get_execution_role()\\n\\nprint(f"sagemaker role arn: {role}")\\nprint(f"sagemaker bucket: {sess.default_bucket()}")\\nprint(f"sagemaker session region: {sess.boto_region_name}")\\n</code></pre>\\n数据上传到 S3\n<pre><code class=\\"lang-\\"># dataset used\\ndataset_name = ' news_summary'\\n# s3 key prefix for the data\\ns3_prefix = 'datasets/news_summary'\\nWORK_DIRECTORY = './data/'\\ndata_location = sess.upload_data(WORK_DIRECTORY, key_prefix=s3_prefix)\\ndata_location\\n</code></pre>\\n定义超参数并初始化 estimator。\n<pre><code class=\\"lang-\\">from sagemaker.huggingface import HuggingFace\\n\\n# hyperparameters which are passed to the training job\\nhyperparameters={'text_column':'text',\\n 'summary_column':'headlines',\\n 'train_file':'/opt/ml/input/data/train/news_summary_cleaned_train.csv',\\n 'validation_file':'/opt/ml/input/data/test/ news_summary_cleaned_test.csv',\\n 'output_dir':'/opt/ml/model',\\n 'do_train':True,\\n 'do_eval':True,\\n 'max_source_length': 128,\\n 'max_target_length': 128,\\n 'model_name_or_path': 't5-large',\\n 'learning_rate': 3e-4,\\n 'num_train_epochs': 1,\\n 'per_device_train_batch_size': 2,#16\\n 'gradient_accumulation_steps':2, \\n 'save_strategy':'epoch',\\n 'evaluation_strategy':'epoch',\\n 'save_total_limit':1,\\n }\\ndistribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}\\n# create the Estimator\\nhuggingface_estimator = HuggingFace(\\n entry_point='run_paraphrase.py',\\n source_dir='./scripts',\\n instance_type='ml.p3.2xlarge',#'ml.p3dn.24xlarge'\\n instance_count=1,\\n role=role,\\n max_run=24*60*60,\\n transformers_version='4.6',\\n pytorch_version='1.7',\\n py_version='py36',\\n volume_size=128,\\n hyperparameters = hyperparameters,\\n# distribution=distribution\\n)\\n</code></pre>\\n启动模型训练。\n<pre><code class=\\"lang-\\">huggingface_estimator.fit(\\n {'train': data_location+'/news_summary_cleaned_train.csv',\\n 'test': data_location+'/news_summary_cleaned_test.csv',}\\n)\\n\\n</code></pre>\\n训练启动后，我们可以在 Amazon SageMaker 控制台看到这个训练任务，点进详情可以看到训练的日志输出，以及监控机器的 GPU、CPU、内存等的使用率等情况，以确认程序可以正常工作。训练完成后也可以在 CloudWatch 中查看训练日志。\n<img src=\\"https://dev-media.amazoncloud.cn/ed0f3ceb3d6942419c833f807fd1aac9_image.png\\" alt=\\"image.png\\" />\n<img src=\\"https://dev-media.amazoncloud.cn/db18572cf28e402bbf0397f4a03980c7_image.png\\" alt=\\"image.png\\" />\n<h4><a id=\\"_279\\"></a>托管部署及推理测试</h4>\\n完成训练后，我们可以轻松的将上面的模型部署成一个实时可在生产环境中调用的端口。\n<pre><code class=\\"lang-\\">from sagemaker.huggingface.model import HuggingFaceModel\\n\\n# create Hugging Face Model Class\\nhuggingface_model = HuggingFaceModel(\\n# env= {'HF_TASK':'text-generation'},\\n model_data="s3://sagemaker-us-west-2-847380964353/huggingface-pytorch-training-2022-04-19-05-56-07-474/output/model.tar.gz", # path to your trained SageMaker model\\n role=role, # IAM role with permissions to create an endpoint\\n transformers_version="4.6", # Transformers version used\\n pytorch_version="1.7", # PyTorch version used\\n py_version='py36', # Python version used\\n \\n)\\npredictor = huggingface_model.deploy(\\n initial_instance_count=1,\\n instance_type="ml.g4dn.xlarge"\\n)\\n</code></pre>\\n模型调用\n<pre><code class=\\"lang-\\">from sagemaker.huggingface.model import HuggingFacePredictor\\npredictor=HuggingFacePredictor(endpoint_name='huggingface-pytorch-inference-2022-04-19-06-41-55-309')\\n\\nimport time\\ns=time.time()\\ndf=pd.read_csv('./data/hp/summary/news_summary_cleaned_small_test.csv')\\nprint('原文:',df.loc[0,'text'])\\nprint('真实标签:',df.loc[0,'headlines'])\\nout=predictor.predict({\\n 'inputs': df.loc[0,'text'],\\n "parameters": {"max_length": 256},\\n })\\ne=time.time()\\nprint('模型预测:' out)\\n</code></pre>\\n输出如下：\n<pre><code class=\\"lang-\\">原文: Germany on Wednesday accused Vietnam of kidnapping a former Vietnamese oil executive Trinh Xuan Thanh, who allegedly sought asylum in Berlin, and taking him home to face accusations of corruption. Germany expelled a Vietnamese intelligence officer over the suspected kidnapping and demanded that Vietnam allow Thanh to return to Germany. However, Vietnam said Thanh had returned home by himself.\\n真实标签: Germany accuses Vietnam of kidnapping asylum seeker \\n模型预测: Germany accuses Vietnam of kidnapping ex-oil exec, taking him home\\n</code></pre>\\nAmazon SageMaker\n以上就是使用 Amazon SageMaker 构建文本摘要应用的全部过程，可以看到通过 Amazon SageMaker 可以非常便利地结合 Huggingface 进行 NLP 模型的搭建，训练，部署的全流程。整个过程仅需要准备训练脚本以及数据即可通过若干命令启动训练和部署，同时，我们后续还会推出，使用 Amaozn SageMaker 进行更多 NLP 相关任务的实现方式，敬请关注。\n<h4><a id=\\"_333\\"></a>参考资料</h4>\\n<ul>\\n<li>Amazon Sagemaker: https://docs.aws.amazon.com/sagemaker/index.html</li>\n<li>Huggingface：<a href=\\"https://huggingface.co/\\" target=\\"_blank\\">https://huggingface.co/</a></li>\\n<li>Code Link：<a href=\\"https://github.com/HaoranLv/nlp_transformer\\" target=\\"_blank\\">https://github.com/HaoranLv/nlp_transformer</a></li>\\n</ul>\n<h4><a id=\\"_339\\"></a>本篇作者</h4>\\n<img src=\\"https://dev-media.amazoncloud.cn/958956cade3a4ec0bfc986b0655856d6_image.png\\" alt=\\"image.png\\" />\n<h4><a id=\\"_343\\"></a>吕浩然</h4>\\n亚马逊云科技应用科学家，长期从事计算机视觉，自然语言处理等领域的研究和开发工作。支持数据实验室项目，在时序预测，目标检测，OCR，自然语言生成等方向有丰富的算法开发以及落地实践经验。\n"}

亚马逊云科技解决方案基于行业客户应用场景及技术领域的解决方案

联系亚马逊云科技专家