New – Process PDFs, Word Documents, and Images with Amazon Comprehend for IDP

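Overseas Selection gathers high-quality technical content related to Amazon Web Services from around the world. Where this content mentions "AWS", it is an abbreviation of "Amazon Web Services" and is not displayed as a trademark on this site.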

Today we are announcing a new [Amazon Comprehend](https://aws.amazon.com/comprehend/) feature for intelligent document processing (IDP). This feature allows you to classify and extract entities from PDF documents, Microsoft Word files, and images directly in Amazon Comprehend, without needing to extract the text first.

Many customers need to process documents that have a semi-structured format, like scanned images of receipts or tax statements in PDF format. Until today, those customers first needed to preprocess those documents to flatten them into machine-readable text, which can reduce the quality of the document context. Only then could they use Amazon Comprehend to classify and extract entities from the preprocessed files.

Now with Amazon Comprehend for IDP, customers can process their semi-structured documents, such as PDF, docx, PNG, JPG, or TIFF files, as well as plain-text documents, with a single API call. This new feature combines OCR and Amazon Comprehend’s existing natural language processing (NLP) capabilities to classify and extract entities from the documents. The custom document classification API allows you to organize documents into categories or classes, and the custom named-entity recognition API allows you to extract entities such as product codes or business-specific entities from documents. For example, an insurance company can now process scanned customer claims with fewer API calls. Using the Amazon Comprehend entity recognition API, they can extract the customer number from the claims, and using the custom classifier API, they can sort each claim into one of the insurance categories: home, car, or personal.

Starting today, Amazon Comprehend for IDP APIs are available for real-time inferencing of files, as well as for asynchronous batch processing of large document sets. This feature simplifies the document processing pipeline and reduces development effort.

### ++Getting Started++
You can use Amazon Comprehend for IDP from the AWS Management Console, [AWS SDKs](https://aws.amazon.com/tools/), or [AWS Command Line Interface (CLI)](https://aws.amazon.com/cli/).

In this demo, you will see how to asynchronously process a semi-structured file with a custom classifier. For extracting entities, the steps are different, and you can [learn how to do it in the documentation](https://docs.aws.amazon.com/comprehend/latest/dg/custom-entity-recognition.html).

After you train your custom classifier, you can classify documents using either asynchronous or synchronous operations. To use the synchronous operation to analyze a single document, you need to create an endpoint to run real-time analysis using a custom model. You can find more information about [real-time analysis in the documentation](https://docs.aws.amazon.com/comprehend/latest/dg/custom-sync.html). For this demo, you are going to use the asynchronous operation, placing the documents to classify in an [Amazon Simple Storage Service (Amazon S3)](https://aws.amazon.com/s3/) bucket and running a batch analysis job.

To get started classifying documents in batch from the console, on the Amazon Comprehend page, go to **Analysis jobs** and then **Create job**.

![image.png](https://dev-media.amazoncloud.cn/2a9fbbe504ec49368d00e840a2ce0e99_image.png)

Then you can configure the new analysis job. First, input a name, then pick **Custom classification** and the custom classifier you created earlier.

Next, configure the input data. Select the S3 location for that data. In that location, you can place your PDFs, images, and Word documents. Because you are processing semi-structured documents, you need to choose **One document per file**. If you want to override the Amazon Comprehend settings for extracting and parsing the document, you can configure the **Advanced document input** options.

![image.png](https://dev-media.amazoncloud.cn/46fa0d524ece4cd8a78482d05259fb6f_image.png)

After configuring the input data, you can select where the output of this analysis should be stored. You also need to give this analysis job access permissions to read from and write to the specified Amazon S3 locations, and then you are ready to create the job.

![image.png](https://dev-media.amazoncloud.cn/07d699d5d6df41818d71935ed5a1d825_image.png)

The job takes a few minutes to run, depending on the size of the input. When the job is complete, you can check the output results. You can find the results in the Amazon S3 location you specified when you created the job.

In the results folder, you will find a `.out` file for each of the semi-structured files Amazon Comprehend classified. The `.out` file is a JSON file in which each line represents a page of the document. In the `amazon-textract-output` directory, you will find a folder for each classified file, and inside that folder, there is one file per page of the original file. Those page files contain the classification results. To learn more about the outputs of the classifications, check [the documentation page](https://docs.aws.amazon.com/comprehend/latest/dg/how-class-run.html).

![image.png](https://dev-media.amazoncloud.cn/7b9f721063ad4fc9b04ae2bb30024288_image.png)

### ++Available Now++
You can get started [classifying](https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification.html) and [extracting entities](https://docs.aws.amazon.com/comprehend/latest/dg/custom-entity-recognition.html) from semi-structured files like PDFs, images, and Word documents, both asynchronously and synchronously, today in all the Regions where Amazon Comprehend is available. Learn more about this new launch in the [Amazon Comprehend Developer Guide](https://docs.aws.amazon.com/comprehend/latest/dg/what-is.html).

– [Marcia](https://twitter.com/mavi888uy)

![a461c5365d1c9a3ae376470d8b3ddd7.png](https://dev-media.amazoncloud.cn/9f9e0a57de4d4d02a6c10219ce303177_a461c5365d1c9a3ae376470d8b3ddd7.png)

### Marcia Villalba
Marcia Villalba is a Principal Developer Advocate for Amazon Web Services. She has almost 20 years of experience working in the software industry building and scaling applications. Her passion is designing systems that can take full advantage of the cloud and embrace the DevOps culture.
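As a companion to the console walkthrough in Getting Started, the same batch classification job can probably be started from the AWS SDK for Python (boto3). The sketch below is not taken from the original post. It assumes the `StartDocumentClassificationJob` operation with the `ONE_DOC_PER_FILE` input format, and it assumes that the console's **Advanced document input** options correspond to the optional `DocumentReaderConfig` field. The helper names, job name, ARNs, and bucket paths are all hypothetical placeholders.

```python
from typing import Any, Dict, Optional


def build_classification_job_request(
    job_name: str,
    classifier_arn: str,
    input_s3_uri: str,
    output_s3_uri: str,
    role_arn: str,
    reader_config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for StartDocumentClassificationJob (hypothetical helper)."""
    input_config: Dict[str, Any] = {
        "S3Uri": input_s3_uri,
        # Semi-structured files (PDF, Word, images) use one document per file.
        "InputFormat": "ONE_DOC_PER_FILE",
    }
    if reader_config is not None:
        # Assumption: this mirrors the console's "Advanced document input" options.
        input_config["DocumentReaderConfig"] = reader_config
    return {
        "JobName": job_name,
        "DocumentClassifierArn": classifier_arn,
        "InputDataConfig": input_config,
        "OutputDataConfig": {"S3Uri": output_s3_uri},
        # IAM role that lets the job read the input and write the output in S3.
        "DataAccessRoleArn": role_arn,
    }


def start_classification_job(request: Dict[str, Any]) -> str:
    """Submit the job and return its JobId. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above runs without boto3

    client = boto3.client("comprehend")
    response = client.start_document_classification_job(**request)
    return response["JobId"]


if __name__ == "__main__":
    # Every value below is a placeholder, not a real resource.
    example = build_classification_job_request(
        job_name="claims-classification-demo",
        classifier_arn="arn:aws:comprehend:us-east-1:111122223333:document-classifier/example",
        input_s3_uri="s3://example-input-bucket/claims/",
        output_s3_uri="s3://example-output-bucket/results/",
        role_arn="arn:aws:iam::111122223333:role/example-comprehend-role",
    )
    print(example)
```

Keeping the request builder separate from the call that submits the job makes it possible to inspect or unit-test the parameters without AWS credentials. Calling `start_classification_job(example)` would submit the job, and the results would then appear under the output location as described in the walkthrough.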
