Generate synchronized closed captions and audio using the Amazon Polly subtitle generator


{"value":"[Amazon Polly](https://aws.amazon.com/polly/), an AI-powered text-to-speech service, enables you to automate and scale your interactive voice solutions, helping to improve productivity and reduce costs.\n\nAs our customers continue to use Amazon Polly for its rich set of features and ease of use, we have observed a demand for the ability to simultaneously generate synchronized audio and subtitles or closed captions for a given text input. At AWS, we continuously work backward from our customer asks, so in this post, we outline a method to generate audio and subtitles at the same time for a given text.\n\nAlthough subtitles and captions are often used interchangeably, including in this post, there are subtle differences between them:\n\n- **Subtitles** – The text displayed on the screen is in a different language from the audio, and nothing is displayed for non-dialogue audio such as significant sounds. The primary objective is to reach an audience that doesn’t speak the language of the audio.\n- **Captions** (closed/open) – Captions display the dialogue spoken in the audio, in the same language. Their primary purpose is to increase accessibility in cases where the end consumer can’t hear the audio, for any of a range of reasons. Closed captions are stored in a separate file from the audio/video source and can be turned on and off at the user’s discretion, whereas open captions are part of the video file and can’t be turned off.\n\n\n### **Benefits of using Amazon Polly to generate audio with subtitles or closed captions**\n\nImagine the following use case: you prepare a slide-based presentation for an online learning portal. Each slide includes onscreen content and narration. The onscreen content is a basic outline, and the narration goes into detail. Instead of recording a human voice, which can be cumbersome and inconsistent, you can use Amazon Polly to generate the narration. 
Amazon Polly produces high-quality, consistent voices. There’s no need for post-production. In the future, if you need to update a portion of the presentation, you only need to update the affected slides. The voice matches the original slides. Additionally, when you generate the audio with the library described in this post, captions are produced that appear in time with the audio. You save time because there’s no manual recording involved, and save additional time when updates are needed. Your presentation also delivers more value because captions help students consume the content. It’s a win-win-win solution.\n\nCaptions have many use cases: advertisements in social spaces, gymnasiums, coffee shops, and other places where a television typically plays with the audio muted and music in the background; online training and classes; virtual meetings; public electronic announcements; watching videos while commuting without headphones and without disturbing fellow passengers; and several more.\n\nIrrespective of the field of application, closed captioning can help with the following:\n\n\n- **Accessibility** – People with hearing impairments can better consume your content.\n- **Retention** – Online learning is easier for e-learners to grasp and retain when more human senses are involved.\n- **Reachability** – Your content can reach people who have competing priorities, such as gaming and watching news simultaneously, or people whose native language differs from the audio language.\n- **Searchability** – The content is searchable by search engines. 
Most search engines can’t search video optimally, but they can index caption text files, which makes your content more discoverable.\n- **Social courtesy** – Sometimes it may be rude to play audio because of your surroundings, or the audio could be difficult to hear because of the noise of your environment.\n- **Comprehension** – The content is easier to comprehend irrespective of the accent of the speaker, native language of the speaker, or speed of speech. You can also take notes without repeatedly watching the same scene.\n\n\n### **Solution overview**\n\nThe library presented in this post uses Amazon Polly to generate audio and closed captions for an input text. You can easily integrate this library into your text-to-speech applications. It supports several audio formats, and captions in both VTT and SRT file formats, two of the most commonly used caption formats in the industry.\n\nIn this post, we focus on the ```PollyVTT()``` syntax and options, and offer a few examples that demonstrate how to use the Python ```SubtitleGeneratorForPolly``` to simultaneously generate synchronized audio and subtitle files for a given text input. The output audio file format can be PCM (WAV), OGG, or MP3, and the subtitle file format can be VTT or SRT. 
Furthermore, ```SubtitleGeneratorForPolly``` supports all Amazon Polly ```synthesize_speech``` parameters and adds to the rich Amazon Polly feature set.\n\nThe ```polly-vtt``` library and its dependencies are available on [GitHub](https://github.com/aws-samples/amazon-polly-closed-caption-subtitle-generator).\n\n\n### **Install and use the function**\n\nBefore we look at some examples of using ```PollyVTT()```, the function that powers ```SubtitleGeneratorForPolly```, let’s look at its installation and syntax.\n\nInstall the library using the following code:\n\n```\n\npip install\n\n```\n\nTo run from the command line, run ```polly-vtt```:\n\n```\n\nUsage: polly-vtt [OPTIONS] BASE_FILENAME VOICE_ID OUTPUT_FORMAT TEXT\n\n```\n\nThe following code shows your options:\n\n```\n\n--caption-format TEXT 'srt' or 'vtt'\n--help Show this message and exit. \n\nBASE_FILENAME: Base filename for both the audio and caption files \nVOICE_ID: Polly voice to use (Case-sensitive)\nOUTPUT_FORMAT: Amazon Polly output format: pcm, mp3, ogg_vorbis \nTEXT: Full text to be digitized \nCaption format: srt or vtt\n\n```\n\nLet’s look at a few examples now.\n\n### **Example 1**\n\nThis example generates a PCM audio file along with an SRT caption file for two simple sentences:\n\n```\n\n$ polly-vtt testfile Joanna pcm \"this is a test. this is a second sentence.\" --caption-format srt \n\ntestfile.wav written successfully.\ntestfile.wav.srt written successfully.\nTotal Audio Length: 0:00:03.017500 \n# of Sentences: 2\n\n```\n\n### **Example 2**\n\nThis example demonstrates how to use a paragraph of text as input. It generates audio files in WAV, MP3, and OGG, and subtitles in SRT and VTT. 
The following example creates six files for the given input text:\n\n- ```pcm_testfile.wav```\n- ```pcm_testfile.wav.vtt```\n- ```mp3_testfile.mp3```\n- ```mp3_testfile.mp3.vtt```\n- ```ogg_testfile.ogg```\n- ```ogg_testfile.ogg.srt```\n\nSee the following code:\n\n```\n\nfrom polly_vtt import PollyVTT\n\ntext = \"News content is shaped by its own unique characteristics. Sentences and paragraphs are usually short and highly informative because writers have to compress information into a limited space. Depending on the theme, news articles may contain relevant terminology, place names, abbreviations, people’s names, and quotes. Excellent news writing is clear, precise, and avoids ambiguity. The writing is dynamic, especially in online articles, because content may get updated multiple times per day as new information becomes available.\"\n\npolly_vtt = PollyVTT()\n\n# pcm with VTT captions\npolly_vtt.generate(\n\t\"pcm_testfile\",\n\tText=text,\n\tVoiceId=\"Joanna\",\n\tOutputFormat=\"pcm\",\n)\n\n# mp3 with VTT captions\npolly_vtt.generate(\n\t\"mp3_testfile\",\n\tText=text,\n\tVoiceId=\"Joanna\",\n\tOutputFormat=\"mp3\",\n)\n\n# ogg with SRT captions\npolly_vtt.generate(\n\t\"ogg_testfile\",\n\t\"srt\",\n\tText=text,\n\tVoiceId=\"Joanna\",\n\tOutputFormat=\"ogg_vorbis\",\n)\n\n\n```\n\n### **Example 3**\n\nIn most cases, however, you want to pass the text as an input file. 
The following is a Python example of this, with the same output as the previous example:\n\n```\n\nfrom polly_vtt import PollyVTT\n\npolly_vtt = PollyVTT()\n\n# Read the input text once; the with block closes the file even if reading fails\nwith open(\"input.txt\", \"r\") as f:\n\ttext = f.read()\n\n# pcm with VTT captions\ntry:\n\tpolly_vtt.generate(\n\t\t\"pcm_testfile\",\n\t\tText=text,\n\t\tVoiceId=\"Joanna\",\n\t\tOutputFormat=\"pcm\",\n\t)\nexcept Exception as err:\n\tprint(f\"error occurred while converting to PCM: {err}\")\n\n# mp3 with VTT captions\ntry:\n\tpolly_vtt.generate(\n\t\t\"mp3_testfile\",\n\t\tText=text,\n\t\tVoiceId=\"Joanna\",\n\t\tOutputFormat=\"mp3\",\n\t)\nexcept Exception as err:\n\tprint(f\"error occurred while converting to MP3: {err}\")\n\n# ogg with SRT captions\ntry:\n\tpolly_vtt.generate(\n\t\t\"ogg_testfile\",\n\t\t\"srt\",\n\t\tText=text,\n\t\tVoiceId=\"Joanna\",\n\t\tOutputFormat=\"ogg_vorbis\",\n\t)\nexcept Exception as err:\n\tprint(f\"error occurred while converting to OGG: {err}\")\n\n```\n\nThe following video is a testimonial from the AWS internal training team about using Amazon Polly with closed captions:\n<video src=\"https://dev-media.amazoncloud.cn/01bea14f01ae46cf9e959315157e922f_testimonial-construct.mp4\" class=\"manvaVedio\" controls=\"controls\" style=\"width:160px;height:160px\"></video>\n\nThe following video offers a short demo of how the internal training team at AWS uses ```PollyVTT()```:\n<video src=\"https://dev-media.amazoncloud.cn/80299e7b8d8b49bb94c97f690e5ba74e_convert-process.mp4\" class=\"manvaVedio\" controls=\"controls\" style=\"width:160px;height:160px\"></video>\n\n### **Conclusion**\n\nIn this post, we shared a method to generate audio and subtitles at the same time for a given text. 
The ```PollyVTT()``` function and ```SubtitleGeneratorForPolly``` address a common requirement for subtitles in an efficient and effective manner. The Amazon Polly team continues to invent and offer simplified solutions to complex customer requirements.\n\nFor more tutorials and information about Amazon Polly, check out the [AWS Machine Learning Blog](https://aws.amazon.com/blogs/machine-learning/category/artificial-intelligence/amazon-polly/).\n\n\n#### **About the Authors**\n\n![image.png](https://dev-media.amazoncloud.cn/a1ff389c9a70423188310ac6c4afa64a_image.png)\n\n**Abhishek Soni** is a Partner Solutions Architect at AWS. He works with customers to provide technical guidance for the best outcome of workloads on AWS.\n\n![image.png](https://dev-media.amazoncloud.cn/36fbd60e4759496089b44def0e02e477_image.png)\n\n**Dan McKee** uses audio, video, and coffee to distill content into targeted, modular, and structured courses. In his role as Curriculum Developer Project Manager for the NetSec Domain at Amazon Web Services, he leverages his experience in Data Center Networking to help subject matter experts bring ideas to life.\n\n\n![image.png](https://dev-media.amazoncloud.cn/6b1dd7d4e3b04d4ca5bc2b93ebd13edf_image.png)\n\n**Orlando Karam** is a Technical Curriculum Developer at Amazon Web Services, which means he gets to play with cool new technologies and then talk about it. Occasionally, he also uses those cool technologies to make his job easier.\n"}
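The examples in this post produce SRT and VTT files whose cues line up with the generated audio. The post does not describe how the library derives the timing, so the following is only an illustrative sketch of the two caption formats, not the library's implementation. It assumes sentence-level timing data shaped like Amazon Polly speech marks (a `time` offset in milliseconds and a `value` string). The function names `format_timestamp` and `build_captions` and the sample timings are hypothetical and are not part of `polly-vtt`.

```python
from typing import Dict, List


def format_timestamp(ms: int, caption_format: str) -> str:
    """Render milliseconds as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (VTT)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    sep = "," if caption_format == "srt" else "."
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}{sep}{millis:03d}"


def build_captions(marks: List[Dict], total_ms: int, caption_format: str = "vtt") -> str:
    """Build caption text from sentence marks (hypothetical helper).

    Each mark is assumed to look like {"time": <start in ms>, "value": <sentence>}.
    A cue ends where the next one starts; the last cue ends at total_ms.
    """
    if caption_format not in ("srt", "vtt"):
        raise ValueError("caption_format must be 'srt' or 'vtt'")

    # A VTT file starts with a WEBVTT header; an SRT file has no header.
    blocks = ["WEBVTT"] if caption_format == "vtt" else []
    for index, mark in enumerate(marks):
        start = mark["time"]
        end = marks[index + 1]["time"] if index + 1 < len(marks) else total_ms
        timing = (
            f"{format_timestamp(start, caption_format)} --> "
            f"{format_timestamp(end, caption_format)}"
        )
        blocks.append(f"{index + 1}\n{timing}\n{mark['value']}")
    return "\n\n".join(blocks) + "\n"


# Illustrative timings only; they are not real Amazon Polly output.
sample_marks = [
    {"time": 0, "value": "this is a test."},
    {"time": 1400, "value": "this is a second sentence."},
]
print(build_captions(sample_marks, total_ms=3017, caption_format="srt"))
```

The only differences between the two outputs in this sketch are the `WEBVTT` header and the decimal separator in the timestamps (a comma for SRT, a period for VTT), which is why a single helper can serve both formats.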
