Localize content into multiple languages using Amazon machine learning services

海外精选

海外精选的内容汇集了全球优质的亚马逊云科技相关技术内容。同时，内容中提到的“AWS” 是 “Amazon Web Services” 的缩写，在此网站不作为商标展示。

{"value":"Over the last few years, online education platforms have seen an increase in adoption of and an uptick in demand for video-based learnings because it offers an effective medium to engage learners. To expand to international markets and address a culturally and linguistically diverse population, businesses are also looking at diversifying their learning offerings by localizing content into multiple languages. These businesses are looking for reliable and cost-effective ways to solve their localization use cases.\n\nLocalizing content mainly includes translating original voices into new languages and adding visual aids such as subtitles. Traditionally, this process is cost-prohibitive, manual, and takes a lot of time, including working with localization specialists. With the power of AWS machine learning (ML) services such as [Amazon Transcribe](https://aws.amazon.com/transcribe/), [Amazon Translate](https://aws.amazon.com/translate/), and [Amazon Polly](https://aws.amazon.com/polly/), you can create a viable and a cost-effective localization solution. You can use Amazon Transcribe to create a transcript of your existing audio and video streams, and then translate this transcript into multiple languages using Amazon Translate. You can then use Amazon Polly, a text-to speech service, to convert the translated text into natural-sounding human speech.\n\nThe next step of localization is to add subtitles to the content, which can improve accessibility and comprehension, and help viewers understand the videos better. Subtitle creation on video content can be challenging because the translated speech doesn’t match the original speech timing. This synchronization between audio and subtitles is a critical task to consider because it might disconnect the audience from your content if they’re not in sync. Amazon Polly offers a solution to this challenge through enabling [speech marks](https://docs.aws.amazon.com/polly/latest/dg/speechmarks.html), which you can use to create a subtitle file that can be synced with the generated speech output.\n\nIn this post, we review a localization solution using AWS ML services where we use an original English video and convert it into Spanish. We also focus on using speech marks to create a synced subtitle file in Spanish.\n\n### **Solution overview**\n\nThe following diagram illustrates the solution architecture.\n\n![image.png](https://dev-media.amazoncloud.cn/e9dfa2537bc84def9cb9bb557cdb481d_image.png)\n\nThe solution takes a video file and the target language settings as input and uses Amazon Transcribe to create a transcription of the video. We then use Amazon Translate to translate the transcript to the target language. The translated text is provided as an input to Amazon Polly to generate the audio stream and speech marks in the target language. Amazon Polly returns [speech mark output](https://docs.aws.amazon.com/polly/latest/dg/using-speechmarks.html#output) in a line-delimited JSON stream, which contains the fields such as time, type, start, end, and value. The value could vary depending on the type of speech mark requested in the input, such as [SSML](https://docs.aws.amazon.com/polly/latest/dg/ssml.html), [viseme](https://docs.aws.amazon.com/polly/latest/dg/viseme.html), word, or sentence. For the purpose of our example, we requested the [speech mark type](https://docs.aws.amazon.com/polly/latest/dg/using-speechmarks1.html) as ```word```. With this option, Amazon Polly breaks a sentence into its individual words in the sentence and their start and end times in the audio stream. With this metadata, the speech marks are then processed to generate the subtitles for the corresponding audio stream generated by Amazon Polly.\n\nFinally, we use [AWS Elemental MediaConvert](https://aws.amazon.com/mediaconvert/) to render the final video with the translated audio and corresponding subtitles.\n\nThe following video demonstrates the final outcome of the solution:\n\n<video src=\"https://dev-media.amazoncloud.cn/ea53aed53f9545eca029773676629748_sample-video-for-polly-blog.mp4\" class=\"manvaVedio\" controls=\"controls\" style=\"width:160px;height:160px\"></video>\n\n### **AWS Step Functions workflow**\n\nWe use [AWS Step Functions](http://aws.amazon.com/step-functions) to orchestrate this process. The following figure shows a high-level view of the Step Functions workflow (some steps are omitted from the diagram for better clarity).\n\n![image.png](https://dev-media.amazoncloud.cn/1a7cf1e1de64412fb845a87042719b3b_image.png)\n\n\nThe workflow steps are as follows:\n\n1. A user uploads the source video file to an [Amazon Simple Storage Service ](https://aws.amazon.com/s3/)(Amazon S3) bucket.\n2. The [S3 event notification](https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html) triggers the [AWS Lambda](http://aws.amazon.com/lambda) function [state_machine.py](https://github.com/aws-samples/localize-content-using-aws-ml-services/blob/main/lambda/statemachine/state_machine.py) (not shown in the diagram), which invokes the Step Functions state machine.\n3. The first step, **Transcribe audio**, invokes the Lambda function [transcribe.py](https://github.com/aws-samples/localize-content-using-aws-ml-services/blob/main/lambda/transcribe/transcribe.py), which uses Amazon Transcribe to generate a transcript of the audio from the source video.\n\n\nThe following sample code demonstrates how to create a transcription job using the Amazon Transcribe [Boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/transcribe.html) Python SDK:\n\n```\nresponse = transcribe.start_transcription_job(\n TranscriptionJobName = jobName,\n MediaFormat=media_format,\n Media = {\"MediaFileUri\": \"s3://\"+bucket+\"/\"+mediaconvert_video_filename},\n OutputBucketName=bucket,\n OutputKey=outputKey,\n IdentifyLanguage=True\n)\n\n```\n\nAfter the job is complete, the output files are saved into the S3 bucket and the process continues to the next step of translating the content.\n\n4. The Translate transcription step invokes the Lambda function [translate.py](https://github.com/aws-samples/localize-content-using-aws-ml-services/blob/main/lambda/translate/translate.py) which uses Amazon Translate to translate the transcript to the target language. Here, we use the synchronous/real-time translation using the [translate_text](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/translate.html#Translate.Client.translate_text) function:\n\n\n```\n# Real-time translation\nresponse = translate.translate_text(\n Text=transcribe_text,\n SourceLanguageCode=source_language_code,\n TargetLanguageCode=target_language_code,\n)\n\n```\n\nSynchronous translation has limits on the document size it can translate; as of this writing, it’s set to 5,000 bytes. For larger document sizes, consider using an asynchronous route of creating the job using [start_text_translation_job](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/translate.html#Translate.Client.start_text_translation_job) and checking the status via [describe_text_translation_job](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/translate.html#Translate.Client.describe_text_translation_job).\n\n5. The next step is a Step Functions Parallel state, where we create parallel branches in our state machine.\n a. In the first branch, we invoke the Lambda function the Lambda function [generate_polly_audio.py](https://github.com/aws-samples/localize-content-using-aws-ml-services/blob/main/lambda/polly/generate_polly_audio.py) to generate our Amazon Polly audio stream:\n\n\n```\n# Set up the polly and translate services\nclient = boto3.client('polly')\n\n# Use the translated text to create the synthesized speech\nresponse = client.start_speech_synthesis_task(\n Engine=\"standard\", LanguageCode=\"es\", OutputFormat=\"mp3\",\n SampleRate=\"22050\", Text=polly_text, VoiceId=\"Miguel\",\n TextType=\"text\",\n OutputS3BucketName=\"S3-bucket-name\",\n OutputS3KeyPrefix=\"-polly-recording\")\naudio_task_id = response['SynthesisTask']['TaskId']\n\n```\nHere we use the [start_speech_synthesis_task](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/polly.html#Polly.Client.start_speech_synthesis_task) method of the Amazon Polly Python SDK to trigger the speech synthesis task that creates the Amazon Polly audio. We set the ```OutputFormat``` to ```mp3```, which tells Amazon Polly to generate an audio stream for this API call.\n\nb. In the second branch, we invoke the Lambda function generate_speech_marks.py to generate speech marks output:\n\n```\n....\n# Use the translated text to create the speech marks\nresponse = client.start_speech_synthesis_task(\n Engine=\"standard\", LanguageCode=\"es\", OutputFormat=\"json\",\n SampleRate=\"22050\", Text=polly_text, VoiceId=\"Miguel\",\n TextType=\"text\", SpeechMarkTypes=['word'],\n OutputS3BucketName=\"S3-bucket-name\", \n OutputS3KeyPrefix=\"-polly-speech-marks\")\nspeechmarks_task_id = response['SynthesisTask']['TaskId']\n\n```\nWe again use the [start_speech_synthesis_task](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/polly.html#Polly.Client.start_speech_synthesis_task) method but specify ```OutputFormat``` to ```json```, which tells Amazon Polly to generate speech marks for this API call.\n\nIn the next step of the second branch, we invoke the Lambda function [generate_subtitles.py](https://github.com/aws-samples/localize-content-using-aws-ml-services/blob/main/lambda/polly/generate_subtitles.py), which implements the logic to generate a subtitle file from the speech marks output.\n\nIt uses the Python module in the file [webvtt_utils.py](https://github.com/aws-samples/localize-content-using-aws-ml-services/blob/main/lambda/polly/webvtt_utils.py). This module has multiple utility functions to create the subtitle file; one such method ```get_phrases_from_speechmarks``` is responsible for parsing the speech marks file. The speech marks JSON structure provides just the start time for each word individually. To create the subtitle timing required for the SRT file, we first create phrases of about n (where n=10) words from the list of words in the speech marks file. Then we write them into the SRT file format, taking the start time from the first word in the phrase, and for the end time we use the start time of the (n+1) word and subtract it by 1 to create the sequenced entry. The following function creates the phrases in preparation for writing them to the SRT file:\n\n```\ndef get_phrases_from_speechmarks(words, transcript):\n.....\n\n for item in items:\n # if it is a new phrase, then get the start_time of the first item\n if n_phrase:\n phrase[\"start_time\"] = get_time_code(words[c][\"start_time\"])\n n_phrase = False\n\n else:\n if c == len(words) - 1:\n phrase[\"end_time\"] = get_time_code(words[c][\"start_time\"])\n else:\n phrase[\"end_time\"] = get_time_code(words[c + 1][\"start_time\"] - 1)\n\n # in either case, append the word to the phrase...\n phrase[\"words\"].append(item)\n x += 1\n\n # now add the phrase to the phrases, generate a new phrase, etc.\n if x == 10 or c == (len(items) - 1):\n # print c, phrase\n if c == (len(items) - 1):\n if phrase[\"end_time\"] == '':\n start_time = words[c][\"start_time\"]\n end_time = int(start_time) + 500\n phrase[\"end_time\"] = get_time_code(end_time)\n\n phrases.append(phrase)\n phrase = new_phrase()\n n_phrase = True\n x = 0\n\n .....\n\n return phrases\n\n```\n\n6. The final step, **Media Convert**, invokes the Lambda function [create_mediaconvert_job.py](https://github.com/aws-samples/localize-content-using-aws-ml-services/blob/main/lambda/mediaconvert/create_mediaconvert_job.py) to combine the audio stream from Amazon Polly and the subtitle file with the source video file to generate the final output file, which is then stored in an S3 bucket. This step uses ```MediaConvert```, a file-based video transcoding service with broadcast-grade features. It allows you to easily create video-on-demand content and combines advanced video and audio capabilities with a simple web interface. Here again we use the Python [Boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/mediaconvert.html) SDK to create a ```MediaConvert job```:\n\n```\n……\njob_metadata = {'asset_id': asset_id, 'application': \"createMediaConvertJob\"}\nmc_client = boto3.client('mediaconvert', region_name=region)\nendpoints = mc_client.describe_endpoints()\nmc_endpoint_url = endpoints['Endpoints'][0]['Url']\n\nmc = boto3.client('mediaconvert', region_name=region, endpoint_url=mc_endpoint_url, verify=True)\n\nmc.create_job(Role=mediaconvert_role_arn, UserMetadata=job_metadata, Settings=mc_data[\"Settings\"])\n\n```\n\n### **Prerequisites**\n\nBefore getting started, you must have the following prerequisites:\n\n- An [AWS account](https://aws.amazon.com/free)\n- [AWS Cloud Development Kit](https://aws.amazon.com/cdk/) (AWS CDK)\n\n\n### **Deploy the solution**\n\nTo deploy the solution using the AWS CDK, complete the following steps:\n\n1. Clone the [repository](https://github.com/aws-samples/localize-content-using-aws-ml-services):\n\n\n```\ngit clone https://github.com/aws-samples/localize-content-using-aws-ml-services.git \n\n```\n\n2. To make sure the AWS CDK is [bootstrapped](https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_bootstrap), run the command ```cdk bootstrap``` from the root of the repository:\n\n```\n$ cdk bootstrap\n⏳ Bootstrapping environment aws://<acct#>/<region>...\nTrusted accounts for deployment: (none)\nTrusted accounts for lookup: (none)\nUsing default execution policy of 'arn:aws:iam::aws:policy/AdministratorAccess'. Pass '--cloudformation-execution-policies' to customize.\n✅ Environment aws://<acct#>/<region> bootstrapped (no changes).\n\n```\n\n3. Change the working directory to the root of the repository and run the following command:\n\n```\ncdk deploy\n\n```\n\nBy default, the target audio settings are set to US Spanish (```es-US```). If you plan to test it with a different target language, use the following command:\n\n```\n\ncdk deploy --parameters pollyLanguageCode=<pollyLanguageCode> \n --parameters pollyVoiceId=<pollyVoiceId>\n --parameters pollyEngine=<pollyEngine> \n --parameters mediaConvertLangShort=<mediaConvertLangShort>\n --parameters mediaConvertLangLong=<mediaConvertLangLong>\n --parameters targetLanguageCode=<targetLanguageCode>\n\n```\n\nThe process takes a few minutes to complete, after which it displays a link that you can use to view the target video file with the translated audio and translated subtitles.\n\n![image.png](https://dev-media.amazoncloud.cn/ccbb4314f1584d4c8de79ca16c4295f7_image.png)\n\n### **Test the solution**\n\nTo test this solution, we used a small portion of the following [AWS re:Invent 2017 video](https://www.youtube.com/embed/1IxDLeFQKPk?start=6904&end=7073) from YouTube, where Amazon Transcribe was first introduced. You can also test the solution with your own video. The original language of our test video is English. When you deploy this solution, you can specify the target audio settings or you can use the default target audio settings, which uses Spanish for generating audio and subtitles. The solution creates an S3 bucket that can be used to upload the video file to.\n\n1. On the Amazon S3 console, navigate to the bucket ```PollyBlogBucket```.\n\n\n![image.png](https://dev-media.amazoncloud.cn/bcc30405170c4417be6f75885f55e56c_image.png)\n\n2. Choose the bucket, navigate to the ```/inputVideo``` directory, and upload the video file (the solution is tested with videos of type mp4). At this point, an S3 event notification triggers the Lambda function, which starts the state machine.\n3. On the Step Functions console, browse to the state machine (```ProcessAudioWithSubtitles```).\n4. Choose one of the runs of the state machine to locate the **Graph Inspector**.\n\nThis shows the run results for each state. The Step Functions workflow takes a few minutes to complete, after which you can verify if all the steps successfully completed.\n\n![image.png](https://dev-media.amazoncloud.cn/2ee9acbe56804fcfae7ed08194d08c5f_image.png)\n\n\n### **Review the output**\n\nTo review the output, open the Amazon S3 console and check if the audio file (.mp3) and the speech mark file (.marks) are stored in the S3 bucket under ```<ROOT_S3_BUCKET>/<UID>/synthesisOutput/```.\n\n![image.png](https://dev-media.amazoncloud.cn/404fa3c0992643ef80d4d3e6fbb5d260_image.png)\n\nThe following is a sample of the speech mark file generated from the translated text:\n\n\n\n```\n{\"time\":6,\"type\":\"word\",\"start\":2,\"end\":6,\"value\":\"Qué\"}\n{\"time\":109,\"type\":\"word\",\"start\":7,\"end\":10,\"value\":\"tal\"}\n{\"time\":347,\"type\":\"word\",\"start\":11,\"end\":13,\"value\":\"el\"}\n{\"time\":453,\"type\":\"word\",\"start\":14,\"end\":20,\"value\":\"idioma\"}\n{\"time\":1351,\"type\":\"word\",\"start\":22,\"end\":24,\"value\":\"Ya\"}\n{\"time\":1517,\"type\":\"word\",\"start\":25,\"end\":30,\"value\":\"sabes\"}\n{\"time\":2240,\"type\":\"word\",\"start\":32,\"end\":38,\"value\":\"hablé\"}\n{\"time\":2495,\"type\":\"word\",\"start\":39,\"end\":44,\"value\":\"antes\"}\n{\"time\":2832,\"type\":\"word\",\"start\":45,\"end\":50,\"value\":\"sobre\"}\n{\"time\":3125,\"type\":\"word\",\"start\":51,\"end\":53,\"value\":\"el\"}\n{\"time\":3227,\"type\":\"word\",\"start\":54,\"end\":59,\"value\":\"hecho\"}\n{\"time\":3464,\"type\":\"word\",\"start\":60,\"end\":62,\"value\":\"de\"}\n\n```\n\nIn this output, each part of the text is broken out in terms of speech marks:\n\n- **time** – The timestamp in milliseconds from the beginning of the corresponding audio stream\n- **type** – The type of speech mark (sentence, word, viseme, or SSML)\n- **start** – The offset in bytes (not characters) of the start of the object in the input text (not including viseme marks)\n- **end** – The offset in bytes (not characters) of the object’s end in the input text (not including viseme marks)\n- **value** – Individual words in the sentence\n\nThe generated subtitle file is written back to the S3 bucket. You can find the file under ```<ROOT_S3_BUCKET>/<UID>/subtitlesOutput/```. Inspect the subtitle file; the content should be similar to the following text:\n\n```\n1\n00:00:00,006 --> 00:00:03,226\n¿Qué tal el idioma? Ya sabes, hablé antes sobre el\n\n2\n00:00:03,227 --> 00:00:06,065\nhecho de que el año pasado lanzamos Polly y Lex,\n\n3\n00:00:06,066 --> 00:00:09,263\npero hay muchas otras cosas que los constructores quieren hacer\n\n4\n00:00:09,264 --> 00:00:11,642\ncon el lenguaje. Y una de las cosas que ha\n\n5\n00:00:11,643 --> 00:00:14,549\nsido interesante es que ahora hay tantos datos que están\n\n```\n\nAfter the subtitles file and audio file are generated, the final source video file is created using MediaConvert. Check the MediaConvert console to verify if the job status is ```COMPLETE```.\n\nWhen the MediaConvert job is complete, the final video file is generated and saved back to the S3 bucket, which can be found under ```<ROOT_S3_BUCKET>/<UID>/convertedAV/```.\n\nAs part of this deployment, the final video is distributed through an[ Amazon CloudFront](https://aws.amazon.com/cloudfront/) (CDN) link and displayed in the terminal or in the [AWS CloudFormation](https://aws.amazon.com/cloudformation/) console.\n\nOpen the URL in a browser to view the original video with additional options for audio and subtitles. You can verify that the translated audio and subtitles are in sync.\n\n### **Conclusion**\n\nIn this post, we discussed how to create new language versions of video files without the need of manual intervention. Content creators can use this process to synchronize the audio and subtitles of their videos and reach a global audience.\n\nYou can easily integrate this approach into your own production pipelines to handle large volumes and scale according to your needs. Amazon Polly uses [Neural TTS (NTTS)](https://docs.aws.amazon.com/polly/latest/dg/NTTS-main.html) to produce natural and human-like text-to-speech voices. It also supports [generating speech from SSML](https://docs.aws.amazon.com/polly/latest/dg/ssml.html), which gives you additional control over how Amazon Polly generates speech from the text provided. Amazon Polly also provides a [variety of different voices](https://docs.aws.amazon.com/polly/latest/dg/voicelist.html) in multiple languages to support your needs.\n\n\nGet started with AWS machine learning services by visiting the [product page](https://aws.amazon.com/machine-learning/), or refer the [Amazon Machine Learning Solutions Lab](https://aws.amazon.com/ml-solutions-lab/) page where you can collaborate with experts to bring machine learning solutions to your organization.\n\n### **Additional resources**\n\nFor more information about the services used in this solution, refer to the following:\n\n- [Amazon Transcribe Developer Guide](https://docs.aws.amazon.com/transcribe/latest/dg/transcribe-dg.pdf)\n- [Amazon Translate Developer Guide](https://docs.aws.amazon.com/translate/latest/dg/translate-dg.pdf)\n- [AWS Polly Developer Guide](https://docs.aws.amazon.com/polly/latest/dg/what-is.html)\n- [AWS Step Functions Developer Guide](https://docs.aws.amazon.com/step-functions/latest/dg/step-functions-dg.pdf)\n- [AWS Elemental MediaConvert User Guide](https://docs.aws.amazon.com/mediaconvert/latest/ug/mediaconvert-guide.pdf)\n- [Languages Supported by Amazon Polly](https://docs.aws.amazon.com/polly/latest/dg/SupportedLanguage.html)\n- [Languages Supported by Amazon Transcribe](https://docs.aws.amazon.com/transcribe/latest/dg/supported-languages.html)\n\n#### **About the authors**\n\n![image.png](https://dev-media.amazoncloud.cn/cecd9facd8a2422ea7634705a6d220da_image.png)\n\n\n**Reagan Rosario** works as a solutions architect at AWS focusing on education technology companies. He loves helping customers build scalable, highly available, and secure solutions in the AWS Cloud. He has more than a decade of experience working in a variety of technology roles, with a focus on software engineering and architecture.\n\n![image.png](https://dev-media.amazoncloud.cn/bc2a050ec94346fa9ee2576d0243fc74_image.png)\n\n**Anil Kodali** is a Solutions Architect with Amazon Web Services. He works with AWS EdTech customers, guiding them with architectural best practices for migrating existing workloads to the cloud and designing new workloads with a cloud-first approach. Prior to joining AWS, he worked with large retailers to help them with their cloud migrations.\n\n![image.png](https://dev-media.amazoncloud.cn/07046593896e45d6bd9f09349370402e_image.png)\n\n**Prasanna Saraswathi Krishnan** is a Solutions Architect with Amazon Web Services working with EdTech customers. He helps them drive their cloud architecture and data strategy using best practices. His background is in distributed computing, big data analytics, and data engineering. He is passionate about machine learning and natural language processing.","render":"Over the last few years, online education platforms have seen an increase in adoption of and an uptick in demand for video-based learnings because it offers an effective medium to engage learners. To expand to international markets and address a culturally and linguistically diverse population, businesses are also looking at diversifying their learning offerings by localizing content into multiple languages. These businesses are looking for reliable and cost-effective ways to solve their localization use cases.\nLocalizing content mainly includes translating original voices into new languages and adding visual aids such as subtitles. Traditionally, this process is cost-prohibitive, manual, and takes a lot of time, including working with localization specialists. With the power of AWS machine learning (ML) services such as <a href=\"https://aws.amazon.com/transcribe/\" target=\"_blank\">Amazon Transcribe</a>, <a href=\"https://aws.amazon.com/translate/\" target=\"_blank\">Amazon Translate</a>, and <a href=\"https://aws.amazon.com/polly/\" target=\"_blank\">Amazon Polly</a>, you can create a viable and a cost-effective localization solution. You can use Amazon Transcribe to create a transcript of your existing audio and video streams, and then translate this transcript into multiple languages using Amazon Translate. You can then use Amazon Polly, a text-to speech service, to convert the translated text into natural-sounding human speech.\nThe next step of localization is to add subtitles to the content, which can improve accessibility and comprehension, and help viewers understand the videos better. Subtitle creation on video content can be challenging because the translated speech doesn’t match the original speech timing. This synchronization between audio and subtitles is a critical task to consider because it might disconnect the audience from your content if they’re not in sync. Amazon Polly offers a solution to this challenge through enabling <a href=\"https://docs.aws.amazon.com/polly/latest/dg/speechmarks.html\" target=\"_blank\">speech marks</a>, which you can use to create a subtitle file that can be synced with the generated speech output.\nIn this post, we review a localization solution using AWS ML services where we use an original English video and convert it into Spanish. We also focus on using speech marks to create a synced subtitle file in Spanish.\n<h3><a id=\"Solution_overview_8\"></a>Solution overview</h3>\nThe following diagram illustrates the solution architecture.\n<img src=\"https://dev-media.amazoncloud.cn/e9dfa2537bc84def9cb9bb557cdb481d_image.png\" alt=\"image.png\" />\nThe solution takes a video file and the target language settings as input and uses Amazon Transcribe to create a transcription of the video. We then use Amazon Translate to translate the transcript to the target language. The translated text is provided as an input to Amazon Polly to generate the audio stream and speech marks in the target language. Amazon Polly returns <a href=\"https://docs.aws.amazon.com/polly/latest/dg/using-speechmarks.html#output\" target=\"_blank\">speech mark output</a> in a line-delimited JSON stream, which contains the fields such as time, type, start, end, and value. The value could vary depending on the type of speech mark requested in the input, such as <a href=\"https://docs.aws.amazon.com/polly/latest/dg/ssml.html\" target=\"_blank\">SSML</a>, <a href=\"https://docs.aws.amazon.com/polly/latest/dg/viseme.html\" target=\"_blank\">viseme</a>, word, or sentence. For the purpose of our example, we requested the <a href=\"https://docs.aws.amazon.com/polly/latest/dg/using-speechmarks1.html\" target=\"_blank\">speech mark type</a> as <code>word</code>. With this option, Amazon Polly breaks a sentence into its individual words in the sentence and their start and end times in the audio stream. With this metadata, the speech marks are then processed to generate the subtitles for the corresponding audio stream generated by Amazon Polly.\nFinally, we use <a href=\"https://aws.amazon.com/mediaconvert/\" target=\"_blank\">AWS Elemental MediaConvert</a> to render the final video with the translated audio and corresponding subtitles.\nThe following video demonstrates the final outcome of the solution:\n<video src=\"https://dev-media.amazoncloud.cn/ea53aed53f9545eca029773676629748_sample-video-for-polly-blog.mp4\" controls=\"controls\"></video>\n<h3><a id=\"AWS_Step_Functions_workflow_22\"></a>AWS Step Functions workflow</h3>\nWe use <a href=\"http://aws.amazon.com/step-functions\" target=\"_blank\">AWS Step Functions</a> to orchestrate this process. The following figure shows a high-level view of the Step Functions workflow (some steps are omitted from the diagram for better clarity).\n<img src=\"https://dev-media.amazoncloud.cn/1a7cf1e1de64412fb845a87042719b3b_image.png\" alt=\"image.png\" />\nThe workflow steps are as follows:\n<ol>\n<li>A user uploads the source video file to an <a href=\"https://aws.amazon.com/s3/\" target=\"_blank\">Amazon Simple Storage Service </a>(Amazon S3) bucket.</li>\n<li>The <a href=\"https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html\" target=\"_blank\">S3 event notification</a> triggers the <a href=\"http://aws.amazon.com/lambda\" target=\"_blank\">AWS Lambda</a> function <a href=\"https://github.com/aws-samples/localize-content-using-aws-ml-services/blob/main/lambda/statemachine/state_machine.py\" target=\"_blank\">state_machine.py</a> (not shown in the diagram), which invokes the Step Functions state machine.</li>\n<li>The first step, Transcribe audio, invokes the Lambda function <a href=\"https://github.com/aws-samples/localize-content-using-aws-ml-services/blob/main/lambda/transcribe/transcribe.py\" target=\"_blank\">transcribe.py</a>, which uses Amazon Transcribe to generate a transcript of the audio from the source video.</li>\n</ol>\nThe following sample code demonstrates how to create a transcription job using the Amazon Transcribe <a href=\"https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/transcribe.html\" target=\"_blank\">Boto3</a> Python SDK:\n<pre><code class=\"lang-\">response = transcribe.start_transcription_job(\n TranscriptionJobName = jobName,\n MediaFormat=media_format,\n Media = {"MediaFileUri": "s3://"+bucket+"/"+mediaconvert_video_filename},\n OutputBucketName=bucket,\n OutputKey=outputKey,\n IdentifyLanguage=True\n)\n\n</code></pre>\nAfter the job is complete, the output files are saved into the S3 bucket and the process continues to the next step of translating the content.\n<ol start=\"4\">\n<li>The Translate transcription step invokes the Lambda function <a href=\"https://github.com/aws-samples/localize-content-using-aws-ml-services/blob/main/lambda/translate/translate.py\" target=\"_blank\">translate.py</a> which uses Amazon Translate to translate the transcript to the target language. Here, we use the synchronous/real-time translation using the <a href=\"https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/translate.html#Translate.Client.translate_text\" target=\"_blank\">translate_text</a> function:</li>\n</ol>\n<pre><code class=\"lang-\"># Real-time translation\nresponse = translate.translate_text(\n Text=transcribe_text,\n SourceLanguageCode=source_language_code,\n TargetLanguageCode=target_language_code,\n)\n\n</code></pre>\nSynchronous translation has limits on the document size it can translate; as of this writing, it’s set to 5,000 bytes. For larger document sizes, consider using an asynchronous route of creating the job using <a href=\"https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/translate.html#Translate.Client.start_text_translation_job\" target=\"_blank\">start_text_translation_job</a> and checking the status via <a href=\"https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/translate.html#Translate.Client.describe_text_translation_job\" target=\"_blank\">describe_text_translation_job</a>.\n<ol start=\"5\">\n<li>The next step is a Step Functions Parallel state, where we create parallel branches in our state machine. \na. In the first branch, we invoke the Lambda function the Lambda function <a href=\"https://github.com/aws-samples/localize-content-using-aws-ml-services/blob/main/lambda/polly/generate_polly_audio.py\" target=\"_blank\">generate_polly_audio.py</a> to generate our Amazon Polly audio stream:</li>\n</ol>\n<pre><code class=\"lang-\"># Set up the polly and translate services\nclient = boto3.client('polly')\n\n# Use the translated text to create the synthesized speech\nresponse = client.start_speech_synthesis_task(\n Engine="standard", LanguageCode="es", OutputFormat="mp3",\n SampleRate="22050", Text=polly_text, VoiceId="Miguel",\n TextType="text",\n OutputS3BucketName="S3-bucket-name",\n OutputS3KeyPrefix="-polly-recording")\naudio_task_id = response['SynthesisTask']['TaskId']\n\n</code></pre>\nHere we use the <a href=\"https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/polly.html#Polly.Client.start_speech_synthesis_task\" target=\"_blank\">start_speech_synthesis_task</a> method of the Amazon Polly Python SDK to trigger the speech synthesis task that creates the Amazon Polly audio. We set the <code>OutputFormat</code> to <code>mp3</code>, which tells Amazon Polly to generate an audio stream for this API call.\nb. In the second branch, we invoke the Lambda function generate_speech_marks.py to generate speech marks output:\n<pre><code class=\"lang-\">....\n# Use the translated text to create the speech marks\nresponse = client.start_speech_synthesis_task(\n Engine="standard", LanguageCode="es", OutputFormat="json",\n SampleRate="22050", Text=polly_text, VoiceId="Miguel",\n TextType="text", SpeechMarkTypes=['word'],\n OutputS3BucketName="S3-bucket-name", \n OutputS3KeyPrefix="-polly-speech-marks")\nspeechmarks_task_id = response['SynthesisTask']['TaskId']\n\n</code></pre>\nWe again use the <a href=\"https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/polly.html#Polly.Client.start_speech_synthesis_task\" target=\"_blank\">start_speech_synthesis_task</a> method but specify <code>OutputFormat</code> to <code>json</code>, which tells Amazon Polly to generate speech marks for this API call.\nIn the next step of the second branch, we invoke the Lambda function <a href=\"https://github.com/aws-samples/localize-content-using-aws-ml-services/blob/main/lambda/polly/generate_subtitles.py\" target=\"_blank\">generate_subtitles.py</a>, which implements the logic to generate a subtitle file from the speech marks output.\nIt uses the Python module in the file <a href=\"https://github.com/aws-samples/localize-content-using-aws-ml-services/blob/main/lambda/polly/webvtt_utils.py\" target=\"_blank\">webvtt_utils.py</a>. This module has multiple utility functions to create the subtitle file; one such method <code>get_phrases_from_speechmarks</code> is responsible for parsing the speech marks file. The speech marks JSON structure provides just the start time for each word individually. To create the subtitle timing required for the SRT file, we first create phrases of about n (where n=10) words from the list of words in the speech marks file. Then we write them into the SRT file format, taking the start time from the first word in the phrase, and for the end time we use the start time of the (n+1) word and subtract it by 1 to create the sequenced entry. The following function creates the phrases in preparation for writing them to the SRT file:\n<pre><code class=\"lang-\">def get_phrases_from_speechmarks(words, transcript):\n.....\n\n for item in items:\n # if it is a new phrase, then get the start_time of the first item\n if n_phrase:\n phrase["start_time"] = get_time_code(words[c]["start_time"])\n n_phrase = False\n\n else:\n if c == len(words) - 1:\n phrase["end_time"] = get_time_code(words[c]["start_time"])\n else:\n phrase["end_time"] = get_time_code(words[c + 1]["start_time"] - 1)\n\n # in either case, append the word to the phrase...\n phrase["words"].append(item)\n x += 1\n\n # now add the phrase to the phrases, generate a new phrase, etc.\n if x == 10 or c == (len(items) - 1):\n # print c, phrase\n if c == (len(items) - 1):\n if phrase["end_time"] == '':\n start_time = words[c]["start_time"]\n end_time = int(start_time) + 500\n phrase["end_time"] = get_time_code(end_time)\n\n phrases.append(phrase)\n phrase = new_phrase()\n n_phrase = True\n x = 0\n\n .....\n\n return phrases\n\n</code></pre>\n<ol start=\"6\">\n<li>The final step, Media Convert, invokes the Lambda function <a href=\"https://github.com/aws-samples/localize-content-using-aws-ml-services/blob/main/lambda/mediaconvert/create_mediaconvert_job.py\" target=\"_blank\">create_mediaconvert_job.py</a> to combine the audio stream from Amazon Polly and the subtitle file with the source video file to generate the final output file, which is then stored in an S3 bucket. This step uses <code>MediaConvert</code>, a file-based video transcoding service with broadcast-grade features. It allows you to easily create video-on-demand content and combines advanced video and audio capabilities with a simple web interface. Here again we use the Python <a href=\"https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/mediaconvert.html\" target=\"_blank\">Boto3</a> SDK to create a <code>MediaConvert job</code>:</li>\n</ol>\n<pre><code class=\"lang-\">……\njob_metadata = {'asset_id': asset_id, 'application': "createMediaConvertJob"}\nmc_client = boto3.client('mediaconvert', region_name=region)\nendpoints = mc_client.describe_endpoints()\nmc_endpoint_url = endpoints['Endpoints'][0]['Url']\n\nmc = boto3.client('mediaconvert', region_name=region, endpoint_url=mc_endpoint_url, verify=True)\n\nmc.create_job(Role=mediaconvert_role_arn, UserMetadata=job_metadata, Settings=mc_data["Settings"])\n\n</code></pre>\n<h3><a id=\"Prerequisites_162\"></a>Prerequisites</h3>\nBefore getting started, you must have the following prerequisites:\n<ul>\n<li>An <a href=\"https://aws.amazon.com/free\" target=\"_blank\">AWS account</a></li>\n<li><a href=\"https://aws.amazon.com/cdk/\" target=\"_blank\">AWS Cloud Development Kit</a> (AWS CDK)</li>\n</ul>\n<h3><a id=\"Deploy_the_solution_170\"></a>Deploy the solution</h3>\nTo deploy the solution using the AWS CDK, complete the following steps:\n<ol>\n<li>Clone the <a href=\"https://github.com/aws-samples/localize-content-using-aws-ml-services\" target=\"_blank\">repository</a>:</li>\n</ol>\n<pre><code class=\"lang-\">git clone https://github.com/aws-samples/localize-content-using-aws-ml-services.git \n\n</code></pre>\n<ol start=\"2\">\n<li>To make sure the AWS CDK is <a href=\"https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_bootstrap\" target=\"_blank\">bootstrapped</a>, run the command <code>cdk bootstrap</code> from the root of the repository:</li>\n</ol>\n<pre><code class=\"lang-\">$ cdk bootstrap\n⏳ Bootstrapping environment aws://<acct#>/<region>...\nTrusted accounts for deployment: (none)\nTrusted accounts for lookup: (none)\nUsing default execution policy of 'arn:aws:iam::aws:policy/AdministratorAccess'. Pass '--cloudformation-execution-policies' to customize.\n✅ Environment aws://<acct#>/<region> bootstrapped (no changes).\n\n</code></pre>\n<ol start=\"3\">\n<li>Change the working directory to the root of the repository and run the following command:</li>\n</ol>\n<pre><code class=\"lang-\">cdk deploy\n\n</code></pre>\nBy default, the target audio settings are set to US Spanish (<code>es-US</code>). If you plan to test it with a different target language, use the following command:\n<pre><code class=\"lang-\">\ncdk deploy --parameters pollyLanguageCode=<pollyLanguageCode> \n --parameters pollyVoiceId=<pollyVoiceId>\n --parameters pollyEngine=<pollyEngine> \n --parameters mediaConvertLangShort=<mediaConvertLangShort>\n --parameters mediaConvertLangLong=<mediaConvertLangLong>\n --parameters targetLanguageCode=<targetLanguageCode>\n\n</code></pre>\nThe process takes a few minutes to complete, after which it displays a link that you can use to view the target video file with the translated audio and translated subtitles.\n<img src=\"https://dev-media.amazoncloud.cn/ccbb4314f1584d4c8de79ca16c4295f7_image.png\" alt=\"image.png\" />\n<h3><a id=\"Test_the_solution_218\"></a>Test the solution</h3>\nTo test this solution, we used a small portion of the following <a href=\"https://www.youtube.com/embed/1IxDLeFQKPk?start=6904&end=7073\" target=\"_blank\">AWS re:Invent 2017 video</a> from YouTube, where Amazon Transcribe was first introduced. You can also test the solution with your own video. The original language of our test video is English. When you deploy this solution, you can specify the target audio settings or you can use the default target audio settings, which uses Spanish for generating audio and subtitles. The solution creates an S3 bucket that can be used to upload the video file to.\n<ol>\n<li>On the Amazon S3 console, navigate to the bucket <code>PollyBlogBucket</code>.</li>\n</ol>\n<img src=\"https://dev-media.amazoncloud.cn/bcc30405170c4417be6f75885f55e56c_image.png\" alt=\"image.png\" />\n<ol start=\"2\">\n<li>Choose the bucket, navigate to the <code>/inputVideo</code> directory, and upload the video file (the solution is tested with videos of type mp4). At this point, an S3 event notification triggers the Lambda function, which starts the state machine.</li>\n<li>On the Step Functions console, browse to the state machine (<code>ProcessAudioWithSubtitles</code>).</li>\n<li>Choose one of the runs of the state machine to locate the Graph Inspector.</li>\n</ol>\nThis shows the run results for each state. The Step Functions workflow takes a few minutes to complete, after which you can verify if all the steps successfully completed.\n<img src=\"https://dev-media.amazoncloud.cn/2ee9acbe56804fcfae7ed08194d08c5f_image.png\" alt=\"image.png\" />\n<h3><a id=\"Review_the_output_236\"></a>Review the output</h3>\nTo review the output, open the Amazon S3 console and check if the audio file (.mp3) and the speech mark file (.marks) are stored in the S3 bucket under <code><ROOT_S3_BUCKET>/<UID>/synthesisOutput/</code>.\n<img src=\"https://dev-media.amazoncloud.cn/404fa3c0992643ef80d4d3e6fbb5d260_image.png\" alt=\"image.png\" />\nThe following is a sample of the speech mark file generated from the translated text:\n<pre><code class=\"lang-\">{"time":6,"type":"word","start":2,"end":6,"value":"Qué"}\n{"time":109,"type":"word","start":7,"end":10,"value":"tal"}\n{"time":347,"type":"word","start":11,"end":13,"value":"el"}\n{"time":453,"type":"word","start":14,"end":20,"value":"idioma"}\n{"time":1351,"type":"word","start":22,"end":24,"value":"Ya"}\n{"time":1517,"type":"word","start":25,"end":30,"value":"sabes"}\n{"time":2240,"type":"word","start":32,"end":38,"value":"hablé"}\n{"time":2495,"type":"word","start":39,"end":44,"value":"antes"}\n{"time":2832,"type":"word","start":45,"end":50,"value":"sobre"}\n{"time":3125,"type":"word","start":51,"end":53,"value":"el"}\n{"time":3227,"type":"word","start":54,"end":59,"value":"hecho"}\n{"time":3464,"type":"word","start":60,"end":62,"value":"de"}\n\n</code></pre>\nIn this output, each part of the text is broken out in terms of speech marks:\n<ul>\n<li>time – The timestamp in milliseconds from the beginning of the corresponding audio stream</li>\n<li>type – The type of speech mark (sentence, word, viseme, or SSML)</li>\n<li>start – The offset in bytes (not characters) of the start of the object in the input text (not including viseme marks)</li>\n<li>end – The offset in bytes (not characters) of the object’s end in the input text (not including viseme marks)</li>\n<li>value – Individual words in the sentence</li>\n</ul>\nThe generated subtitle file is written back to the S3 bucket. You can find the file under <code><ROOT_S3_BUCKET>/<UID>/subtitlesOutput/</code>. Inspect the subtitle file; the content should be similar to the following text:\n<pre><code class=\"lang-\">1\n00:00:00,006 --> 00:00:03,226\n¿Qué tal el idioma? Ya sabes, hablé antes sobre el\n\n2\n00:00:03,227 --> 00:00:06,065\nhecho de que el año pasado lanzamos Polly y Lex,\n\n3\n00:00:06,066 --> 00:00:09,263\npero hay muchas otras cosas que los constructores quieren hacer\n\n4\n00:00:09,264 --> 00:00:11,642\ncon el lenguaje. Y una de las cosas que ha\n\n5\n00:00:11,643 --> 00:00:14,549\nsido interesante es que ahora hay tantos datos que están\n\n</code></pre>\nAfter the subtitles file and audio file are generated, the final source video file is created using MediaConvert. Check the MediaConvert console to verify if the job status is <code>COMPLETE</code>.\nWhen the MediaConvert job is complete, the final video file is generated and saved back to the S3 bucket, which can be found under <code><ROOT_S3_BUCKET>/<UID>/convertedAV/</code>.\nAs part of this deployment, the final video is distributed through an<a href=\"https://aws.amazon.com/cloudfront/\" target=\"_blank\"> Amazon CloudFront</a> (CDN) link and displayed in the terminal or in the <a href=\"https://aws.amazon.com/cloudformation/\" target=\"_blank\">AWS CloudFormation</a> console.\nOpen the URL in a browser to view the original video with additional options for audio and subtitles. You can verify that the translated audio and subtitles are in sync.\n<h3><a id=\"Conclusion_303\"></a>Conclusion</h3>\nIn this post, we discussed how to create new language versions of video files without the need of manual intervention. Content creators can use this process to synchronize the audio and subtitles of their videos and reach a global audience.\nYou can easily integrate this approach into your own production pipelines to handle large volumes and scale according to your needs. Amazon Polly uses <a href=\"https://docs.aws.amazon.com/polly/latest/dg/NTTS-main.html\" target=\"_blank\">Neural TTS (NTTS)</a> to produce natural and human-like text-to-speech voices. It also supports <a href=\"https://docs.aws.amazon.com/polly/latest/dg/ssml.html\" target=\"_blank\">generating speech from SSML</a>, which gives you additional control over how Amazon Polly generates speech from the text provided. Amazon Polly also provides a <a href=\"https://docs.aws.amazon.com/polly/latest/dg/voicelist.html\" target=\"_blank\">variety of different voices</a> in multiple languages to support your needs.\nGet started with AWS machine learning services by visiting the <a href=\"https://aws.amazon.com/machine-learning/\" target=\"_blank\">product page</a>, or refer the <a href=\"https://aws.amazon.com/ml-solutions-lab/\" target=\"_blank\">Amazon Machine Learning Solutions Lab</a> page where you can collaborate with experts to bring machine learning solutions to your organization.\n<h3><a id=\"Additional_resources_312\"></a>Additional resources</h3>\nFor more information about the services used in this solution, refer to the following:\n<ul>\n<li><a href=\"https://docs.aws.amazon.com/transcribe/latest/dg/transcribe-dg.pdf\" target=\"_blank\">Amazon Transcribe Developer Guide</a></li>\n<li><a href=\"https://docs.aws.amazon.com/translate/latest/dg/translate-dg.pdf\" target=\"_blank\">Amazon Translate Developer Guide</a></li>\n<li><a href=\"https://docs.aws.amazon.com/polly/latest/dg/what-is.html\" target=\"_blank\">AWS Polly Developer Guide</a></li>\n<li><a href=\"https://docs.aws.amazon.com/step-functions/latest/dg/step-functions-dg.pdf\" target=\"_blank\">AWS Step Functions Developer Guide</a></li>\n<li><a href=\"https://docs.aws.amazon.com/mediaconvert/latest/ug/mediaconvert-guide.pdf\" target=\"_blank\">AWS Elemental MediaConvert User Guide</a></li>\n<li><a href=\"https://docs.aws.amazon.com/polly/latest/dg/SupportedLanguage.html\" target=\"_blank\">Languages Supported by Amazon Polly</a></li>\n<li><a href=\"https://docs.aws.amazon.com/transcribe/latest/dg/supported-languages.html\" target=\"_blank\">Languages Supported by Amazon Transcribe</a></li>\n</ul>\n<h4><a id=\"About_the_authors_324\"></a>About the authors</h4>\n<img src=\"https://dev-media.amazoncloud.cn/cecd9facd8a2422ea7634705a6d220da_image.png\" alt=\"image.png\" />\nReagan Rosario works as a solutions architect at AWS focusing on education technology companies. He loves helping customers build scalable, highly available, and secure solutions in the AWS Cloud. He has more than a decade of experience working in a variety of technology roles, with a focus on software engineering and architecture.\n<img src=\"https://dev-media.amazoncloud.cn/bc2a050ec94346fa9ee2576d0243fc74_image.png\" alt=\"image.png\" />\nAnil Kodali is a Solutions Architect with Amazon Web Services. He works with AWS EdTech customers, guiding them with architectural best practices for migrating existing workloads to the cloud and designing new workloads with a cloud-first approach. Prior to joining AWS, he worked with large retailers to help them with their cloud migrations.\n<img src=\"https://dev-media.amazoncloud.cn/07046593896e45d6bd9f09349370402e_image.png\" alt=\"image.png\" />\nPrasanna Saraswathi Krishnan is a Solutions Architect with Amazon Web Services working with EdTech customers. He helps them drive their cloud architecture and data strategy using best practices. His background is in distributed computing, big data analytics, and data engineering. He is passionate about machine learning and natural language processing.\n"}

亚马逊云科技解决方案基于行业客户应用场景及技术领域的解决方案

联系亚马逊云科技专家