Break through language barriers with Amazon Transcribe, Amazon Translate, and Amazon Polly

{"value":"Imagine a surgeon taking video calls with patients across the globe without the need of a human translator. What if a fledgling startup could easily expand their product across borders and into new geographical markets by offering fluid, accurate, multilingual customer support and sales, all without the need of a live human translator? What happens to your business when you’re no longer bound by language?\n\nIt’s common today to have virtual meetings with international teams and customers that speak many different languages. Whether they’re internal or external meetings, meaning often gets lost in complex discussions and you may encounter language barriers that prevent you from being as effective as you could be.\n\nIn this post, you will learn how to use three fully managed AWS services ([Amazon Transcribe](https://aws.amazon.com/cn/transcribe/), A[mazon Translate](https://aws.amazon.com/cn/translate/), and [Amazon Polly](https://aws.amazon.com/cn/polly/)) to produce a near-real-time speech-to-speech translator solution that can quickly translate a source speaker’s live voice input into a spoken, accurate, translated target language, all with zero machine learning (ML) experience.\n\n#### **Overview of solution**\n\nOur translator consists of three fully managed AWS ML services working together in a single Python script by using the [AWS SDK for Python (Boto3)](https://aws.amazon.com/cn/sdk-for-python/) for our text translation and text-to-speech portions, and an asynchronous streaming SDK for audio input transcription.\n\n#### **Amazon Transcribe: Streaming speech to text**\n\nThe first service you use in our stack is Amazon Transcribe, a fully managed speech-to-text service that takes input speech and transcribes it to text. Amazon Transcribe has flexible ingestion methods, batch or streaming, because it accepts either stored audio files or streaming audio data. In this post, you use the [asynchronous Amazon Transcribe streaming SDK for Python](https://aws.amazon.com/cn/blogs/developer/transcribe-streaming-sdk-for-python-preview/), which uses the HTTP/2 streaming protocol to stream live audio and receive live transcriptions.\n\nWhen we first built this prototype, Amazon Transcribe streaming ingestion didn’t support automatic language detection, but this is no longer the case as of November 2021. Both batch and streaming ingestion now support automatic language detection for all [supported languages](supported languages). In this post, we show how a parameter-based solution though a seamless multi-language parameterless design is possible through the use of streaming automatic language detection. After our transcribed speech segment is returned as text, you send a request to Amazon Translate to translate and return the results in our Amazon Transcribe ```EventHandler```method.\n\n#### **Amazon Translate: State-of-the-art, fully managed translation API**\n\nNext in our stack is Amazon Translate, a neural machine translation service that delivers fast, high-quality, affordable, and customizable language translation. As of June of 2022, Amazon Translate supports translation across 75 languages, with new language pairs and improvements being made constantly. Amazon Translate uses deep learning models hosted on a highly scalable and resilient AWS Cloud architecture to quickly deliver accurate translations either in real time or batched, depending on your use case. Using Amazon Translate is straightforward and requires no management of underlying architecture or ML skills. 
#### **Amazon Polly: Fully managed text-to-speech API**

Finally, you send the translated text to Amazon Polly, a fully managed text-to-speech service that can either send back lifelike audio clip responses for immediate streaming playback or batch and save them in Amazon Simple Storage Service (Amazon S3) for later use. You can control various aspects of speech such as pronunciation, volume, pitch, speech rate, and more using standardized [Speech Synthesis Markup Language](https://www.w3.org/TR/speech-synthesis11/) (SSML).

You can synthesize speech for certain Amazon Polly [Neural voices](https://docs.aws.amazon.com/polly/latest/dg/ntts-voices-main.html) using the Newscaster style to make them sound like a TV or radio newscaster. You can also detect when specific words or sentences in the text are being spoken, based on the metadata included in the audio stream. This allows the developer to synchronize graphical highlighting and animations, such as the lip movements of an avatar, with the synthesized speech.

You can modify the pronunciation of particular words, such as company names, acronyms, foreign words, or neologisms, for example “P!nk,” “ROTFL,” or “C’est la vie” (when spoken in a non-French voice), using custom lexicons.
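To make the SSML control concrete, here is a small, self-contained sketch (not part of the main script) that slows the speech rate and inserts a pause before synthesizing the text. The voice choice and output file name are illustrative assumptions.

```
import boto3

polly = boto3.client("polly", region_name="us-west-2")

ssml = (
    "<speak>"
    "<prosody rate='90%'>Thank you for joining today's call.</prosody>"
    "<break time='300ms'/>"
    "<prosody pitch='+5%'>We will begin shortly.</prosody>"
    "</speak>"
)

response = polly.synthesize_speech(
    Engine="standard",
    VoiceId="Joanna",      # illustrative choice of a standard-engine voice
    OutputFormat="mp3",
    TextType="ssml",       # tell Amazon Polly the input is SSML, not plain text
    Text=ssml,
)

# Save the synthesized clip locally for playback.
with open("sample.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```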
#### **Architecture overview**

The following diagram illustrates our solution architecture.

![image.png](https://dev-media.amazoncloud.cn/1bee7d54f88c4aada69039b91c58ad34_image.png)

This diagram shows the data flow from the client device to Amazon Transcribe, Amazon Translate, and Amazon Polly.

The workflow is as follows:

1. Audio is ingested by the Python SDK.
2. Amazon Transcribe converts the speech to text, in 39 possible languages.
3. Amazon Translate translates the text into the target language.
4. Amazon Polly converts the translated text to speech.
5. Audio is output to the speakers.

#### **Prerequisites**

You need a host machine set up with a microphone, speakers, and a reliable internet connection. A modern laptop should work fine because no additional hardware is needed. Next, you need to set up the machine with some software tools.

You must have Python 3.7+ installed to use the asynchronous Amazon Transcribe streaming SDK and a Python module called `pyaudio`, which you use to control the machine’s microphone and speakers. This module depends on a C library called `portaudio.h`. If you encounter `pyaudio` errors, we suggest checking your OS to see if you have the `portaudio.h` library installed.

For authorization and authentication of service calls, you create an [AWS Identity and Access Management](https://aws.amazon.com/cn/iam/) (IAM) service role with permissions to call the necessary AWS services. By configuring the [AWS Command Line Interface](https://aws.amazon.com/cn/cli/) (AWS CLI) with this IAM service role, you can run our script on your machine without having to pass in keys or passwords, because the AWS libraries use the configured AWS CLI user’s credentials. This is a convenient method for rapid prototyping and ensures our services are being called by an authorized identity. As always, follow the principle of least privilege when assigning IAM policies to an IAM user or role.

To summarize, you need the following prerequisites:

- A PC, Mac, or Linux machine with a microphone, speakers, and an internet connection
- The `portaudio.h` C library for your OS (brew, apt-get, wget), which is needed for `pyaudio` to work
- AWS CLI 2.0 with a properly authorized IAM user, configured by running `aws configure` in the AWS CLI
- Python 3.7+
- [The asynchronous Amazon Transcribe Python SDK](https://aws.amazon.com/cn/blogs/developer/transcribe-streaming-sdk-for-python-preview/)
- The following Python libraries:
    - boto3
    - amazon-transcribe
    - pyaudio
    - asyncio
    - concurrent

#### **Implement the solution**

You will rely heavily on the asynchronous Amazon Transcribe streaming SDK for Python as a starting point and build on top of that specific SDK. After you have experimented with the streaming SDK for Python, you add [streaming microphone](https://github.com/awslabs/amazon-transcribe-streaming-sdk/blob/develop/examples/simple_mic.py) input by using `pyaudio`, a commonly used open-source Python library for manipulating audio data. Then you add Boto3 calls to Amazon Translate and Amazon Polly for the translation and text-to-speech functionality. Finally, you stream the translated speech out through the computer’s speakers, again with `pyaudio`. The Python module `concurrent` gives you the ability to run blocking code in its own asynchronous thread so that you can play back the returned Amazon Polly speech in a seamless, non-blocking way.

Let’s import all our necessary modules and transcribe streaming classes, and instantiate some globals:

```
import boto3
import asyncio
import pyaudio
import concurrent
from amazon_transcribe.client import TranscribeStreamingClient
from amazon_transcribe.handlers import TranscriptResultStreamHandler
from amazon_transcribe.model import TranscriptEvent


polly = boto3.client('polly', region_name='us-west-2')
translate = boto3.client(service_name='translate', region_name='us-west-2', use_ssl=True)
pa = pyaudio.PyAudio()

#for mic stream, 1024 should work fine
default_frames = 1024

#dictionary holding the translation parameters
params = {}

#current params are set up for English to Mandarin, modify to your liking
params['source_language'] = "en"
params['target_language'] = "zh"
params['lang_code_for_polly'] = "cmn-CN"
params['voice_id'] = "Zhiyu"
params['lang_code_for_transcribe'] = "en-US"
```
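For example, to point the same pipeline at English-to-Spanish instead, you could swap in values like the following. This is an illustrative configuration; verify the voice and language codes against the Amazon Polly and Amazon Transcribe documentation for your Region.

```
# Hypothetical alternative configuration: English speech in, Spanish speech out.
params['source_language'] = "en"              # Amazon Translate source language
params['target_language'] = "es"              # Amazon Translate target language
params['lang_code_for_polly'] = "es-US"       # Amazon Polly language code
params['voice_id'] = "Lupe"                   # a US Spanish Amazon Polly voice
params['lang_code_for_transcribe'] = "en-US"  # Amazon Transcribe streaming language
```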
First, you use `pyaudio` to obtain the input device’s sampling rate, device index, and channel count:

```
#try grabbing the default input device and see if we get lucky
default_input_device = pa.get_default_input_device_info()

# verify this is your microphone device
print(default_input_device)

#if correct then set it as your input device and define some globals
input_device = default_input_device

input_channel_count = input_device["maxInputChannels"]
input_sample_rate = input_device["defaultSampleRate"]
input_dev_index = input_device["index"]
```

If this isn’t working, you can also loop through and print your devices as shown in the following code, and then use the device index to retrieve the device information with `pyaudio`:

```
print("Available devices:\n")
for i in range(0, pa.get_device_count()):
    info = pa.get_device_info_by_index(i)
    print(str(info["index"]) + ": \t %s \n \t %s \n" % (info["name"], pa.get_host_api_info_by_index(info["hostApi"])["name"]))

# select the correct index from the above returned list of devices, for example zero
dev_index = 0
input_device = pa.get_device_info_by_index(dev_index)

#set globals for microphone stream
input_channel_count = input_device["maxInputChannels"]
input_sample_rate = input_device["defaultSampleRate"]
input_dev_index = input_device["index"]
```

You use `input_channel_count`, `input_sample_rate`, and `input_dev_index` as parameters in a mic stream. In that stream’s callback function, you use an `asyncio` non-blocking, thread-safe callback to put the input bytes of the mic stream into an `asyncio` input queue. Take note of the `loop` and `input_queue` objects created with `asyncio` and how they’re used in the following code:

```
async def mic_stream():
    # This function wraps the raw input stream from the microphone, forwarding
    # the blocks to an asyncio.Queue.

    loop = asyncio.get_event_loop()
    input_queue = asyncio.Queue()

    def callback(indata, frame_count, time_info, status):
        loop.call_soon_threadsafe(input_queue.put_nowait, indata)
        return (indata, pyaudio.paContinue)

    # Be sure to use the correct parameters for the audio stream that match
    # the audio formats described for the source language you'll be using:
    # https://docs.aws.amazon.com/transcribe/latest/dg/streaming.html

    print(input_device)

    #Open stream
    stream = pa.open(format=pyaudio.paInt16,
                     channels=input_channel_count,
                     rate=int(input_sample_rate),
                     input=True,
                     frames_per_buffer=default_frames,
                     input_device_index=input_dev_index,
                     stream_callback=callback)
    # Initiate the audio stream and asynchronously yield the audio chunks
    # as they become available.
    stream.start_stream()
    print("started stream")
    while True:
        indata = await input_queue.get()
        yield indata
```

Now when the generator function `mic_stream()` is called, it continually yields input bytes as long as there is microphone input data in the input queue.

Now that you know how to get input bytes from the microphone, let’s look at how to write Amazon Polly output audio bytes to a speaker output stream:

```
#text will come from MyEventsHandler
def aws_polly_tts(text):

    response = polly.synthesize_speech(
        Engine='standard',
        LanguageCode=params['lang_code_for_polly'],
        Text=text,
        VoiceId=params['voice_id'],
        OutputFormat="pcm",
    )
    output_bytes = response['AudioStream']

    #play to the speakers
    write_to_speaker_stream(output_bytes)


#how to write audio bytes to speakers
def write_to_speaker_stream(output_bytes):
    """Consumes bytes in chunks to produce the response's output"""
    print("Streaming started...")
    chunk_len = 1024
    channels = 1
    sample_rate = 16000

    if output_bytes:
        polly_stream = pa.open(
            format=pyaudio.paInt16,
            channels=channels,
            rate=sample_rate,
            output=True,
        )
        #this is a blocking call - will sort this out with concurrent later
        while True:
            data = output_bytes.read(chunk_len)
            polly_stream.write(data)

            #If there's no more data to read, stop streaming
            if not data:
                output_bytes.close()
                polly_stream.stop_stream()
                polly_stream.close()
                break
        print("Streaming completed.")
    else:
        print("Nothing to stream.")
```
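If you want to sanity-check the Amazon Polly playback path on its own before wiring in transcription and translation, a quick call like the following should speak a short phrase through your speakers. The sample text is arbitrary and assumes the Mandarin settings and globals defined earlier in this script.

```
# Optional standalone check of the Polly -> speaker path (illustrative only).
# Assumes polly, pa, and params from the globals above are in scope.
aws_polly_tts("你好，这是一条测试消息。")
```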
Now let’s expand on what you built in the post Asynchronous Amazon Transcribe Streaming SDK for Python. In the following code, you create an executor object using the `ThreadPoolExecutor` subclass from `concurrent` with three workers. You then add an Amazon Translate call on the finalized returned transcript in the `EventHandler`, and pass that translated text, the executor object, and our `aws_polly_tts()` function into an `asyncio` loop with `loop.run_in_executor()`, which runs our Amazon Polly function (with the translated input text) asynchronously at the start of the next iteration of the `asyncio` loop.

```
#use concurrent package to create an executor object with 3 workers ie threads
executor = concurrent.futures.ThreadPoolExecutor(max_workers=3)

class MyEventHandler(TranscriptResultStreamHandler):
    async def handle_transcript_event(self, transcript_event: TranscriptEvent):

        #If the transcription is finalized, send it to translate

        results = transcript_event.transcript.results
        if len(results) > 0:
            if len(results[0].alternatives) > 0:
                transcript = results[0].alternatives[0].transcript
                print("transcript:", transcript)

                print(results[0].channel_id)
                if hasattr(results[0], "is_partial") and results[0].is_partial == False:

                    #translate only 1 channel. the other channel is a duplicate
                    if results[0].channel_id == "ch_0":
                        trans_result = translate.translate_text(
                            Text=transcript,
                            SourceLanguageCode=params['source_language'],
                            TargetLanguageCode=params['target_language']
                        )
                        print("translated text:" + trans_result.get("TranslatedText"))
                        text = trans_result.get("TranslatedText")

                        #we run aws_polly_tts with a non-blocking executor at every loop iteration
                        await loop.run_in_executor(executor, aws_polly_tts, text)
```
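If the `run_in_executor()` pattern is new to you, the following minimal, self-contained sketch (independent of the translator script) shows how a blocking function is pushed onto a worker thread while the event loop stays responsive. The `time.sleep()` call stands in for a blocking call such as `write_to_speaker_stream()`.

```
import asyncio
import concurrent.futures
import time

def blocking_playback(label):
    # Stand-in for blocking work, such as writing audio to the speakers.
    time.sleep(2)
    print(f"finished {label}")

async def main():
    loop = asyncio.get_running_loop()
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=3)

    # The blocking work runs in a worker thread; the event loop remains
    # free to keep receiving transcription events in the meantime.
    task = loop.run_in_executor(executor, blocking_playback, "clip 1")
    print("event loop is still responsive here")
    await task

asyncio.run(main())
```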
Finally, we have the `loop_me()` function. In it, you define `write_chunks()`, which takes an Amazon Transcribe stream as an argument and asynchronously writes chunks of streaming mic input to it. You then use `MyEventHandler()` with the output transcription stream as its argument and create a handler object. Then you use `await` with `asyncio.gather()` and pass in `write_chunks()` and the handler’s `handle_events()` method to handle the eventual futures of these coroutines. Lastly, you get the event loop and run the `loop_me()` function with `run_until_complete()`. See the following code:

```
async def loop_me():
    # Set up our client with our chosen AWS Region
    client = TranscribeStreamingClient(region="us-west-2")
    stream = await client.start_stream_transcription(
        language_code=params['lang_code_for_transcribe'],
        media_sample_rate_hz=int(input_sample_rate),
        number_of_channels=2,
        enable_channel_identification=True,
        media_encoding="pcm",
    )
    recorded_frames = []

    async def write_chunks(stream):
        # This connects the raw audio chunks generator coming from the microphone
        # and passes them along to the transcription stream.
        print("getting mic stream")
        async for chunk in mic_stream():
            recorded_frames.append(chunk)
            await stream.input_stream.send_audio_event(audio_chunk=chunk)
        await stream.input_stream.end_stream()

    handler = MyEventHandler(stream.output_stream)
    await asyncio.gather(write_chunks(stream), handler.handle_events())

#write a proper while loop here
loop = asyncio.get_event_loop()
loop.run_until_complete(loop_me())
loop.close()
```

When the preceding code is run together without errors, you can speak into the microphone and quickly hear your voice translated to Mandarin Chinese. With the automatic language detection features of Amazon Transcribe and Amazon Translate, the same approach can translate any supported input language into the target language. You can speak for quite some time, and because of the non-blocking nature of the function calls, all of your speech input is translated and spoken, making this an excellent tool for translating live speeches.

#### **Conclusion**

Although this post demonstrated how these three fully managed AWS APIs can function seamlessly together, we encourage you to think about how you could use these services in other ways to deliver multilingual support for services or media, like multilingual closed captioning, for a fraction of the current cost. Medicine, business, and even diplomatic relations could all benefit from an ever-improving, low-cost, low-maintenance translation service.

For more information about the proof-of-concept code base for this use case, check out our [GitHub repository](https://github.com/aws-samples/amazon-live-translation-polly-transcribe).

——————————————————————————————————————

#### **About the Authors**

![image.png](https://dev-media.amazoncloud.cn/7606aadb112b469d91f713b2d1823743_image.png) **Michael Tran** is a Solutions Architect with the Envision Engineering team at Amazon Web Services. He provides technical guidance and helps customers accelerate their ability to innovate by showing the art of the possible on AWS. He has built multiple prototypes around AI/ML and IoT for our customers. You can contact him @Mike_Trann on Twitter.

![image.png](https://dev-media.amazoncloud.cn/18c9d5948a9749f99df14f79ae45592e_image.png) **Cameron Wilkes** is a Prototyping Architect on the AWS Industry Accelerator team. While on the team he delivered several ML-based prototypes to customers to demonstrate the “Art of the Possible” of ML on AWS. He enjoys music production, off-roading, and design.