Step Functions Distributed Map – A Serverless Solution for Large-Scale Parallel Data Processing

I am excited to announce the availability of a **distributed map** for [AWS Step Functions](https://aws.amazon.com/step-functions). This flow extends support for orchestrating large-scale parallel workloads such as the on-demand processing of semi-structured data.

The Step Functions [map state](https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-map-state.html) executes the same processing steps for multiple entries in a dataset. The existing map state is limited to 40 parallel iterations at a time. This limit makes it challenging to scale data processing workloads to thousands of items (or more) in parallel. Until now, achieving higher parallelism required complex workarounds on top of the existing map state component.

The new distributed map state allows you to write Step Functions workflows that coordinate large-scale parallel workloads within your serverless applications. You can now iterate over millions of objects such as logs, images, or .csv files stored in [Amazon Simple Storage Service (Amazon S3)](https://aws.amazon.com/s3/). The new distributed map state can launch up to ten thousand parallel workflows to process data.

You can process data by composing any service API supported by Step Functions, but typically, you will invoke Lambda functions to process the data with code written in [your favorite programming language](https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtimes.html).

Step Functions distributed map supports a maximum concurrency of up to 10,000 executions in parallel, which is well above the concurrency supported by many other AWS services. You can use the maximum concurrency setting of the distributed map to ensure that you do not exceed the concurrency of a downstream service. There are two factors to consider when working with other services. The first is the maximum concurrency supported by the service for your account. The second is the burst and ramping rates, which determine how quickly you can reach that maximum concurrency.

Let’s use Lambda as an example. Your functions’ concurrency is the number of instances that serve requests at a given time. The [default maximum concurrency](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) quota for Lambda is 1,000 per AWS Region. [You can ask for an increase](https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html) at any time. For an initial burst of traffic, your functions’ cumulative concurrency in a Region can reach [an initial level of between 500 and 3000](https://docs.aws.amazon.com/lambda/latest/dg/invocation-scaling.html), which varies per Region. The burst concurrency quota applies to all your functions in the Region.

When using a distributed map, be sure to verify the quotas of downstream services. Limit the distributed map maximum concurrency during development, and plan for service quota increases accordingly.

To compare the new distributed map with the original map state flow, I created this table.

![image.png](https://dev-media.amazoncloud.cn/e4e70f878d40498a85ef0def9fe0327b_image.png)

Sub-workflows within a distributed map work with both [Standard Workflows](https://docs.aws.amazon.com/step-functions/latest/dg/concepts-standard-vs-express.html) and the low-latency, short-duration [Express Workflows](https://docs.aws.amazon.com/step-functions/latest/dg/concepts-standard-vs-express.html).

This new capability is optimized to work with S3. I can configure the bucket and prefix where my data is stored directly from the distributed map configuration. The distributed map stops reading after 100 million items and supports JSON or CSV files of up to 10 GB.

When processing large files, think about downstream service capabilities. Let’s take Lambda again as an example. Each input (a file on S3, for example) must fit within [the Lambda function execution environment in terms of temporary storage and memory](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html#function-configuration-deployment-and-execution). To make it easier to handle large files, [Lambda Powertools for Python](https://awslabs.github.io/aws-lambda-powertools-python) introduced [a new streaming feature](https://awslabs.github.io/aws-lambda-powertools-python/2.4.0/utilities/streaming/) to fetch, transform, and process S3 objects with a minimal memory footprint. This allows your Lambda functions to handle files larger than the size of their execution environment. To learn more about this new capability, check [the Lambda Powertools documentation](https://awslabs.github.io/aws-lambda-powertools-python/2.4.0/utilities/streaming/).

### Let’s See It in Action
For this demo, I will create a workflow that processes one thousand dog images that are already stored on S3.

```
➜ ~ aws s3 ls awsnewsblog-distributed-map/images/
2022-11-08 15:03:36 27034 n02085620_10074.jpg
2022-11-08 15:03:36 34458 n02085620_10131.jpg
2022-11-08 15:03:36 12883 n02085620_10621.jpg
2022-11-08 15:03:36 34910 n02085620_1073.jpg
...

➜ ~ aws s3 ls awsnewsblog-distributed-map/images/ | wc -l
 1000
```

The workflow and the S3 bucket must be in the same Region.

To get started, I navigate to the Step Functions page of the [AWS Management Console](https://console.aws.amazon.com/) and select **Create state machine**. On the next page, I choose to design my workflow using the visual editor. The distributed map works with **Standard** workflows, so I keep the default selection as-is. I select **Next** to enter the visual editor.

![image.png](https://dev-media.amazoncloud.cn/dbd9dd3873d844b494c0bd29681522db_image.png)

In the visual editor, I search for and select the **Map** component on the left-side pane, and I drag it to the workflow area. On the right side, I configure the component. I choose **Distributed** as **Processing mode** and **Amazon S3** as **Item Source**.

Distributed maps are natively integrated with S3. I enter the name of the bucket (`awsnewsblog-distributed-map`) and the prefix (`images`) where my images are stored.

![image.png](https://dev-media.amazoncloud.cn/37dccf99fe66470397fd5b03a610faea_image.png)

In the **Runtime Settings** section, I choose **Express** for **Child workflow type**. I may also decide to restrict the **Concurrency limit**. Doing so helps ensure that the workflow operates within the concurrency quotas of the downstream services (Lambda in this demo) for a particular account or Region.

By default, the output of my sub-workflows is aggregated as state output, up to 256 KB. To process larger outputs, I may choose to **Export map state results to Amazon S3**.

![image.png](https://dev-media.amazoncloud.cn/02f004b1e0a349a894df58eaaea93290_image.png)

Finally, I define what to do for each file. In this demo, I want to invoke a Lambda function for each file in the S3 bucket. The function already exists. I search for and select the Lambda invocation action on the left-side pane, and I drag it to the distributed map component. Then, I use the right-side configuration panel to select the actual Lambda function to invoke: `AWSNewsBlogDistributedMap` in this example.

![image.png](https://dev-media.amazoncloud.cn/f51efd9d047a4d53b7e7fafeea576117_image.png)

When I am done, I select **Next**. I select **Next** again on the **Review generated code** page (not shown here).

On the **Specify state machine settings** page, I enter a **Name** for my state machine and the IAM **Permissions** it needs to run. Then, I select **Create state machine**.

![image.png](https://dev-media.amazoncloud.cn/8a1c81bae65a498294932256ba8dec2b_image.png)

Now I am ready to start the execution. On the State machine page, I select the new workflow and select **Start execution**. I can optionally enter a JSON document to pass to the workflow. In this demo, the workflow does not handle the input data, so I leave it as-is and select **Start execution**.

![image.png](https://dev-media.amazoncloud.cn/7d1ac6ffaf6045649f3b7d86bdd33870_image.png)

During the execution of the workflow, I can monitor the progress. I observe the number of iterations and the number of items that were processed successfully or ended in error.

![image.png](https://dev-media.amazoncloud.cn/dbee6ec2d361400992cb861fffb8bda3_image.png)

I can drill down on one specific execution to see the details.

![image.png](https://dev-media.amazoncloud.cn/234ae21a19944a61a396b17476bdfb07_image.png)

With just a few clicks, I created a large-scale and heavily parallel workflow able to handle a very large quantity of data.

### Which AWS Service Should I Use
As often happens on AWS, you might observe an overlap between this new capability and existing services such as [AWS Glue](https://aws.amazon.com/glue/), [Amazon EMR](https://aws.amazon.com/emr), or [Amazon S3 Batch Operations](https://aws.amazon.com/s3/features/batch-operations/). Let’s try to differentiate the use cases.

In my mental model, data scientists and data engineers use AWS Glue and EMR to process large amounts of data. On the other hand, application developers will use Step Functions to add serverless data processing to their applications. Step Functions is able to scale from zero quickly, which makes it a good fit for interactive workloads where customers may be waiting for the results. Finally, system administrators and IT operations teams are likely to use [Amazon S3 Batch Operations](https://aws.amazon.com/s3/features/batch-operations/) for single-step IT automation operations such as copying, tagging, or changing permissions on billions of S3 objects.

### Pricing and Availability
AWS Step Functions’ distributed map is generally available in the following ten AWS Regions: US East (Ohio, N. Virginia), US West (Oregon), Asia Pacific (Singapore, Sydney, Tokyo), Canada (Central), and Europe (Frankfurt, Ireland, Stockholm).

The pricing model for the existing inline map state does not change. For the new distributed map state, we charge one state transition per iteration. Pricing varies between Regions, and it starts at $0.025 per 1,000 state transitions. When you process your data using [Express Workflows](https://docs.aws.amazon.com/step-functions/latest/dg/concepts-standard-vs-express.html), you are also charged based on the number of requests for your workflow and its duration. Again, prices vary between Regions, but they start at $1.00 per 1 million requests and $0.06 per GB-hour (prorated to 100 ms).

For the same number of iterations, you will observe a cost reduction when using the combination of the distributed map and Standard Workflows compared to the existing inline map. When you use Express Workflows, expect the costs to stay the same for more value with the distributed map.

I am really excited to discover what you will build using this new capability and how it will unlock innovation. [Go build highly parallel serverless data processing workflows today!](https://console.aws.amazon.com/states)

-- [seb](https://twitter.com/sebsto)

![image.png](https://dev-media.amazoncloud.cn/1dfbcd89d02541d88b5808c109cd6ef5_image.png)

### **[Sébastien Stormacq](https://aws.amazon.com/blogs/aws/author/stormacq/)**
Seb has been writing code since he first touched a Commodore 64 in the mid-eighties. He inspires builders to unlock the value of the AWS cloud, using his secret blend of passion, enthusiasm, customer advocacy, curiosity and creativity. His interests are software architecture, developer tools and mobile computing. If you want to sell him something, be sure it has an API. Follow him on Twitter @sebsto.
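As a companion to the console walkthrough in "Let’s See It in Action", the same kind of state machine can also be written in Amazon States Language. The following is a minimal sketch, not the code the console generated in the demo (that page is not shown in the article). The state names `ProcessImages` and `InvokeLambda`, the helper function name, and the concurrency cap of 100 are assumptions made for illustration. The field names follow the Amazon States Language as I understand it and should be checked against the Step Functions documentation before use.

```python
import json


def build_distributed_map_definition(bucket, prefix, function_name, max_concurrency=100):
    """Build an Amazon States Language definition as a Python dict.

    The definition lists objects under an S3 prefix and invokes one Lambda
    function per object, using Express child workflows.
    State names and the default concurrency cap are illustrative assumptions.
    """
    return {
        "Comment": "Distributed map over S3 objects (illustrative sketch)",
        "StartAt": "ProcessImages",
        "States": {
            "ProcessImages": {
                "Type": "Map",
                # Cap parallel child workflows to stay within downstream quotas,
                # such as the Lambda concurrency quota of the account.
                "MaxConcurrency": max_concurrency,
                # Item source: the objects stored under the bucket and prefix.
                "ItemReader": {
                    "Resource": "arn:aws:states:::s3:listObjectsV2",
                    "Parameters": {"Bucket": bucket, "Prefix": prefix},
                },
                "ItemProcessor": {
                    # Distributed processing mode with Express child workflows.
                    "ProcessorConfig": {
                        "Mode": "DISTRIBUTED",
                        "ExecutionType": "EXPRESS",
                    },
                    "StartAt": "InvokeLambda",
                    "States": {
                        "InvokeLambda": {
                            "Type": "Task",
                            "Resource": "arn:aws:states:::lambda:invoke",
                            "Parameters": {
                                "FunctionName": function_name,
                                # Pass the current item as the function payload.
                                "Payload.$": "$",
                            },
                            "End": True,
                        }
                    },
                },
                "End": True,
            }
        },
    }


# Bucket, prefix, and function name are the ones used in the demo.
definition = build_distributed_map_definition(
    bucket="awsnewsblog-distributed-map",
    prefix="images",
    function_name="AWSNewsBlogDistributedMap",
)
print(json.dumps(definition, indent=2))
```

Keeping the concurrency cap as a parameter makes it easy to start low during development and raise it once the quotas of the downstream services have been confirmed or increased.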