Build Health Aware CI/CD Pipelines


At the moment of imminent failure, you want to avoid an unlucky deployment. I’ll start with a short story that demonstrates the purpose of this post.

The DevOps team has just started a database upgrade with a planned outage of 30 minutes. The team automated the entire upgrade flow and triggered a CI/CD pipeline with no human intervention, and the upgrade is progressing smoothly. Then, 20 minutes in, the pipeline gets stuck and the upgrade stops progressing. The maintenance window expires and customers can’t transact. The team creates a support case, and the AWS engineer confirms that the upgrade is failing because of an ongoing AWS Health incident in the us-west-2 Region. The engineer directs the DevOps team to keep monitoring the [status.aws.amazon.com](http://status.aws.amazon.com/) page for updates on incident resolution. The event continues for three hours, during which time customers can’t transact. Once it is resolved, the DevOps team retries the failed pipeline, and it completes successfully.

After the incident, the DevOps team explored the possibilities for avoiding these types of incidents in the future. The team learned about the AWS Health API, which provides programmatic access to AWS Health information. In this post, I’ll help the DevOps team make the most of the [AWS Health](https://aws.amazon.com/health/) API to proactively prevent unintended outages.

AWS provides Business and Enterprise Support customers with access to the AWS Health API. Through it, customers can see ongoing events in the AWS infrastructure that may impact their service usage. Incidents can be Regional, AZ-specific, or even account-specific. During these incidents, it isn’t recommended to deploy or change services that are impacted by the event.

In this post, I will walk you through how to embed AWS Health API insights into your CI/CD pipelines to automatically stop deployments whenever an AWS Health event is reported in a Region that you’re operating in. Furthermore, I will demonstrate how you can automate detection and remediation.

### **The Demo**
In this demo, I will use [AWS CodePipeline](https://aws.amazon.com/codepipeline/) to demonstrate the idea. I will build a simple pipeline that demonstrates the concept without going into the build, test, and deployment specifics.

### **CodePipeline Flow**
The CodePipeline flow consists of three stages:

1. Source stage that downloads a CloudFormation template from [AWS CodeCommit](https://aws.amazon.com/codecommit/). The template will be deployed in the last stage.
2. Custom stage that invokes an [AWS Lambda](https://aws.amazon.com/lambda/) function to evaluate the AWS Health status. The Lambda function calls the AWS Health API, evaluates the health risk, and calls back CodePipeline with the assessment result.
3. Deploy stage that deploys the CloudFormation template downloaded from CodeCommit in the first stage.

![image.png](https://dev-media.amazoncloud.cn/1e28dc7022324d30a8b0406c3d230c64_image.png)

Figure 1. CodePipeline workflow.

### **Lambda evaluation logic**

The Lambda function evaluates whether the deployment may be impacted by a running AWS Health event. In this demo, the function applies the following filters, and the deployment is considered safe only if no event matches all of them:

- The deployment takes place in the North Virginia Region, so the Lambda function filters on the **us-east-1** Region.
- A closed event is irrelevant. The Lambda function keeps only events with the **open** status.
- The AWS Health API can return event types that may not be relevant, such as Scheduled Maintenance and Account and Billing notifications. The Lambda function keeps only “**Issue**” type events.

The AWS Health API follows a multi-Region application architecture and has two regional endpoints in an active-passive configuration. To support active-passive DNS failover, AWS Health provides a global endpoint. The Python code is available on [GitHub](https://github.com/aws-samples/building-health-aware-cicd-pipelines), with more information in the README on how to build the Lambda code package.

The Lambda function requires the following AWS Identity and Access Management (IAM) permissions to access the AWS Health API, call back CodePipeline, and publish logs to CloudWatch:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "logs:CreateLogStream",
        "logs:CreateLogGroup",
        "logs:PutLogEvents"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:logs:us-east-1:replaceWithAccountNumber:*"
    },
    {
      "Action": [
        "codepipeline:PutJobSuccessResult",
        "codepipeline:PutJobFailureResult"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "health:DescribeEvents",
      "Resource": "*"
    }
  ]
}
```

### **Solution architecture**

![image.png](https://dev-media.amazoncloud.cn/fbb632988c9e4c4cab8df997e4df6193_image.png)

Figure 2. Solution architecture diagram.

In CodePipeline, create a new stage with a single action that asynchronously invokes a Lambda function. The function calls the AWS Health [DescribeEvents API](https://docs.aws.amazon.com/health/latest/APIReference/API_DescribeEvents.html) to retrieve the list of active health incidents. Then, the function analyzes the events and decides whether any of them may impact the running deployment. Finally, the function calls back CodePipeline with the evaluation result through either the [PutJobSuccessResult](https://docs.aws.amazon.com/codepipeline/latest/APIReference/API_PutJobSuccessResult.html) or the [PutJobFailureResult](https://docs.aws.amazon.com/codepipeline/latest/APIReference/API_PutJobFailureResult.html) API operation.

If the evaluation finds no blocking event, the function calls back the pipeline with PutJobSuccessResult. In turn, the pipeline marks the action as successful and proceeds to complete the execution.

![image.png](https://dev-media.amazoncloud.cn/863afe0b804e403296f9a3214cf790d5_image.png)

Figure 3. AWS CodePipeline workflow successful execution.

If the evaluation finds a blocking event, the function calls back the pipeline with PutJobFailureResult, specifying a failure message. The pipeline marks the stage as failed and stops before the Deploy stage. Once the DevOps team is made aware that the event has been resolved, they can select the **Retry** button to re-evaluate the health status.

![image.png](https://dev-media.amazoncloud.cn/c7b2a5255c194124b55c742745667387_image.png)

Figure 4. AWS CodePipeline workflow failed execution.

Your DevOps team must be aware of failed deployments. Therefore, it’s a good idea to configure alerts that notify the concerned stakeholders of failed stage executions. Create a notification rule that posts a Slack message if a stage fails. For detailed steps, see [Create a notification rule – AWS CodePipeline](https://docs.aws.amazon.com/codepipeline/latest/userguide/notification-rule-create.html). In case of failure, a Slack notification will be sent through [AWS Chatbot](https://aws.amazon.com/chatbot/).

![image.png](https://dev-media.amazoncloud.cn/e871238f44744c3488392ec2696d42f4_image.png)

Figure 5. Slack UI snapshot notification for a failed deployment.

A more elegant solution involves pushing the notification to an SNS topic that in turn calls a Lambda function to retry the failed stage. The Lambda function extracts the identifier of the failed pipeline stage, and then calls the [RetryStageExecution](https://docs.aws.amazon.com/codepipeline/latest/APIReference/API_RetryStageExecution.html) CodePipeline API.

### **Conclusion**
We’ve learned how to create an automation that evaluates the risk of proceeding with a deployment while an AWS Health event is running. The automation then decides whether to proceed with the deployment or block it to avoid unintended downtime, which improves the availability of your application.

This solution isn’t exclusive to CodePipeline. The same pattern can be applied to other CI/CD tools that your DevOps team uses.

#### **Author:**

![image.png](https://dev-media.amazoncloud.cn/b123a212200e4d4589ae7a41dd654d0c_image.png)

**Islam Ghanim**
Islam Ghanim is a Senior Technical Account Manager at Amazon Web Services in Melbourne, Australia. He enjoys helping customers build resilient and cost-efficient architectures. Outside work, he plays squash, tennis and almost any other racket sport.
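To make the Lambda evaluation logic from this post more concrete, here is a minimal Python sketch of a health gate. It is not the code from the GitHub repository. It assumes the function is invoked by a CodePipeline Lambda action, that the active AWS Health endpoint is the one in us-east-1 (the sample in the repository deals with the global endpoint, which this sketch skips), and that blocking the deployment when the health check itself fails is the behavior you want. The names `health_filter` and `assess`, and the optional `health` and `codepipeline` parameters, are hypothetical helpers introduced for illustration.

```python
REGION = "us-east-1"  # deployment Region used in this post


def health_filter(region=REGION):
    """DescribeEvents filter: open, issue-type events in the deployment Region."""
    return {
        "regions": [region],
        "eventStatusCodes": ["open"],
        "eventTypeCategories": ["issue"],
    }


def assess(events):
    """Return (safe_to_deploy, message) for events that already match the filter."""
    if not events:
        return True, "No open AWS Health issues found"
    summary = ", ".join(
        f"{e.get('service', '?')}:{e.get('eventTypeCode', '?')}" for e in events
    )
    # Keep the message short enough for the CodePipeline failure details.
    return False, f"Open AWS Health issues: {summary}"[:5000]


def lambda_handler(event, context, health=None, codepipeline=None):
    # Clients can be injected for local testing; Lambda passes only event and context.
    if health is None or codepipeline is None:
        import boto3  # imported lazily so the pure logic runs without boto3

        # Assumption: us-east-1 is the active AWS Health endpoint.
        health = health or boto3.client("health", region_name="us-east-1")
        codepipeline = codepipeline or boto3.client("codepipeline")

    job_id = event["CodePipeline.job"]["id"]
    try:
        events = []
        paginator = health.get_paginator("describe_events")
        for page in paginator.paginate(filter=health_filter()):
            events.extend(page.get("events", []))
        safe, message = assess(events)
    except Exception as exc:
        # Assumption: fail closed, so an unreachable Health API blocks the deployment.
        safe, message = False, f"Health check failed: {exc}"

    if safe:
        codepipeline.put_job_success_result(jobId=job_id)
    else:
        codepipeline.put_job_failure_result(
            jobId=job_id,
            failureDetails={"type": "JobFailed", "message": message},
        )
    return {"safe": safe, "message": message}
```

Injecting the clients keeps the decision logic testable without AWS credentials. In Lambda, the defaults apply and the function needs only the IAM permissions listed earlier in the post.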
