How Amazon DevOps Guru for RDS helps NRI Digital with database performance monitoring

*This guest post is co-authored by Ryota Shima, Application Architect, and Kazuki Matsumura, Lead Architect at NRI Digital.*

NRI Digital has a wide variety of systems in production, both on-premises and cloud-based. Many of them are built on Amazon Web Services, and [Amazon Aurora](https://aws.amazon.com/rds/aurora/) and [Amazon Relational Database Service](http://aws.amazon.com/rds) (Amazon RDS) are often used as the database tier.

For systems running in production, the top root cause of downtime is database-related performance issues. Even if a system has been thoroughly tested in a test environment, production workloads are difficult to predict in advance. Therefore, the key to operating a high-quality production system is to quickly detect issues, identify the cause, and respond to the problem when it occurs.

However, database performance issues are difficult to analyze and require a considerable skill set to solve. When database-related performance issues occurred in our production environments, developers would spend hours gathering and analyzing information to find the root cause. If developers didn’t have enough information to find the root cause, the problem would recur. Furthermore, recent advancements in cloud computing and cloud-native technologies have made our system environment more decentralized, which makes it even harder to pinpoint an issue.

NRI Digital believes that there are limits to approaches that rely on the skills of individuals and teams, and that machine learning (ML)-powered services that analyze production workloads in real time, identify bottlenecks, and recommend areas for improvement will become necessary in the future.

Amazon DevOps Guru for RDS was [announced](https://aws.amazon.com/about-aws/whats-new/2021/12/amazon-devops-guru-rds-ml-powered-capability-amazon-aurora/) at Amazon Web Services re:Invent 2021. The service is a new ML-powered capability of [Amazon DevOps Guru](https://aws.amazon.com/devops-guru/) that is designed to allow developers and DevOps engineers to quickly detect, diagnose, and remediate a wide variety of database-related issues in Amazon RDS.

In this post, we describe the DevOps Guru for RDS proof of concept (POC) process for NRI Digital.

### **How Amazon DevOps Guru for RDS can solve our problem**

NRI Digital considers the following items necessary for performance monitoring:

1. **Detection** – Detect anomalies quickly
2. **Root cause analysis** – Immediately identify what’s happening on the database that is causing performance degradation, and determine the root cause
3. **Response (workaround or permanent)** – Identify the specific remediation process for the problematic area and apply it to fix the problem and prevent recurrence

The POC consisted of running multiple SQL statements for several minutes, mixing normal statements with inefficient, high-load ones, so that we could easily experience the full functionality and usability of DevOps Guru for RDS. For this POC, we used [Amazon Aurora MySQL-Compatible Edition](https://aws.amazon.com/rds/aurora/mysql-features/).

Shortly after the process started, an alert email was sent to engineers, as shown in the following screenshot.

![image.png](https://dev-media.amazoncloud.cn/d23451c0a95c4f46bc2eb41dbf362dfb_image.png)

This notification indicated an anomaly in the DB load metric. For detailed analysis, we went to the DevOps Guru for RDS dashboard.

On the DevOps Guru **Reactive insights** page, we noticed that an “RDS DB load anomaly” was an ongoing event. By choosing **RDS DB Load Anomalous**, we reviewed the aggregated metrics.

![image.png](https://dev-media.amazoncloud.cn/283a43047668419095b6cf8ca3bf3df2_image.png)

On the anomaly page, we saw the aggregated metrics that DevOps Guru detected as anomalous. By choosing **View analysis** under **DB Load**, we could see the analysis for this metric.

![image.png](https://dev-media.amazoncloud.cn/3b7cc8a7951a4177ac569b66c59eca79_image.png)

The first part of the analysis page helped visualize the anomalous metrics, and the bottom section provided the analysis results and recommendations.

![image.png](https://dev-media.amazoncloud.cn/6202ada90c304748a23da942c966d2c0_image.png)

The analysis section on the left provided the following information:

- `ActiveSession` exceeds 9
- The cause of the stagnation of `ActiveSession` is the DB load related to I/O wait
- The DB load related to the I/O wait accounts for 82% of the total DB load
- `ActiveSession` needs to be lowered to 2 in order for this database to be in a normal state

This made it easy to determine what was happening on the database that was slowing its performance.

The next step was to find out what was causing this event. The recommendation on the right side of the **Analysis and recommendations** section listed SQL digest IDs to investigate.

![image.png](https://dev-media.amazoncloud.cn/fba6e079a84742e39ea8729f43d911d4_image.png)

By choosing **View Top SQL in Performance Insights**, we could check the likely causal SQL information using [Amazon RDS Performance Insights](https://aws.amazon.com/rds/performance-insights/).

![image.png](https://dev-media.amazoncloud.cn/3ac16e72ac1f4cd3a25ec8929280c7f3_image.png)

We were able to easily identify what was causing the performance delays. In this case, we identified the SQL statement that was causing the latency, along with its statistics (average latency).

Then we checked the recommendations for specific remedies for the SQL that seemed to be causing the problem.

![image.png](https://dev-media.amazoncloud.cn/c1ca12fc704c4dd8a4b56072ea9b45ce_image.png)

A link was provided to the Aurora MySQL troubleshooting guide for the wait event [io/table/sql/handler](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/ams-waits.waitio.html#ams-waits.waitio.context). It described the possible causes of increased wait events and the actions to take to investigate.

As of this writing, DevOps Guru for RDS doesn’t provide recommendations for how to optimize SQL queries, but we look forward to when this additional functionality becomes available. In the meantime, we proceeded to tune the query in question ourselves.

Based on the results of this POC and our review of the service specifications, the following table compares performance monitoring with and without DevOps Guru for RDS.

![image.png](https://dev-media.amazoncloud.cn/b373169704f848e198e165f43a1e44b9_image.png)

### **Conclusion**

Through our POC, we determined that DevOps Guru for RDS has the potential to fundamentally solve the conventional performance monitoring issues we mentioned in this post.

Long-term performance problems can lead to loss of end-user trust and an increased potential for lost opportunities. DevOps Guru for RDS has the potential to allow our teams, whatever their skill set, to approach problem resolution quickly, rather than depend on individual skills and expertise. Refer to [Amazon DevOps Guru for RDS](https://aws.amazon.com/devops-guru/features/devops-guru-for-rds/) to learn more.

### **About us**

![image.png](https://dev-media.amazoncloud.cn/17475cbf322a4f3b95821f509d56cf8b_image.png)

[NRI Digital](https://www.nri-digital.jp/), established in August 2016, is a digital business specialist of the Nomura Research Institute (NRI) Group. Experts in consulting and solutions in the digital field work with client companies to support them from the conceptualization of digitalization strategies through the selection and construction of advanced IT solutions, support for business execution, and verification and improvement of the entire project.

#### **About the Authors**

![image.png](https://dev-media.amazoncloud.cn/bf37fc92aabb46a1b88d37cc53581831_image.png)

**Ryota Shima** is an Application Architect at NRI Digital.

![image.png](https://dev-media.amazoncloud.cn/0a7d2ed838ab445a903333339ff4994d_image.png)

**Kazuki Matsumura** is a Lead Architect at NRI Digital.
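
As a supplement to the analysis step in the walkthrough, the following minimal Python sketch shows how a figure such as "I/O wait accounts for 82% of the total DB load" can be derived when DB load is broken down by wait event. The per-event values are hypothetical numbers chosen only so that the total and the share match the figures reported in the post (about 9 active sessions in total, 82% on the I/O wait event). The function names are illustrative and are not part of any AWS API.

```python
from typing import Dict, Tuple


def wait_event_shares(load_by_wait: Dict[str, float]) -> Dict[str, float]:
    """Return each wait event's fraction of the total DB load."""
    total = sum(load_by_wait.values())
    if total <= 0:
        raise ValueError("total DB load must be positive")
    return {event: load / total for event, load in load_by_wait.items()}


def dominant_wait(load_by_wait: Dict[str, float]) -> Tuple[str, float]:
    """Return the wait event with the largest share, and that share."""
    shares = wait_event_shares(load_by_wait)
    event = max(shares, key=shares.get)
    return event, shares[event]


# Hypothetical breakdown of DB load (average active sessions) by wait event.
# Only the total (~9) and the 82% I/O share come from the post; the split
# between "CPU" and "other" is invented for illustration.
sample_load = {
    "io/table/sql/handler": 7.38,
    "CPU": 1.20,
    "other": 0.42,
}

event, share = dominant_wait(sample_load)
print(f"{event}: {share:.0%} of DB load")
```

In practice the per-event values would come from a monitoring source such as Performance Insights rather than a hard-coded dictionary. The sketch covers only the share calculation, not how DevOps Guru for RDS itself detects anomalies.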