Achieving Large-Scale Cost and Efficiency Optimization of Apache Kafka with AutoMQ


[Apache Kafka](https://kafka.apache.org/) is a distributed streaming platform widely used for building real-time data pipelines and streaming applications. It is renowned for its high throughput, low latency, and scalability. Kafka has a wide range of applications:

- Message Queue: Used as high-performance message middleware to decouple system components with different production and consumption speeds.
- Real-Time Stream Processing: Combined with tools like Apache Flink and Spark Streaming to achieve real-time data analysis and processing.
- Data Integration: Acts as a bridge between different data sources and target systems, supporting data synchronization and ETL (Extract, Transform, Load) processes.
- Metrics and Monitoring: Collects and processes performance metrics and monitoring data of applications or systems.

In the era of cloud computing, software defines hardware and provides high availability and high reliability with Service Level Agreement (SLA) guarantees. Building Kafka with cloud services eliminates the need for implementing complex distributed multi-replica replication protocols, resulting in a more straightforward architecture and lower costs.

AutoMQ Cloud is a next-generation fully managed Kafka cloud service provided by [AutoMQ CO., LTD](https://www.automq.com/company). Using [AutoMQ Cloud](https://www.automq.com/product) enables enterprise developers to easily build and run event streaming applications in a public cloud environment without worrying about cluster operation and maintenance. AutoMQ Cloud is 100% compatible with open-source Apache Kafka and offers significant enhancements and improvements over the community version for enterprise scenarios such as high availability disaster recovery architecture, elasticity, and observable operations. Additionally, AutoMQ Cloud provides commercial support versions of RocketMQ, which are currently available on the AWS Marketplace.
This article focuses on the technical details and core advantages of the AutoMQ architecture on AWS.

#### Overview of Architecture

![Image 1.png](https://dev-media.amazoncloud.cn/e42fc2fdc95a477e8e78fca576a1ba1e_%E5%9B%BE%E7%89%87%201.png "Image 1.png")

- AutoMQ Deployment on [Amazon EKS](https://aws.amazon.com/cn/eks/?trk=cndc-detail): AutoMQ supports deployment on [Amazon Elastic Kubernetes Service](https://aws.amazon.com/cn/eks/?trk=cndc-detail) ([Amazon EKS](https://aws.amazon.com/cn/eks/?trk=cndc-detail)), allowing users to run their workloads on Kubernetes without installing, operating, or maintaining their own Kubernetes control plane.
- Integration with AWS Networking and Security Services: With [Amazon EKS](https://aws.amazon.com/cn/eks/?trk=cndc-detail), workloads are integrated with AWS networking and security services, including [AWS Identity and Access Management](https://aws.amazon.com/cn/iam/?trk=cndc-detail) (IAM) for authenticating to Kubernetes clusters. This also covers resource management for S3 and EBS.
- S3Stream, the Core Streaming Storage Component: S3Stream is the core streaming storage component in AutoMQ and follows the principle of separating storage from compute. It offloads Apache Kafka's native ISR-based log storage layer to cloud storage, namely EBS and object storage. AutoMQ implements a set of core streaming storage APIs on top of object storage, including offset management and the Append, Fetch, and Trim operations.
- Cost Optimization with [Amazon EC2](https://aws.amazon.com/cn/ec2/?trk=cndc-detail) Spot Instances: Managed node groups can be configured with [Amazon EC2](https://aws.amazon.com/cn/ec2/?trk=cndc-detail) Spot instances to optimize the cost of compute nodes running in the [Amazon EKS](https://aws.amazon.com/cn/eks/?trk=cndc-detail) cluster.
- Support for AWS Graviton Processor Instances: AutoMQ supports ARM-based AWS Graviton processor instances, providing the best price-performance ratio for running cloud workloads.
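The Append, Fetch, and Trim operations named in the S3Stream bullet above can be pictured with a small in-memory model. This is only an illustrative sketch: the class `InMemoryStream` and its method signatures are hypothetical and are not AutoMQ's actual S3Stream API, and a plain Python list stands in for EBS and object storage.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStream:
    """Hypothetical toy model of an append-only stream with offsets."""

    start_offset: int = 0  # first offset that can still be fetched
    records: list = field(default_factory=list)  # records[i] has offset start_offset + i

    @property
    def next_offset(self) -> int:
        # Offset that the next appended record will receive.
        return self.start_offset + len(self.records)

    def append(self, payload: bytes) -> int:
        """Append one record and return the offset assigned to it."""
        offset = self.next_offset
        self.records.append(payload)
        return offset

    def fetch(self, start: int, end: int) -> list:
        """Return the records with offsets in the half-open range [start, end)."""
        if start < self.start_offset or end > self.next_offset or start > end:
            raise ValueError("requested range is outside the readable offsets")
        return self.records[start - self.start_offset:end - self.start_offset]

    def trim(self, new_start: int) -> None:
        """Drop every record below new_start; later offsets are unchanged."""
        if new_start <= self.start_offset:
            return
        new_start = min(new_start, self.next_offset)
        del self.records[:new_start - self.start_offset]
        self.start_offset = new_start


stream = InMemoryStream()
for payload in (b"a", b"b", b"c"):
    stream.append(payload)
stream.trim(1)  # offset 0 is no longer readable
print(stream.fetch(1, 3))
```

The point of the model is that offsets are stable: trimming removes old data without renumbering what remains, which is what lets consumers keep their positions.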
#### Core Advantages

- Pay-As-You-Go with Up to 50% Cloud Bill Reduction: AutoMQ's cloud-native architecture leverages object storage and elastic cloud computing resources, offering significant cost advantages over self-built Apache Kafka and other solutions. By subscribing to AutoMQ Cloud, users consume resources on demand and pay per usage, with potential cloud bill reductions of up to 50%.
- Fully Managed and Maintenance-Free with SLA Guarantee: AutoMQ Cloud is a fully managed service provided by the AutoMQ team for public cloud environments. Users can start using the service with a single click, without worrying about cluster deployment, changes, or maintenance operations. The AutoMQ team offers 24/7 operational support and SLA guarantees.
- Support for BYOC, Resource Deployment in User VPC, No Cross-Network Data: In addition to the traditional SaaS model, AutoMQ Cloud supports a BYOC (Bring Your Own Cloud) model, where cloud resources are deployed in the user's VPC. Users can then access AutoMQ without cross-network connectivity and can apply the resource discounts and benefits of their existing cloud accounts.
- Enterprise-Grade Capabilities Out-of-the-Box: AutoMQ Cloud provides enterprise-grade enhancements such as automatic elastic scaling, second-level partition migration, traffic self-balancing, and system observability integration. Users can use these capabilities immediately after service activation without additional development.

These core advantages make AutoMQ Cloud an attractive option for organizations looking to optimize costs, ensure high availability, and leverage advanced enterprise features for their Kafka workloads in a cloud environment.

#### Functionality and Performance Testing

##### Compatibility

AutoMQ strategically reuses Apache Kafka's computational layer code in its technical architecture, making minimal replacements only at the storage layer.
This approach ensures full compatibility with the corresponding versions of Apache Kafka, so applications based on Apache Kafka can transition to AutoMQ seamlessly. During compatibility validation, AutoMQ ran Apache Kafka's test case suite and passed the tests for the corresponding versions. The test results and the compatibility relationship between AutoMQ and Apache Kafka versions are as follows:

![image.png](https://dev-media.amazoncloud.cn/0a3d3d12d0824899a97d909b53549a8d_image.png "image.png")

![image.png](https://dev-media.amazoncloud.cn/7c0a1d1c06a64df39fdfe327876b8af8_image.png "image.png")

##### Elasticity

[Amazon EKS](https://aws.amazon.com/cn/eks/?trk=cndc-detail) supports rapid scaling of compute nodes through Karpenter, an open-source autoscaling project designed specifically for Kubernetes. It enhances the availability of Kubernetes applications by automatically scaling compute resources based on the aggregated resource requests of unscheduled Pods. Karpenter decides when to launch or terminate nodes and provides the right amount of compute resources within seconds rather than minutes, minimizing scheduling delays and meeting real-time application demands.

In practice, Apache Kafka clusters often cannot benefit directly from this elasticity because migrating business traffic is complex. Operations personnel need to adjust traffic manually, a process that typically takes hours. This is impractical for online clusters with fluctuating traffic, so operators deploy clusters at maximum capacity to avoid the risk of scaling too late for a traffic peak. The result is significant resource waste.

AutoMQ achieves sub-second scaling, built on a single atomic capability: sub-second partition migration.
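For a sense of what the manual traffic adjustment involves with stock Apache Kafka: partitions are typically moved by handing a JSON reassignment plan to the `kafka-reassign-partitions.sh` tool, after which the brokers copy partition data to the new replicas. The sketch below builds such a plan by spreading replicas round-robin across brokers. The topic name, broker IDs, and the helper `build_reassignment_plan` are hypothetical, and real plans are usually tuned to limit how much data has to move.

```python
import json


def build_reassignment_plan(topic, num_partitions, brokers, replication_factor):
    """Build a plan in the JSON layout read by kafka-reassign-partitions.sh.

    Replicas for partition p are the next `replication_factor` brokers
    starting at position p, wrapping around the broker list.
    """
    if replication_factor > len(brokers):
        raise ValueError("replication factor exceeds the number of brokers")
    partitions = []
    for p in range(num_partitions):
        replicas = [brokers[(p + i) % len(brokers)] for i in range(replication_factor)]
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


# Hypothetical example: broker 4 has just been added to a three-broker cluster.
plan = build_reassignment_plan("orders", num_partitions=4, brokers=[1, 2, 3, 4],
                               replication_factor=3)
with open("plan.json", "w") as f:
    json.dump(plan, f, indent=2)
```

The resulting file would then be passed to `kafka-reassign-partitions.sh --bootstrap-server <host:port> --reassignment-json-file plan.json --execute`. Because each moved replica has to be copied over the network, the duration grows with the amount of retained data.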
Leveraging Horizontal Pod Autoscaler (HPA) and Karpenter, AutoMQ can rapidly scale its clusters and EKS nodes. After scaling, a portion of the cluster's partitions is migrated in batches to the new nodes, and traffic is typically rebalanced within tens of seconds. Automatic scaling gives clusters better cost efficiency, stability, and multi-tenant advantages.

Horizontal Pod Autoscaler (HPA) example:

```
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kafka-broker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: automq-automq-enterprise-broker
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50 # target average CPU utilization, in percent of requested CPU
```

A managed node group pre-provisions two nodes for the broker and controller instances, and Karpenter handles elastic scaling of Spot nodes. Example configuration:

```
cat <<EOF | envsubst | kubectl apply -f -
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["r6g.large", "r6g.xlarge", "r6g.2xlarge"]
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h # 30 * 24h = 720h
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2 # Amazon Linux 2
  role: "KarpenterNodeRole-eks"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "eks"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "eks"
  amiSelectorTerms:
    - id: "\${ARM_AMI_ID}"
    - id: "\${AMD_AMI_ID}"
EOF
```

When a Pod needs additional nodes, Karpenter responds promptly by launching the designated Graviton Spot nodes. Check the Karpenter logs to confirm:
`kubectl logs -f -n "${KARPENTER_NAMESPACE}" -l app.kubernetes.io/name=karpenter -c controller`

![image.png](https://dev-media.amazoncloud.cn/1579deee19e44527a8acaf6f4874e583_image.png "image.png")

The following diagram shows how the number of brokers in the AutoMQ Kafka cluster changes as traffic increases. As traffic rises linearly, brokers are dynamically created and added to the cluster to balance the load.

![image.png](https://dev-media.amazoncloud.cn/9620cdc52165492eb7e9742bba4c04c9_image.png "image.png")

The diagram below shows the traffic on the individual brokers during the traffic increase. Newly created brokers receive redistributed traffic within seconds.

![image.png](https://dev-media.amazoncloud.cn/35160ba0932c4134b60de08873c31c48_image.png "image.png")

The diagram below shows how the number of brokers in the AutoMQ Kafka cluster changes as traffic decreases. As traffic falls linearly, brokers are dynamically taken offline to conserve resources.

![image.png](https://dev-media.amazoncloud.cn/03480645ad3d425296b9b92dd58e5ca7_image.png "image.png")

The diagram below shows the traffic on the individual brokers during the traffic decrease. Each time a broker is taken offline, its workload is transferred to the remaining brokers, which is visible as a marked traffic increase on them.

![image.png](https://dev-media.amazoncloud.cn/91478cc240f4466fb813e85a1c061be5_image.png "image.png")

##### Performance

The test measures the performance and throughput limits of AutoMQ and Apache Kafka at different traffic scales with the same cluster configuration. The testing scenarios are as follows: Deploy a cluster with 23 brokers and create a topic with 1000 partitions. Simulate 1:1 read/write traffic at rates of 500 MiB/s and 1 GiB/s.
Additionally, test the maximum throughput of each system: 2200 MiB/s for AutoMQ and 1100 MiB/s for Apache Kafka. For Apache Kafka, each broker is additionally equipped with a 500 GB gp3 EBS volume (156 MiB/s throughput) for data storage.

![image.png](https://dev-media.amazoncloud.cn/61b360e4049c4ec1b373550ba25a09b6_image.png "image.png")

![image.png](https://dev-media.amazoncloud.cn/7476fde5a5ff4e29beccbe22c4803fbb_image.png "image.png")

#### Cost Comparison

The cost of an Apache Kafka cluster consists primarily of compute and storage expenses. Compute costs cover the servers that run the Kafka brokers, such as AWS EC2 instances. Storage costs cover the devices needed for data retention, such as AWS EBS volumes.

AutoMQ's cloud-native architecture optimizes both compute and storage significantly, reducing total cluster costs by up to tenfold under equivalent traffic conditions. For storage, AutoMQ uses AWS's highly reliable and cost-effective cloud storage, such as S3 and EBS gp3. For compute, it relies on on-demand pricing and elastic scaling, and it supports Spot instances: all brokers can run as Spot instances, which can significantly reduce costs.

The above workload was run continuously for 24 hours in the AWS Ningxia region (cn-northwest-1). The results, viewed in AWS CloudWatch, demonstrate dynamic scaling and show the relationship between broker count and total cluster traffic over time. See the diagram below:

![image.png](https://dev-media.amazoncloud.cn/cde41fd1766c4f4694ebf61d84d94086_image.png "image.png")

The blue line in the graph represents the total message production throughput of the cluster (that is, the total size of messages produced per second). Because the production-to-consumption ratio is 1:1, it also represents the total message consumption throughput.
The unit is bytes per second (byte/s), and the left Y-axis uses decimal prefixes (e.g., 1M = 1,000,000, 1G = 1,000,000,000). Even when cluster traffic is stable, the number of brokers may still fluctuate briefly because of [AWS Auto Scaling](https://aws.amazon.com/cn/autoscaling/?trk=cndc-detail) group rebalancing and Spot instance terminations.

Cost components: for comparison, the cost of running the same scenario on Apache Kafka (versions below 3.6.0, without tiered storage) was estimated as follows.

- Maximum throughput per single broker: 100 MiB/s × 80% / (1 + 2) = 26.67 MiB/s
- Number of brokers in the cluster: 1200 MiB/s ÷ 26.67 MiB/s = 45
- Required instance count: 45 + 3 = 48
- Daily compute cost: 48 × 24 hours × 0.88313 CNY/hour = 1017.366 CNY
- Required storage size: 16242 GiB × 3 / 80% = 60907.5 GiB
- Daily storage cost: 60907.5 GiB × 0.5312 CNY/(GiB·month) / 730 hours/month × 24 hours = 1063.695 CNY
- Total cost: 1017.366 CNY + 1063.695 CNY = 2081.061 CNY

![image.png](https://dev-media.amazoncloud.cn/046afc53bc0a4603b3f25c89b6855a7e_image.png "image.png")

##### Observability

AutoMQ uses the OpenTelemetry SDK for metrics collection and export. It exposes both Apache Kafka operational metrics and metrics of the underlying storage, and both categories are converted and exposed in the OTLP format.

![image.png](https://dev-media.amazoncloud.cn/8d42eb1d443e48dc913b83c2e008d6c6_image.png "image.png")

AutoMQ provides pre-configured Grafana dashboard templates. After exporting metrics to Prometheus, import a Grafana dashboard template and point the Grafana data source at the corresponding Prometheus instance to start monitoring AutoMQ. The templates cover different dimensions, such as Cluster Overview, Broker Metrics, and Topic Metrics.
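Before importing a dashboard, it can help to confirm that metrics are actually arriving in Prometheus. The following is a minimal sketch that calls the Prometheus HTTP API instant-query endpoint (`/api/v1/query`) using only the Python standard library. The base URL `http://localhost:9090` and the query expression are assumptions for illustration; substitute the address of your Prometheus instance and a metric that your AutoMQ deployment exports.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def parse_instant_vector(body):
    """Turn an instant-vector response body into (labels, value) pairs."""
    doc = json.loads(body)
    if doc.get("status") != "success":
        raise RuntimeError("Prometheus query failed: %s" % doc.get("error"))
    # Each sample looks like {"metric": {...labels...}, "value": [timestamp, "number"]}.
    return [(s["metric"], float(s["value"][1])) for s in doc["data"]["result"]]


def query(base_url, promql, timeout=5.0):
    """Run one instant query against a live Prometheus server."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_instant_vector(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Assumed address and a placeholder expression that counts series per job label.
    for labels, value in query("http://localhost:9090", "count by (job) ({job!=''})"):
        print(labels, value)
```

An empty result for the series you expect usually means the export path to Prometheus is not yet configured, which is worth fixing before troubleshooting the dashboards themselves.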
The Cluster Overview section provides cluster-level monitoring information, including node count, data size, cluster traffic, and other metrics. It also offers overview metrics for topics, consumer groups, and brokers, with drill-down to detailed monitoring information, as shown in the diagram below:

![image.png](https://dev-media.amazoncloud.cn/5eef8623fcc94b4a847de0c77b61509c_image.png "image.png")

#### Conclusion

This article introduces the core advantages of AutoMQ, a next-generation Apache Kafka distribution redesigned on cloud-native principles. It describes the architecture on AWS and summarizes AutoMQ's compatibility, cloud elasticity, and performance testing, highlighting the cost advantages for specific workloads. This gives users a new option for running Apache Kafka in the cloud, with an emphasis on cost efficiency and scalability. AutoMQ also offers commercial support for RocketMQ; both offerings are available on the AWS Marketplace, aiming to accelerate business value realization for users.

Looking ahead, we will continue researching integrations of AutoMQ Kafka with AWS cloud-native services, such as real-time data ingestion with [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) and real-time data synchronization using Kafka Connect plugins, focusing on large-scale message queue cost optimization and other use cases.

#### Reference

- AutoMQ Product Introduction: https://www.automq.com/zh/product
- How AutoMQ Implements Self-Balancing for Kafka on S3: https://www.automq.com/zh/blog/how-to-implement-self-balancing-for-kafka-on-s3
- AutoMQ Testing Report: https://docs.automq.com/zh/docs/automq-opensource/UjE4wOmajifbrtkSJKecAfrrnvb
- AutoMQ Cloud Architecture Overview: https://docs.automq.com/zh/docs/automq-onperm/Dtv2wrUVPiBxc3kgs4cciWD4nQh
- Amazon EKS Introduction: https://aws.amazon.com/cn/eks/
- Karpenter: https://karpenter.sh/
