How to Create a Highly Available ShardingSphere-Proxy Cluster on Amazon Web Services with Terraform

## **Background**

### **Terraform**

Terraform [1] is an open-source infrastructure automation and orchestration tool from HashiCorp [2]. It applies the Infrastructure as Code (IaC) philosophy to manage infrastructure changes, is supported by public cloud vendors such as Amazon Web Services, GCP, and Azure as well as by a wide range of community-provided providers, and has become one of the most popular practices in the IaC space. Terraform offers the following advantages:

📌 **Multi-cloud deployment**

Terraform is suitable for multi-cloud scenarios, deploying similar infrastructure to Alibaba Cloud, other cloud providers, or on-premises data centers. Developers can use the same tool and similar configuration files to manage resources across different cloud providers at the same time.

📌 **Automated infrastructure management**

Terraform supports creating reusable modules, which reduces deployment and management errors caused by human factors.

📌 **Infrastructure as Code**

Resources can be managed and maintained as code, and infrastructure state can be saved. This lets users track the changes made to different components of the system and share the configurations with others.

### **ShardingSphere-Proxy**

Apache ShardingSphere is a distributed database ecosystem that can transform any database into a distributed database and enhance the original database with capabilities such as data sharding, elastic scaling, and encryption.

Its design philosophy is Database Plus: building a standard and an ecosystem on top of heterogeneous databases. It focuses on making full use of the computing and storage capabilities of existing databases rather than building a brand-new database. Standing above the databases, it pays more attention to the collaboration between databases than to the databases themselves.

ShardingSphere-Proxy is positioned as a transparent database proxy. It theoretically supports any client that speaks the MySQL, PostgreSQL, or openGauss protocol, and it is friendly to heterogeneous languages and operations scenarios. It is non-intrusive to application code: users only need to change the database connection string to get features such as data sharding and read/write splitting. As part of the data infrastructure, its own high availability is critical.

## **Deploying with Terraform**

We want you to deploy and manage ShardingSphere-Proxy clusters the IaC way and enjoy the benefits IaC brings. For that reason, we plan to use Terraform to create a multi-AZ, highly available ShardingSphere-Proxy cluster. Before writing any Terraform configuration, we first need to understand the basic architecture of a ShardingSphere-Proxy cluster:

![image.png](https://dev-media.amazoncloud.cn/8753ebad00754e1fa3b14c4eb66b13f8_image.png "image.png")

We use ZooKeeper as the Governance Center. As the diagram shows, ShardingSphere-Proxy itself is a stateless application, so in real-world scenarios it is enough to put a load balancer in front of it and let the load balancer distribute traffic elastically across the instances. To ensure high availability of both the ZooKeeper cluster and the ShardingSphere-Proxy cluster, we will build the following architecture:

![image.png](https://dev-media.amazoncloud.cn/c234a1ec5c8644478d851489eb42f3ed_image.png "image.png")

### **ZooKeeper cluster**

**Defining input parameters**

To make the configuration reusable, we define a set of variables:

```
variable "cluster_size" {
  type        = number
  description = "The cluster size that same size as available_zones"
}

variable "key_name" {
  type        = string
  description = "The ssh keypair for remote connection"
}

variable "instance_type" {
  type        = string
  description = "The EC2 instance type"
}

variable "vpc_id" {
  type        = string
  description = "The id of VPC"
}

variable "subnet_ids" {
  type        = list(string)
  description = "List of subnets sorted by availability zone in your VPC"
}

variable "security_groups" {
  type        = list(string)
  default     = []
  description = "List of the Security Groups, they must allow access to ports 2181, 2888, 3888"
}

variable "hosted_zone_name" {
  type        = string
  default     = "shardingsphere.org"
  description = "The name of the hosted private zone"
}

variable "tags" {
  type        = map(any)
  description = "A map of zk instance resource, the default tag is Name=zk-$${count.idx}"
  default     = {}
}

variable "zk_version" {
  type        = string
  description = "The zookeeper version"
  default     = "3.7.1"
}

variable "zk_config" {
  default = {
    client_port = 2181
    zk_heap     = 1024
  }
  description = "The default config of zookeeper server"
}
```

These variables can also be changed later, when installing the ShardingSphere-Proxy cluster.

**Configuring the ZooKeeper cluster**

For the ZooKeeper service instances we use the native Amazon Web Services `amzn2-ami-hvm` image. We use the `count` meta-argument to deploy the ZooKeeper service; it tells Terraform to create `var.cluster_size` nodes for the ZooKeeper cluster.

When creating the ZooKeeper instances, we use the `ignore_changes` argument to ignore manual changes to tags, so that the instances are not recreated the next time Terraform runs.

We use `cloud-init` to initialize the ZooKeeper configuration; see [3] for details. We also create a domain name for each ZooKeeper server, so applications only need to use the domain names and are not affected by IP address changes caused by ZooKeeper service restarts.

```
data "aws_ami" "base" {
  owners = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-ebs"]
  }

  most_recent = true
}

data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_network_interface" "zk" {
  count           = var.cluster_size
  subnet_id       = element(var.subnet_ids, count.index)
  security_groups = var.security_groups
}

resource "aws_instance" "zk" {
  count         = var.cluster_size
  ami           = data.aws_ami.base.id
  instance_type = var.instance_type
  key_name      = var.key_name

  network_interface {
    delete_on_termination = false
    device_index          = 0
    network_interface_id  = element(aws_network_interface.zk.*.id, count.index)
  }

  tags = merge(
    var.tags,
    {
      Name = "zk-${count.index}"
    }
  )

  user_data = base64encode(templatefile("${path.module}/cloud-init.yml", {
    version     = var.zk_version
    nodes       = range(1, var.cluster_size + 1)
    domain      = var.hosted_zone_name
    index       = count.index + 1
    client_port = var.zk_config["client_port"]
    zk_heap     = var.zk_config["zk_heap"]
  }))

  lifecycle {
    ignore_changes = [
      # Ignore changes to tags.
      tags
    ]
  }
}

data "aws_route53_zone" "zone" {
  name         = "${var.hosted_zone_name}."
  private_zone = true
}

resource "aws_route53_record" "zk" {
  count   = var.cluster_size
  zone_id = data.aws_route53_zone.zone.zone_id
  name    = "zk-${count.index + 1}"
  type    = "A"
  ttl     = 60
  records = element(aws_network_interface.zk.*.private_ips, count.index)
}
```

**Defining outputs**

After `terraform apply` completes successfully, the IP addresses of the ZooKeeper service instances and the corresponding domain names are output.

```
output "zk_node_private_ip" {
  value       = aws_instance.zk.*.private_ip
  description = "The private ips of zookeeper instances"
}

output "zk_node_domain" {
  value       = [for v in aws_route53_record.zk.*.name : format("%s.%s", v, var.hosted_zone_name)]
  description = "The private domain names of zookeeper instances for use by ShardingSphere Proxy"
}
```

### **ShardingSphere-Proxy cluster**

**Defining input parameters**

As before, the input parameters are defined to make the configuration reusable.

```
variable "cluster_size" {
  type        = number
  description = "The cluster size that same size as available_zones"
}

variable "shardingsphere_proxy_version" {
  type        = string
  description = "The shardingsphere proxy version"
}

variable "shardingsphere_proxy_asg_desired_capacity" {
  type        = string
  default     = "3"
  description = "The desired capacity is the initial capacity of the Auto Scaling group at the time of its creation and the capacity it attempts to maintain. see https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-as-group.html#cfn-as-group-desiredcapacitytype, The default value is 3"
}

variable "shardingsphere_proxy_asg_max_size" {
  type        = string
  default     = "6"
  description = "The maximum size of ShardingSphere Proxy Auto Scaling Group. The default value is 6"
}

variable "shardingsphere_proxy_asg_healthcheck_grace_period" {
  type        = number
  default     = 120
  description = "The amount of time, in seconds, that Amazon EC2 Auto Scaling waits before checking the health status of an EC2 instance that has come into service and marking it unhealthy due to a failed health check. see https://docs.aws.amazon.com/autoscaling/ec2/userguide/health-check-grace-period.html"
}

variable "image_id" {
  type        = string
  description = "The AMI id"
}

variable "key_name" {
  type        = string
  description = "The ssh keypair for remote connection"
}

variable "instance_type" {
  type        = string
  description = "The EC2 instance type"
}

variable "vpc_id" {
  type        = string
  description = "The id of your VPC"
}

variable "subnet_ids" {
  type        = list(string)
  description = "List of subnets sorted by availability zone in your VPC"
}

variable "security_groups" {
  type        = list(string)
  default     = []
  description = "List of the Security Group IDs"
}

variable "lb_listener_port" {
  type        = string
  description = "The lb listener port"
}

variable "hosted_zone_name" {
  type        = string
  default     = "shardingsphere.org"
  description = "The name of the hosted private zone"
}

variable "zk_servers" {
  type        = list(string)
  description = "The Zookeeper servers"
}
```

**Configuring the Auto Scaling Group**

We create an Auto Scaling Group to manage the ShardingSphere-Proxy instances. The health check type of the Auto Scaling Group is set to "ELB", so that when the load balancer's health check on an instance fails, the Auto Scaling Group can promptly remove the bad node.

When creating the Auto Scaling Group, changes to `load_balancers` and `target_group_arns` are ignored. We again use `cloud-init` to configure the ShardingSphere-Proxy instances; see [4] for details.

```
resource "aws_launch_template" "ss" {
  name                                 = "shardingsphere-proxy-launch-template"
  image_id                             = var.image_id
  instance_initiated_shutdown_behavior = "terminate"
  instance_type                        = var.instance_type
  key_name                             = var.key_name

  iam_instance_profile {
    name = aws_iam_instance_profile.ss.name
  }

  user_data = base64encode(templatefile("${path.module}/cloud-init.yml", {
    version       = var.shardingsphere_proxy_version
    version_elems = split(".", var.shardingsphere_proxy_version)
    zk_servers    = join(",", var.zk_servers)
  }))

  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required"
    http_put_response_hop_limit = 1
    instance_metadata_tags      = "enabled"
  }

  monitoring {
    enabled = true
  }

  vpc_security_group_ids = var.security_groups

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name = "shardingsphere-proxy"
    }
  }
}

resource "aws_autoscaling_group" "ss" {
  name                      = "shardingsphere-proxy-asg"
  availability_zones        = data.aws_availability_zones.available.names
  desired_capacity          = var.shardingsphere_proxy_asg_desired_capacity
  min_size                  = 1
  max_size                  = var.shardingsphere_proxy_asg_max_size
  health_check_grace_period = var.shardingsphere_proxy_asg_healthcheck_grace_period
  health_check_type         = "ELB"

  launch_template {
    id      = aws_launch_template.ss.id
    version = "$Latest"
  }

  lifecycle {
    ignore_changes = [load_balancers, target_group_arns]
  }
}
```

**Configuring the load balancer**

The Auto Scaling Group created in the previous step is attached to the load balancer, and traffic passing through the load balancer is automatically routed to the ShardingSphere-Proxy instances created by the Auto Scaling Group. The `aws_lb` resource itself is not shown in this excerpt; the complete code in [6] includes it.

```
resource "aws_lb_target_group" "ss_tg" {
  name               = "shardingsphere-proxy-lb-tg"
  port               = var.lb_listener_port
  protocol           = "TCP"
  vpc_id             = var.vpc_id
  preserve_client_ip = false

  health_check {
    protocol            = "TCP"
    healthy_threshold   = 2
    unhealthy_threshold = 2
  }

  tags = {
    Name = "shardingsphere-proxy"
  }
}

resource "aws_autoscaling_attachment" "asg_attachment_lb" {
  autoscaling_group_name = aws_autoscaling_group.ss.id
  lb_target_group_arn    = aws_lb_target_group.ss_tg.arn
}

resource "aws_lb_listener" "ss" {
  load_balancer_arn = aws_lb.ss.arn
  port              = var.lb_listener_port
  protocol          = "TCP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.ss_tg.arn
  }

  tags = {
    Name = "shardingsphere-proxy"
  }
}
```

**Configuring the domain name**

We create an internal domain name, `proxy.shardingsphere.org` by default, that points to the load balancer created in the previous step.

```
data "aws_route53_zone" "zone" {
  name         = "${var.hosted_zone_name}."
  private_zone = true
}

resource "aws_route53_record" "ss" {
  zone_id = data.aws_route53_zone.zone.zone_id
  name    = "proxy"
  type    = "A"

  alias {
    name                   = aws_lb.ss.dns_name
    zone_id                = aws_lb.ss.zone_id
    evaluate_target_health = true
  }
}
```

**Configuring CloudWatch**

We use STS to create a role with CloudWatch permissions. The role is attached to the ShardingSphere-Proxy instances created by the Auto Scaling Group, and their runtime logs are collected into CloudWatch by the CloudWatch Agent.

A log group named `shardingsphere-proxy.log` is created by default; see [5] for the detailed CloudWatch configuration.

```
resource "aws_iam_role" "sts" {
  name               = "shardingsphere-proxy-sts-role"
  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}

resource "aws_iam_role_policy" "ss" {
  name   = "shardingsphere-proxy-policy"
  role   = aws_iam_role.sts.id
  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "cloudwatch:PutMetricData",
        "ec2:DescribeTags",
        "logs:PutLogEvents",
        "logs:DescribeLogStreams",
        "logs:DescribeLogGroups",
        "logs:CreateLogStream",
        "logs:CreateLogGroup"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}
EOF
}

resource "aws_iam_instance_profile" "ss" {
  name = "shardingsphere-proxy-instance-profile"
  role = aws_iam_role.sts.name
}
```

### **Deployment**

Once all the Terraform configuration has been written, you can deploy the ShardingSphere-Proxy cluster. Before actually deploying, it is recommended to use the following command to check whether the configuration will do what you expect:

```
terraform plan
```

After reviewing the plan, run the following command to apply it:

```
terraform apply
```

The complete code can be found at [6]. For more content, please visit our website [7].

### **Testing**

The goal of testing is to prove that the cluster we created is usable. We use a simple case: with DistSQL, add two data sources and create a simple sharding rule, then insert data and verify that queries return the correct results.

By default we create the internal domain name `proxy.shardingsphere.org`; the username and password of the ShardingSphere-Proxy cluster are both root.

![image.png](https://dev-media.amazoncloud.cn/65851a971ff445ffad1a1ac98fc9e2d4_image.png "image.png")

Note: DistSQL (Distributed SQL) is ShardingSphere's own operation language. It is used in exactly the same way as standard SQL and provides SQL-level operation capabilities for ShardingSphere's incremental features; see [8] for details.

## **Summary**

Terraform is an effective tool for implementing IaC, and using Terraform to iterate on ShardingSphere-Proxy clusters is very useful. We hope this article helps everyone interested in ShardingSphere and Terraform.

## **References**

1. https://www.terraform.io/
2. https://www.hashicorp.com/
3. https://raw.githubusercontent.com/apache/shardingsphere-on-cloud/main/terraform/amazon/zk/cloud-init.yml
4. https://raw.githubusercontent.com/apache/shardingsphere-on-cloud/main/terraform/amazon/shardingsphere/cloud-init.yml
5. https://raw.githubusercontent.com/apache/shardingsphere-on-cloud/main/terraform/amazon/shardingsphere/cloudwatch-agent.json
6. https://github.com/apache/shardingsphere-on-cloud/tree/main/terraform/amazon
7. https://shardingsphere.apache.org/oncloud/current/en/overview/
8. https://shardingsphere.apache.org/document/current/cn/user-manual/shardingsphere-proxy/distsql/
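As a usage sketch for the deployment step: assuming the ZooKeeper and ShardingSphere-Proxy configurations are organized as two local modules (the directory names, IDs, and sizing values below are illustrative assumptions, not the actual layout; see [6] for the real code), a root configuration wiring them together might look like this:

```
module "zk" {
  source          = "./zk"                      # assumed module path
  cluster_size    = 3
  key_name        = "my-keypair"                # assumed existing EC2 key pair
  instance_type   = "t3.medium"
  vpc_id          = "vpc-0123456789abcdef0"     # placeholder VPC id
  subnet_ids      = ["subnet-a", "subnet-b", "subnet-c"]
  security_groups = ["sg-0123456789abcdef0"]
}

module "shardingsphere_proxy" {
  source                       = "./shardingsphere"   # assumed module path
  cluster_size                 = 3
  shardingsphere_proxy_version = "5.3.1"
  image_id                     = "ami-0123456789abcdef0"
  key_name                     = "my-keypair"
  instance_type                = "t3.large"
  vpc_id                       = "vpc-0123456789abcdef0"
  subnet_ids                   = ["subnet-a", "subnet-b", "subnet-c"]
  security_groups              = ["sg-0123456789abcdef0"]
  lb_listener_port             = "3307"
  # Feed the ZooKeeper domain names output by the zk module into the proxy module.
  zk_servers                   = module.zk.zk_node_domain
}
```

Passing `module.zk.zk_node_domain` into `zk_servers` is what connects the two clusters: the proxy instances register themselves in the ZooKeeper-based Governance Center via those domain names.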
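The test scenario described above can be sketched in DistSQL roughly as follows. This is a hedged sketch, not the exact commands from the article: the backend database URLs, credentials, table, and column names are assumptions, and DistSQL syntax varies by ShardingSphere version (recent versions use `REGISTER STORAGE UNIT`; older 5.x releases used `ADD RESOURCE`). It assumes a MySQL client connected to the proxy, e.g. via `proxy.shardingsphere.org`:

```
-- Register two data sources (illustrative hosts and credentials):
REGISTER STORAGE UNIT ds_0 (
    URL="jdbc:mysql://db0.example.com:3306/demo_ds_0",
    USER="root",
    PASSWORD="root"
), ds_1 (
    URL="jdbc:mysql://db1.example.com:3306/demo_ds_1",
    USER="root",
    PASSWORD="root"
);

-- Create a simple sharding rule for an example table:
CREATE SHARDING TABLE RULE t_order (
    STORAGE_UNITS(ds_0, ds_1),
    SHARDING_COLUMN=order_id,
    TYPE(NAME="hash_mod", PROPERTIES("sharding-count"="4"))
);

-- Insert data and verify that queries return correct results through the proxy:
CREATE TABLE t_order (order_id BIGINT PRIMARY KEY, user_id INT NOT NULL);
INSERT INTO t_order (order_id, user_id) VALUES (1, 1), (2, 2);
SELECT * FROM t_order;
```

If the `SELECT` returns the two rows just inserted, traffic is flowing through the load balancer to a healthy proxy instance and the sharding rule is being applied.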