{"value":"*This post was co-written by Lukonde Mwila, Principal Technical Evangelist at SUSE, an AWS Container Hero, and a HashiCorp Ambassador.*\n\n### **Introduction**\nCloud-native technologies are becoming increasingly ubiquitous, and Kubernetes is at the forefront of this movement. Today, Kubernetes is seeing widespread adoption across organizations in a variety of industries. When implemented properly, Kubernetes can help these organizations achieve higher availability, scalability, and resiliency for their workloads. Combining Kubernetes with the attributes of cloud computing, such as unparalleled scalability and elasticity, can help organizations enhance their containerized applications’ resiliency and availability.\n\nAs detailed [in this introductory post](https://aws.amazon.com/blogs/aws/introducing-karpenter-an-open-source-high-performance-kubernetes-cluster-autoscaler/), [Karpenter](https://karpenter.sh/)’s objective is to make sure that your cluster’s workloads have the compute they need, no more and no less, right when they need it.\n\nIn its most recent updates, [Karpenter](https://karpenter.sh/) added support for more advanced scheduling constraints, such as pod affinity and anti-affinity, topology spread, node affinity, node selection, and resource requests. This post will specifically delve into ```podAffinity```, ```podAntiAffinity```, and volume topology awareness and elaborate on the use cases that they’re best suited for.\n\n### **Prerequisites**\nTo carry out the examples in this post, you need to have [Karpenter](https://karpenter.sh/) installed in a Kubernetes cluster in AWS. We’ll be making use of [Amazon EKS](https://aws.amazon.com/eks/) for demonstration purposes. 
You can automate the process of provisioning an EKS cluster, with [Karpenter](https://karpenter.sh/) as an add-on, by making use of the Terraform [EKS blueprints](https://karpenter.sh/v0.12.0/getting-started/getting-started-with-terraform/).\n\n#### **Pod affinity and pod anti-affinity scheduling**\nApplying scheduling constraints to pods is implemented by establishing relationships between pods and specific nodes or between pods themselves. The latter is known as inter-pod affinity. Using inter-pod affinity, you assign rules that inform the scheduler’s approach in deciding which pod goes to which node based on their relation to other pods. Inter-pod affinity includes both pod affinity and pod anti-affinity.\n\nLike node affinity, this can be done using the rules ```requiredDuringSchedulingIgnoredDuringExecution``` and ```preferredDuringSchedulingIgnoredDuringExecution``` depending on your requirements. As the names imply, required and preferred are terms that represent how hard or soft the scheduling constraints should be. If the scheduling criteria for a pod are set to the required rule, then Kubernetes only places the pod on a node that satisfies the rule; if no such node exists, the pod stays pending. With the preferred rule, the scheduler favors the nodes that best match the preference, but it can still place the pod elsewhere if no node matches.\n\n**Pod affinity:** The ```podAffinity``` rule informs the scheduler to match pods that relate to each other based on their labels. If a new pod is created, then the scheduler takes care of searching the nodes for pods that match the label specification of the new pod’s label selector.\n\n**Pod anti-affinity:** In contrast, the ```podAntiAffinity``` rule allows you to prevent certain pods from running in the same topology domain (for example, the same node or AZ) if the matching label criteria are met.\n\nThese rules can be particularly helpful in various scenarios. For example, ```podAffinity``` can be used to co-locate pods in the same AZ or on the same node to support inter-dependencies and reduce network latency between services. 
On the other hand, ```podAntiAffinity``` is typically useful for preventing a single point of failure by spreading pods across AZs or nodes for high availability (HA). For such use cases, the topology domain for anti-affinity is typically the zone or the hostname. This can be implemented using the ```topologyKey``` property, which determines the searching scope of the cluster nodes. The ```topologyKey``` is the key of a label attached to a node.\n\nAn example of a ```podAntiAffinity``` implementation would be the [CoreDNS](https://docs.aws.amazon.com/eks/latest/userguide/managing-coredns.html) Deployment. Its [Deployment resource](https://github.com/coredns/deployment/blob/master/kubernetes/coredns.yaml.sed) has the ```podAntiAffinity``` policy to ensure that the scheduler runs the ```CoreDNS``` pods on different nodes for HA and to avoid [VPC DNS throttling](https://aws.amazon.com/premiumsupport/knowledge-center/vpc-find-cause-of-failed-dns-queries/). You’ll notice that the Deployment’s anti-affinity ```topologyKey``` is set to the node hostname label (```kubernetes.io/hostname```). In addition to this, ```podAntiAffinity``` can be used to give a pod or set of pods resource isolation on exclusive nodes, as well as to mitigate the risk of some pods interfering with the performance of others.\n\nUsing [Karpenter](https://karpenter.sh/) allows you to make sure that new compute provisioned for your cluster will satisfy these pod affinity rules as workloads scale, without configuring additional infrastructure. [Karpenter](https://karpenter.sh/) tracks unscheduled pods and will provision compute resources in accordance with the required or preferred affinity rules defined in your resource manifests.\n\n##### **Pod affinity example with Karpenter**\n\nIn this example, you’ll create a deployment resource with a ```podAffinity``` rule that requires scheduling the pods on nodes in the same AZ (availability zone). 
In the process, [Karpenter](https://karpenter.sh/) will interpret the requirements of the pods that need to be scheduled and provision nodes that allow for these affinity rules to be met in an optimal way.\n\nAs a starting point, you’ll need to create a Karpenter Provisioner on your cluster. The Provisioner is a custom resource (defined by a CRD) that details configuration specifications and parameters such as node types, labels, taints, tags, custom kubelet configurations, resource limits, and cluster connections via subnet and security group associations. The Provisioner manifest used in this example can be seen below.\n\n```\napiVersion: karpenter.sh/v1alpha5\nkind: Provisioner\nmetadata:\n  name: default\nspec:\n  # Requirements that constrain the parameters of provisioned nodes.\n  # These requirements are combined with pod.spec.affinity.nodeAffinity rules.\n  # Operators { In, NotIn } are supported to enable including or excluding values\n  requirements:\n    - key: \"karpenter.sh/capacity-type\" # If not included, the webhook for the AWS cloud provider will default to on-demand\n      operator: In\n      values: [\"spot\", \"on-demand\"]\n  # Resource limits constrain the total size of the cluster.\n  # Limits prevent Karpenter from creating new instances once the limit is exceeded.\n  limits:\n    resources:\n      cpu: 1000\n      memory: 1000Gi\n  provider:\n    subnetSelector:\n      karpenter.sh/discovery: alpha\n    securityGroupSelector:\n      karpenter.sh/discovery: alpha\n    tags:\n      karpenter.sh/discovery: alpha\n  ttlSecondsAfterEmpty: 30\n```\n\nYou can start by fetching all the nodes in your cluster using the ```kubectl get nodes``` command in your terminal. 
This will give you an idea of the existing nodes before [Karpenter](https://karpenter.sh/) launches new ones in response to the application you’ll deploy to the cluster shortly.\n\n```\nNAME STATUS ROLES AGE VERSION\nip-10-0-0-126.eu-west-1.compute.internal Ready <none> 2d2h v1.21.12-eks-5308cf7\nip-10-0-0-193.eu-west-1.compute.internal Ready <none> 2d2h v1.21.12-eks-5308cf7\nip-10-0-0-35.eu-west-1.compute.internal Ready <none> 47h v1.21.12-eks-5308cf7\nip-10-0-1-12.eu-west-1.compute.internal Ready <none> 2d2h v1.21.12-eks-5308cf7\nip-10-0-1-73.eu-west-1.compute.internal Ready <none> 2d2h v1.21.12-eks-5308cf7\nip-10-0-3-30.eu-west-1.compute.internal Ready <none> 2d2h v1.21.12-eks-5308cf7\n```\n\nAfter that, you can proceed to create a deployment resource with the following manifest:\n\n```\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: inflate\nspec:\n  replicas: 8\n  selector:\n    matchLabels:\n      app: inflate\n  template:\n    metadata:\n      labels:\n        app: inflate\n    spec:\n      affinity:\n        podAffinity:\n          requiredDuringSchedulingIgnoredDuringExecution:\n            - labelSelector:\n                matchExpressions:\n                  - key: app\n                    operator: In\n                    values:\n                      - inflate\n              topologyKey: \"topology.kubernetes.io/zone\"\n      terminationGracePeriodSeconds: 0\n      containers:\n        - name: inflate\n          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2\n          resources:\n            requests:\n              cpu: 1\n```\n\n[Karpenter](https://karpenter.sh/) will detect the unscheduled pods and provision a node that will help fulfill the inter-pod affinity requirements of this deployment:\n\n```\nNAME STATUS ROLES AGE VERSION\nip-10-0-0-126.eu-west-1.compute.internal Ready <none> 2d3h v1.21.12-eks-5308cf7\nip-10-0-0-193.eu-west-1.compute.internal Ready <none> 2d3h v1.21.12-eks-5308cf7\nip-10-0-0-35.eu-west-1.compute.internal Ready <none> 2d v1.21.12-eks-5308cf7\nip-10-0-1-12.eu-west-1.compute.internal Ready <none> 2d3h v1.21.12-eks-5308cf7\nip-10-0-1-233.eu-west-1.compute.internal Ready <none> 32m v1.21.12-eks-5308cf7 // New 
node\nip-10-0-1-73.eu-west-1.compute.internal Ready <none> 2d3h v1.21.12-eks-5308cf7\nip-10-0-3-30.eu-west-1.compute.internal Ready <none> 2d3h v1.21.12-eks-5308cf7\n```\n\nThe newly created node has the hostname ```ip-10-0-1-233.eu-west-1.compute.internal```.\n\nHere is a partial description of the new node:\n\n```\nName: ip-10-0-1-233.eu-west-1.compute.internal\nRoles: <none>\nLabels: beta.kubernetes.io/arch=amd64\n beta.kubernetes.io/instance-type=c6i.2xlarge\n beta.kubernetes.io/os=linux\n failure-domain.beta.kubernetes.io/region=eu-west-1\n failure-domain.beta.kubernetes.io/zone=eu-west-1b\n karpenter.sh/capacity-type=spot\n karpenter.sh/provisioner-name=default\n kubernetes.io/arch=amd64\n kubernetes.io/hostname=ip-10-0-1-233.eu-west-1.compute.internal\n kubernetes.io/os=linux\n node.kubernetes.io/instance-type=c6i.2xlarge\n topology.ebs.csi.aws.com/zone=eu-west-1b\n topology.kubernetes.io/region=eu-west-1\n topology.kubernetes.io/zone=eu-west-1b\nAnnotations: csi.volume.kubernetes.io/nodeid: {\"ebs.csi.aws.com\":\"i-00725be7dfa8ef814\"}\n node.alpha.kubernetes.io/ttl: 0\n volumes.kubernetes.io/controller-managed-attach-detach: true\n...\n```\n\nWe can then fetch the relevant pods using the appropriate label, ```app=inflate``` in this case, to review how the pods have been scheduled.\n\n```\nkubectl get pods -l app=inflate -o wide\nNAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES\ninflate-588d96b7f7-2lmzb 1/1 Running 0 104s 10.0.1.197 ip-10-0-1-233.eu-west-1.compute.internal <none> <none>\ninflate-588d96b7f7-gv68n 1/1 Running 0 104s 10.0.1.11 ip-10-0-1-233.eu-west-1.compute.internal <none> <none>\ninflate-588d96b7f7-jffqh 1/1 Running 0 104s 10.0.1.248 ip-10-0-1-233.eu-west-1.compute.internal <none> <none>\ninflate-588d96b7f7-ktjrg 1/1 Running 0 104s 10.0.1.81 ip-10-0-1-12.eu-west-1.compute.internal <none> <none>\ninflate-588d96b7f7-mhjrl 1/1 Running 0 104s 10.0.1.133 ip-10-0-1-233.eu-west-1.compute.internal <none> 
<none>\ninflate-588d96b7f7-vhjl7 1/1 Running 0 104s 10.0.1.21 ip-10-0-1-233.eu-west-1.compute.internal <none> <none>\ninflate-588d96b7f7-zb7l8 1/1 Running 0 104s 10.0.1.18 ip-10-0-1-73.eu-west-1.compute.internal <none> <none>\ninflate-588d96b7f7-zz2g4 1/1 Running 0 104s 10.0.1.207 ip-10-0-1-233.eu-west-1.compute.internal <none> <none>\n```\n\nAs you can see, six of the pods have been scheduled on the new node ```ip-10-0-1-233.eu-west-1.compute.internal```, whereas the other two have been scheduled to nodes in the same AZ (```eu-west-1b```) as per the ```topologyKey``` of the ```podAffinity``` rule. Since these nodes are all in the same AZ, they are part of the same topology, thus meeting the scheduling requirements.\n\n##### **Pod anti-affinity example with Karpenter**\n\nIn the second example, you’ll apply a preferred ```podAntiAffinity``` rule to spread the pods across nodes in different AZs where possible. As before, [Karpenter](https://karpenter.sh/) reads the pod requirements and launches nodes that support the ```podAntiAffinity``` configuration.\n\nSimilar to the previous example, start by fetching all the nodes in the cluster:\n\n```\nNAME STATUS ROLES AGE VERSION\nip-10-0-0-126.eu-west-1.compute.internal Ready <none> 2d2h v1.21.12-eks-5308cf7\nip-10-0-0-193.eu-west-1.compute.internal Ready <none> 2d2h v1.21.12-eks-5308cf7\nip-10-0-0-35.eu-west-1.compute.internal Ready <none> 47h v1.21.12-eks-5308cf7\nip-10-0-1-12.eu-west-1.compute.internal Ready <none> 2d2h v1.21.12-eks-5308cf7\nip-10-0-1-73.eu-west-1.compute.internal Ready <none> 2d2h v1.21.12-eks-5308cf7\nip-10-0-3-30.eu-west-1.compute.internal Ready <none> 2d2h v1.21.12-eks-5308cf7\n```\n\nAfter that, you can proceed to create a deployment resource.\n\n```\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: inflate\nspec:\n  replicas: 8\n  selector:\n    matchLabels:\n      app: inflate\n  template:\n    metadata:\n      labels:\n        app: inflate\n    spec:\n      affinity:\n        podAntiAffinity:\n          preferredDuringSchedulingIgnoredDuringExecution:\n            - weight: 50\n              podAffinityTerm:\n                labelSelector:\n                  matchExpressions:\n                    - key: app\n                      operator: In\n                      values:\n                        - inflate\n                topologyKey: \"topology.kubernetes.io/zone\"\n      terminationGracePeriodSeconds: 0\n      containers:\n        - name: inflate\n          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2\n          resources:\n            requests:\n              cpu: 1\n```\n\nIn response to the pod requirements, [Karpenter](https://karpenter.sh/) will provision a new node:\n\n```\nNAME STATUS ROLES AGE VERSION\nip-10-0-0-126.eu-west-1.compute.internal Ready <none> 2d3h v1.21.12-eks-5308cf7\nip-10-0-0-193.eu-west-1.compute.internal Ready <none> 2d3h v1.21.12-eks-5308cf7\nip-10-0-0-35.eu-west-1.compute.internal Ready <none> 2d v1.21.12-eks-5308cf7\nip-10-0-1-12.eu-west-1.compute.internal Ready <none> 2d3h v1.21.12-eks-5308cf7\nip-10-0-1-69.eu-west-1.compute.internal Ready <none> 73s v1.21.12-eks-5308cf7\nip-10-0-1-73.eu-west-1.compute.internal Ready <none> 2d3h v1.21.12-eks-5308cf7\nip-10-0-3-30.eu-west-1.compute.internal Ready <none> 2d3h v1.21.12-eks-5308cf7\n```\n\nThe newly created node has the hostname ```ip-10-0-1-69.eu-west-1.compute.internal```.\n\nHere is a partial description of the new node:\n\n```\nName: ip-10-0-1-69.eu-west-1.compute.internal\nRoles: <none>\nLabels: beta.kubernetes.io/arch=amd64\n beta.kubernetes.io/instance-type=c5.xlarge\n beta.kubernetes.io/os=linux\n failure-domain.beta.kubernetes.io/region=eu-west-1\n failure-domain.beta.kubernetes.io/zone=eu-west-1b\n karpenter.sh/capacity-type=spot\n karpenter.sh/provisioner-name=default\n kubernetes.io/arch=amd64\n kubernetes.io/hostname=ip-10-0-1-69.eu-west-1.compute.internal\n kubernetes.io/os=linux\n node.kubernetes.io/instance-type=c5.xlarge\n topology.ebs.csi.aws.com/zone=eu-west-1b\n topology.kubernetes.io/region=eu-west-1\n topology.kubernetes.io/zone=eu-west-1b\nAnnotations: csi.volume.kubernetes.io/nodeid: {\"ebs.csi.aws.com\":\"i-0617fbc688949e367\"}\n node.alpha.kubernetes.io/ttl: 
0\n volumes.kubernetes.io/controller-managed-attach-detach: true\n...\n```\n\nAs before, fetch the pods with the relevant label:\n\n```\nkubectl get pods -l app=inflate -o wide\nNAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES\ninflate-54cd576f79-88rt4 1/1 Running 0 101s 10.0.0.128 ip-10-0-0-35.eu-west-1.compute.internal <none> <none>\ninflate-54cd576f79-bpcc7 1/1 Running 0 101s 10.0.0.49 ip-10-0-0-126.eu-west-1.compute.internal <none> <none>\ninflate-54cd576f79-jdcfc 1/1 Running 0 101s 10.0.1.247 ip-10-0-1-12.eu-west-1.compute.internal <none> <none>\ninflate-54cd576f79-nh8zg 1/1 Running 0 101s 10.0.1.120 ip-10-0-1-69.eu-west-1.compute.internal <none> <none>\ninflate-54cd576f79-qm8dc 1/1 Running 0 101s 10.0.1.236 ip-10-0-1-69.eu-west-1.compute.internal <none> <none>\ninflate-54cd576f79-vvgkg 1/1 Running 0 101s 10.0.1.18 ip-10-0-1-73.eu-west-1.compute.internal <none> <none>\ninflate-54cd576f79-vzzjx 1/1 Running 0 101s 10.0.0.123 ip-10-0-0-193.eu-west-1.compute.internal <none> <none>\ninflate-54cd576f79-xtwkr 1/1 Running 0 101s 10.0.1.147 ip-10-0-1-69.eu-west-1.compute.internal <none> <none>\n```\n\nThree pods were placed on the new node, and the others are spread across nodes in the different AZs of the selected Region (```eu-west-1```) as per the affinity rule specification. Because the rule is preferred rather than required, the scheduler can still place more than one pod in the same AZ.\n\nNext, we’ll further explore [Karpenter](https://karpenter.sh/)’s use with another advanced scheduling technique, namely volume topology awareness.\n\n#### **Volume topology aware scheduling**\nBefore volume topology awareness, the processes of scheduling pods to nodes and dynamically provisioning volumes were independent. As you can imagine, this introduced the challenge of unpredictable outcomes for your workloads. For example, you might create a persistent volume claim that triggers the dynamic creation of a volume in a certain AZ (for example, ```eu-west-1a```), whereas the pod that needs to make use of the volume gets placed on a node in a separate AZ (```eu-west-1b```). 
As a result, the pod will fail to start.\n\nThis is especially problematic for your stateful workloads that rely on storage volumes to provide persistence of data. It would be inefficient, and contrary to the purpose of dynamic provisioning, to manually provision the storage volumes in the appropriate AZs. That’s where topology awareness comes in.\n\nTopology awareness complements dynamic provisioning by ensuring that pods are placed on the nodes that meet their topology requirements, in this case, storage volumes. The goal of topology-aware scheduling is to provide alignment between topology resources and your workloads. Thus, it gives you a more reliable and predictable outcome. This is handled by the Kubernetes scheduler together with the persistent volume controller and the storage provisioner. This means that your stateful workloads and the dynamically created persistent volumes are placed in the same AZs.\n\nTo use volume topology awareness, ensure that you set the ```volumeBindingMode``` of your storage class to ```WaitForFirstConsumer```. This setting delays the binding and provisioning of a persistent volume until a pod that uses the persistent volume claim is created.\n\nIn scaling events, [Karpenter](https://karpenter.sh/), the scheduler, and the storage provisioner work well together. In combination, they optimize the process of provisioning the right compute resources and align scheduled workloads with their dynamically created persistent volumes.\n\nThese technologies enable you to run and reliably scale stateful workloads in multiple AZs. You can thus spread your applications or databases in your cluster across zones to prevent a single point of failure in the case that an AZ is impacted. 
Because Amazon Elastic Block Store (Amazon EBS) volumes are AZ-specific, your workloads should be configured with ```nodeAffinity``` so that they are rescheduled in the same AZ where they were first scheduled, which allows their volumes to be reattached successfully.\n\n##### **Volume topology aware example with Karpenter**\n\nIn this example, you’ll create a stateful set for an application with 20 replicas, each with a persistent volume claim using a storage class with ```volumeBindingMode``` already set to ```WaitForFirstConsumer```. In addition, you’ll use a ```nodeAffinity``` rule that requires the pods to be scheduled in the ```eu-west-1a``` AZ.\n\nTo review the default storage class, run the ```kubectl get storageclass``` command:\n\n```\nNAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE\ngp2 (default) kubernetes.io/aws-ebs Delete WaitForFirstConsumer false 3d11h\n```\n\nNext, you’ll fetch the nodes in the respective AZs. Running ```kubectl get nodes -l topology.kubernetes.io/zone=eu-west-1a``` returns the following:\n\n```\nNAME STATUS ROLES AGE VERSION\nip-10-0-0-126.eu-west-1.compute.internal Ready <none> 4d23h v1.21.12-eks-5308cf7\nip-10-0-0-193.eu-west-1.compute.internal Ready <none> 4d23h v1.21.12-eks-5308cf7\nip-10-0-0-35.eu-west-1.compute.internal Ready <none> 4d21h v1.21.12-eks-5308cf7\n```\n\nWhereas, running ```kubectl get nodes -l topology.kubernetes.io/zone=eu-west-1b``` returns the following:\n\n```\nNAME STATUS ROLES AGE VERSION\nip-10-0-1-12.eu-west-1.compute.internal Ready <none> 4d23h v1.21.12-eks-5308cf7\nip-10-0-1-73.eu-west-1.compute.internal Ready <none> 4d23h v1.21.12-eks-5308cf7\nip-10-0-3-30.eu-west-1.compute.internal Ready <none> 4d23h v1.21.12-eks-5308cf7\n```\n\nOnce you have the layout for your nodes, you can proceed to create the stateful set and an accompanying load balancer service for the application.\n\n```\napiVersion: v1\nkind: Service\nmetadata:\n  name: express-nodejs-svc\nspec:\n  selector:\n    app: express-nodejs\n  type: LoadBalancer\n  ports:\n    - protocol: TCP\n      port: 8080\n      targetPort: 8080\n---\napiVersion: apps/v1\nkind: StatefulSet\nmetadata:\n  name: express-nodejs\nspec:\n  serviceName: express-nodejs-svc\n  replicas: 20\n  selector:\n    matchLabels:\n      app: express-nodejs\n  template:\n    metadata:\n      labels:\n        app: express-nodejs\n    spec:\n      affinity:\n        nodeAffinity:\n          requiredDuringSchedulingIgnoredDuringExecution:\n            nodeSelectorTerms:\n              - matchExpressions:\n                  - key: topology.kubernetes.io/zone\n                    operator: In\n                    values:\n                      - eu-west-1a\n      containers:\n        - name: express-nodejs\n          image: lukondefmwila/express-test:1.1.4\n          resources:\n            requests:\n              memory: \"512Mi\"\n              cpu: \"500m\"\n            limits:\n              memory: \"512Mi\"\n              cpu: \"500m\"\n          ports:\n            - containerPort: 8080\n              name: express-nodejs\n          volumeMounts:\n            - name: express-nodejs\n              mountPath: /data\n  volumeClaimTemplates:\n    - metadata:\n        name: express-nodejs\n      spec:\n        accessModes: [ \"ReadWriteOnce\" ]\n        storageClassName: gp2\n        resources:\n          requests:\n            storage: 10Gi\n```\n\nAfter deploying the application, persistent volumes are dynamically created in response to each claim from the stateful set replicas. Each one is created in the appropriate AZ, resulting in the successful creation of each pod replica. 
In conjunction, [Karpenter](https://karpenter.sh/) provisions new nodes in ```eu-west-1a``` to meet the compute requirements of the stateful set.\n\nNow when we run ```kubectl get nodes -l topology.kubernetes.io/zone=eu-west-1a```, we get the following response:\n\n```\nNAME STATUS ROLES AGE VERSION\nip-10-0-0-126.eu-west-1.compute.internal Ready <none> 5d v1.21.12-eks-5308cf7\nip-10-0-0-176.eu-west-1.compute.internal Ready <none> 4m18s v1.21.12-eks-5308cf7\nip-10-0-0-193.eu-west-1.compute.internal Ready <none> 5d v1.21.12-eks-5308cf7\nip-10-0-0-35.eu-west-1.compute.internal Ready <none> 4d21h v1.21.12-eks-5308cf7\nip-10-0-0-4.eu-west-1.compute.internal Ready <none> 6m12s v1.21.12-eks-5308cf7\nip-10-0-0-53.eu-west-1.compute.internal Ready <none> 8m8s v1.21.12-eks-5308cf7\nip-10-0-0-96.eu-west-1.compute.internal Ready <none> 2m28s v1.21.12-eks-5308cf7\n```\n\nAs you can see, [Karpenter](https://karpenter.sh/) has launched the following additional nodes:\n\n- ip-10-0-0-176.eu-west-1.compute.internal\n- ip-10-0-0-4.eu-west-1.compute.internal\n- ip-10-0-0-53.eu-west-1.compute.internal\n- ip-10-0-0-96.eu-west-1.compute.internal\n\n\nFurthermore, we reviewed the created persistent volume claims, persistent volumes, and pods by running the appropriate commands as shown in the following code:\n\n```\nkubectl get pv\nNAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE\npvc-07ad8f9f-e14f-4760-a908-58d2c48c49ac 10Gi RWO Delete Bound default/express-nodejs-express-nodejs-11 gp2 9m53s\npvc-0c3a949a-aa2f-4244-9988-d650b409698a 10Gi RWO Delete Bound default/express-nodejs-express-nodejs-3 gp2 13m\npvc-32dfc65e-9d10-42ea-a1c1-946f25500766 10Gi RWO Delete Bound default/express-nodejs-express-nodejs-7 gp2 11m\n\npvc-3cf7afb9-8bf4-4a0c-8064-cc97f0cdcbd5 10Gi RWO Delete Bound default/express-nodejs-express-nodejs-9 gp2 11m\npvc-3d8d0cb8-c5f4-43ad-a3a4-a95922a13c53 10Gi RWO Delete Bound default/express-nodejs-express-nodejs-0 gp2\n...\nkubectl get pvc\n\nNAME 
STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE\nexpress-nodejs-express-nodejs-0 Bound pvc-3d8d0cb8-c5f4-43ad-a3a4-a95922a13c53 10Gi RWO gp2 14m\nexpress-nodejs-express-nodejs-1 Bound pvc-af841cf3-634a-4314-bfd1-a1d6da9d4101 10Gi RWO gp2 13m\nexpress-nodejs-express-nodejs-10 Bound pvc-741c0d72-7311-4063-9d9b-15942f71a9a9 10Gi RWO gp2 10m\nexpress-nodejs-express-nodejs-11 Bound pvc-07ad8f9f-e14f-4760-a908-58d2c48c49ac 10Gi RWO gp2 10m\nexpress-nodejs-express-nodejs-12 Bound pvc-4a91fb0a-778b-42d2-b4b7-33d5d3dc8a87 10Gi RWO gp2 9m45s\n...\nkubectl get pods -l app=express-nodejs -o wide\nNAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES\nexpress-nodejs-0 1/1 Running 0 16m 10.0.0.174 ip-10-0-0-193.eu-west-1.compute.internal <none> <none>\nexpress-nodejs-1 1/1 Running 0 16m 10.0.0.30 ip-10-0-0-35.eu-west-1.compute.internal <none> <none>\nexpress-nodejs-10 1/1 Running 0 12m 10.0.0.235 ip-10-0-0-53.eu-west-1.compute.internal <none> <none>\nexpress-nodejs-11 1/1 Running 0 12m 10.0.0.137 ip-10-0-0-53.eu-west-1.compute.internal <none> <none>\nexpress-nodejs-12 1/1 Running 0 12m 10.0.0.161 ip-10-0-0-4.eu-west-1.compute.internal <none> <none>\n...\n```\n\n### **Cleanup**\nTo avoid incurring any additional costs, make sure you destroy all the infrastructure that you provisioned in relation to the examples detailed in this post.\n\n### **Conclusion**\nIn this post, we covered a hands-on approach to scaling Kubernetes with [Karpenter](https://karpenter.sh/), specifically for supporting advanced scheduling techniques with inter-pod affinity and volume topology awareness.\n\nTo learn more about [Karpenter](https://karpenter.sh/), you can read the [documentation](https://karpenter.sh/v0.10.1/) and join the community channel, #karpenter, in the [Kubernetes Slack workspace](https://slack.k8s.io/). 
Also, if you like the project, you can star it on [GitHub here](https://github.com/aws/karpenter).\n\n![59c9055aae4d4332a28e2bd482f2265e_image1.png](1)\n\n**Lukonde Mwila, Principal Technical Evangelist, SUSE**\n\nLukonde is a Principal Technical Evangelist at SUSE, an AWS Container Hero, and a HashiCorp Ambassador. He has years of experience in application development, solution architecture, cloud engineering, and DevOps workflows. He is a life-long learner and is passionate about sharing knowledge through various mediums. Nowadays, Lukonde spends the majority of his time providing content, training, and support in the Kubernetes ecosystem and SUSE’s open-source container management stack.\n\n![image.png](https://dev-media.amazoncloud.cn/a2efff38cea14753875974d603e10bd9_image.png)\n\n**Jeremy Cowan**\n\nJeremy Cowan is a Specialist Solutions Architect for containers at AWS, although his family thinks he sells \"cloud space\". Prior to joining AWS, Jeremy worked for several large software vendors, including VMware, Microsoft, and IBM. When he's not working, you can usually find him on a trail in the wilderness, far away from technology.","render":"<p><em>This post was co-written by Lukonde Mwila, Principal Technical Evangelist at SUSE, an AWS Container Hero, and a HashiCorp Ambassador.</em></p>\n<h3><a id=\"Introduction_2\"></a><strong>Introduction</strong></h3>\n<p>Cloud-native technologies are becoming increasingly ubiquitous, and Kubernetes is at the forefront of this movement. Today, Kubernetes is seeing widespread adoption across organizations in a variety of industries. When implemented properly, Kubernetes can help these organizations achieve higher availability, scalability, and resiliency for their workloads. 
Combining Kubernetes with the attributes of cloud computing, such as unparalleled scalability and elasticity, can help organizations enhance their containerized applications’ resiliency and availability.</p>\n<p>As detailed <a href=\"https://aws.amazon.com/blogs/aws/introducing-karpenter-an-open-source-high-performance-kubernetes-cluster-autoscaler/\" target=\"_blank\">in this introductory post</a>, <a href=\"https://karpenter.sh/\" target=\"_blank\">Karpenter</a>’s objective is to make sure that your cluster’s workloads have the compute they need, no more and no less, right when they need it.</p>\n<p>In its most recent updates, <a href=\"https://karpenter.sh/\" target=\"_blank\">Karpenter</a> added support for more advanced scheduling constraints, such as pod affinity and anti-affinity, topology spread, node affinity, node selection, and resource requests. This post will specifically delve into <code>podAffinity</code>, <code>podAntiAffinity</code>, and volume topology awareness and elaborate on the use cases that they’re best suited for.</p>\n<h3><a id=\"Prerequisites_9\"></a><strong>Prerequisites</strong></h3>\n<p>To carry out the examples in this post, you need to have <a href=\"https://karpenter.sh/\" target=\"_blank\">Karpenter</a> installed in a Kubernetes cluster in AWS. We’ll be making use of <a href=\"https://aws.amazon.com/eks/\" target=\"_blank\">Amazon EKS</a> for demonstration purposes. 
You can automate the process of provisioning an EKS cluster, with <a href=\"https://karpenter.sh/\" target=\"_blank\">Karpenter</a> as an add-on, by making use of the Terraform <a href=\"https://karpenter.sh/v0.12.0/getting-started/getting-started-with-terraform/\" target=\"_blank\">EKS blueprints</a>.</p>\n<h4><a id=\"Pod_affinity_and_pod_antiaffinity_scheduling_12\"></a><strong>Pod affinity and pod anti-affinity scheduling</strong></h4>\n<p>Applying scheduling constraints to pods is implemented by establishing relationships between pods and specific nodes or between pods themselves. The latter is known as inter-pod affinity. Using inter-pod affinity, you assign rules that inform the scheduler’s approach in deciding which pod goes to which node based on their relation to other pods. Inter-pod affinity includes both pod affinity and pod anti-affinity.</p>\n<p>Like node affinity, this can be done using the rules <code>requiredDuringSchedulingIgnoredDuringExecution</code> and <code>preferredDuringSchedulingIgnoredDuringExecution</code> depending on your requirements. As the names imply, required and preferred are terms that represent how hard or soft the scheduling constraints should be. If the scheduling criteria for a pod are set to the required rule, then Kubernetes only places the pod on a node that satisfies the rule; if no such node exists, the pod stays pending. With the preferred rule, the scheduler favors the nodes that best match the preference, but it can still place the pod elsewhere if no node matches.</p>\n<p><strong>Pod affinity:</strong> The <code>podAffinity</code> rule informs the scheduler to match pods that relate to each other based on their labels. 
If a new pod is created, then the scheduler takes care of searching the nodes for pods that match the label specification of the new pod’s label selector.</p>\n<p><strong>Pod anti-affinity:</strong> In contrast, the <code>podAntiAffinity</code> rule allows you to prevent certain pods from running in the same topology domain (for example, the same node or AZ) if the matching label criteria are met.</p>\n<p>These rules can be particularly helpful in various scenarios. For example, <code>podAffinity</code> can be used to co-locate pods in the same AZ or on the same node to support inter-dependencies and reduce network latency between services. On the other hand, <code>podAntiAffinity</code> is typically useful for preventing a single point of failure by spreading pods across AZs or nodes for high availability (HA). For such use cases, the topology domain for anti-affinity is typically the zone or the hostname. This can be implemented using the <code>topologyKey</code> property, which determines the searching scope of the cluster nodes. The <code>topologyKey</code> is the key of a label attached to a node.</p>\n<p>An example of a <code>podAntiAffinity</code> implementation would be the <a href=\"https://docs.aws.amazon.com/eks/latest/userguide/managing-coredns.html\" target=\"_blank\">CoreDNS</a> Deployment. Its <a href=\"https://github.com/coredns/deployment/blob/master/kubernetes/coredns.yaml.sed\" target=\"_blank\">Deployment resource</a> has the <code>podAntiAffinity</code> policy to ensure that the scheduler runs the <code>CoreDNS</code> pods on different nodes for HA and to avoid <a href=\"https://aws.amazon.com/premiumsupport/knowledge-center/vpc-find-cause-of-failed-dns-queries/\" target=\"_blank\">VPC DNS throttling</a>. You’ll notice that the Deployment’s anti-affinity <code>topologyKey</code> is set to the node hostname label (<code>kubernetes.io/hostname</code>). 
In addition to this, <code>podAntiAffinity</code> can be used to give a pod or set of pods resource isolation on exclusive nodes, and to mitigate the risk of some pods interfering with the performance of others.</p>\n<p>Using <a href=\"https://karpenter.sh/\" target=\"_blank\">Karpenter</a> allows you to make sure that new compute provisioned for your cluster will satisfy these pod affinity rules as workloads scale, without configuring additional infrastructure. <a href=\"https://karpenter.sh/\" target=\"_blank\">Karpenter</a> tracks unscheduled pods and will provision compute resources in accordance with the required or preferred affinity rules defined in your resource manifests.</p>\n<h5><a id=\"Pod_affinity_example_with_Karpenter_27\"></a><strong>Pod affinity example with Karpenter</strong></h5>\n<p>In this example, you’ll create a deployment resource with a <code>podAffinity</code> rule that requires scheduling the pods on nodes in the same Availability Zone (AZ). In the process, <a href=\"https://karpenter.sh/\" target=\"_blank\">Karpenter</a> will interpret the requirements of the pods that need to be scheduled and provision nodes that allow for these affinity rules to be met in an optimal way.</p>\n<p>As a starting point, you’ll need to install the Karpenter Provisioner on your cluster. The Provisioner is a custom resource that details configuration specifications and parameters such as node types, labels, taints, tags, custom kubelet configurations, resource limits, and cluster connections via subnet and security group associations. 
The Provisioner manifest used in this example can be seen below.</p>\n<pre><code class=\"lang-\">apiVersion: karpenter.sh/v1alpha5\nkind: Provisioner\nmetadata:\n  name: default\nspec:\n  # Requirements that constrain the parameters of provisioned nodes.\n  # These requirements are combined with pod.spec.affinity.nodeAffinity rules.\n  # Operators { In, NotIn } are supported to enable including or excluding values\n  requirements:\n    - key: "karpenter.sh/capacity-type" # If not included, the webhook for the AWS cloud provider will default to on-demand\n      operator: In\n      values: ["spot", "on-demand"]\n  # Resource limits constrain the total size of the cluster.\n  # Limits prevent Karpenter from creating new instances once the limit is exceeded.\n  limits:\n    resources:\n      cpu: 1000\n      memory: 1000Gi\n  provider:\n    subnetSelector:\n      karpenter.sh/discovery: alpha\n    securityGroupSelector:\n      karpenter.sh/discovery: alpha\n    tags:\n      karpenter.sh/discovery: alpha\n  ttlSecondsAfterEmpty: 30\n</code></pre>\n<p>You can start by fetching all the nodes in your cluster using the <code>kubectl get nodes</code> command in your terminal. 
This will give you an idea of the existing nodes before <a href=\"https://karpenter.sh/\" target=\"_blank\">Karpenter</a> launches new ones in response to the application you’ll deploy to the cluster shortly.</p>\n<pre><code class=\"lang-\">NAME STATUS ROLES AGE VERSION\nip-10-0-0-126.eu-west-1.compute.internal Ready <none> 2d2h v1.21.12-eks-5308cf7\nip-10-0-0-193.eu-west-1.compute.internal Ready <none> 2d2h v1.21.12-eks-5308cf7\nip-10-0-0-35.eu-west-1.compute.internal Ready <none> 47h v1.21.12-eks-5308cf7\nip-10-0-1-12.eu-west-1.compute.internal Ready <none> 2d2h v1.21.12-eks-5308cf7\nip-10-0-1-73.eu-west-1.compute.internal Ready <none> 2d2h v1.21.12-eks-5308cf7\nip-10-0-3-30.eu-west-1.compute.internal Ready <none> 2d2h v1.21.12-eks-5308cf7\n</code></pre>\n<p>After that, you can proceed to create a deployment resource with the following manifest:</p>\n<pre><code class=\"lang-\">apiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: inflate\nspec:\n  replicas: 8\n  selector:\n    matchLabels:\n      app: inflate\n  template:\n    metadata:\n      labels:\n        app: inflate\n    spec:\n      affinity:\n        podAffinity:\n          requiredDuringSchedulingIgnoredDuringExecution:\n            - labelSelector:\n                matchExpressions:\n                  - key: app\n                    operator: In\n                    values:\n                      - inflate\n              topologyKey: "topology.kubernetes.io/zone"\n      terminationGracePeriodSeconds: 0\n      containers:\n        - name: inflate\n          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2\n          resources:\n            requests:\n              cpu: 1\n</code></pre>\n<p><a href=\"https://karpenter.sh/\" target=\"_blank\">Karpenter</a> will detect the unscheduled pods and provision a node that will help fulfill the inter-pod affinity requirements of this deployment:</p>\n<pre><code class=\"lang-\">NAME STATUS ROLES AGE VERSION\nip-10-0-0-126.eu-west-1.compute.internal Ready <none> 2d3h v1.21.12-eks-5308cf7\nip-10-0-0-193.eu-west-1.compute.internal Ready <none> 2d3h v1.21.12-eks-5308cf7\nip-10-0-0-35.eu-west-1.compute.internal Ready <none> 2d 
v1.21.12-eks-5308cf7\nip-10-0-1-12.eu-west-1.compute.internal Ready <none> 2d3h v1.21.12-eks-5308cf7\nip-10-0-1-233.eu-west-1.compute.internal Ready <none> 32m v1.21.12-eks-5308cf7 // New node\nip-10-0-1-73.eu-west-1.compute.internal Ready <none> 2d3h v1.21.12-eks-5308cf7\nip-10-0-3-30.eu-west-1.compute.internal Ready <none> 2d3h v1.21.12-eks-5308cf7\n</code></pre>\n<p>The newly created node has the hostname <code>ip-10-0-1-233.eu-west-1.compute.internal</code>.</p>\n<p>Here is a partial description of the new node:</p>\n<pre><code class=\"lang-\">Name: ip-10-0-1-233.eu-west-1.compute.internal\nRoles: <none>\nLabels: beta.kubernetes.io/arch=amd64\n beta.kubernetes.io/instance-type=c6i.2xlarge\n beta.kubernetes.io/os=linux\n failure-domain.beta.kubernetes.io/region=eu-west-1\n failure-domain.beta.kubernetes.io/zone=eu-west-1b\n karpenter.sh/capacity-type=spot\n karpenter.sh/provisioner-name=default\n kubernetes.io/arch=amd64\n kubernetes.io/hostname=ip-10-0-1-233.eu-west-1.compute.internal\n kubernetes.io/os=linux\n node.kubernetes.io/instance-type=c6i.2xlarge\n topology.ebs.csi.aws.com/zone=eu-west-1b\n topology.kubernetes.io/region=eu-west-1\n topology.kubernetes.io/zone=eu-west-1b\nAnnotations: csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-00725be7dfa8ef814"}\n node.alpha.kubernetes.io/ttl: 0\n volumes.kubernetes.io/controller-managed-attach-detach: true\n...\n</code></pre>\n<p>We can then fetch the relevant pods using the appropriate label, <code>app=inflate</code> in this case, to review how the pods have been scheduled.</p>\n<pre><code class=\"lang-\">kubectl get pods -l app=inflate -o wide\nNAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES\ninflate-588d96b7f7-2lmzb 1/1 Running 0 104s 10.0.1.197 ip-10-0-1-233.eu-west-1.compute.internal <none> <none>\ninflate-588d96b7f7-gv68n 1/1 Running 0 104s 10.0.1.11 ip-10-0-1-233.eu-west-1.compute.internal <none> <none>\ninflate-588d96b7f7-jffqh 1/1 Running 0 104s 10.0.1.248 
ip-10-0-1-233.eu-west-1.compute.internal <none> <none>\ninflate-588d96b7f7-ktjrg 1/1 Running 0 104s 10.0.1.81 ip-10-0-1-12.eu-west-1.compute.internal <none> <none>\ninflate-588d96b7f7-mhjrl 1/1 Running 0 104s 10.0.1.133 ip-10-0-1-233.eu-west-1.compute.internal <none> <none>\ninflate-588d96b7f7-vhjl7 1/1 Running 0 104s 10.0.1.21 ip-10-0-1-233.eu-west-1.compute.internal <none> <none>\ninflate-588d96b7f7-zb7l8 1/1 Running 0 104s 10.0.1.18 ip-10-0-1-73.eu-west-1.compute.internal <none> <none>\ninflate-588d96b7f7-zz2g4 1/1 Running 0 104s 10.0.1.207 ip-10-0-1-233.eu-west-1.compute.internal <none> <none>\n</code></pre>\n<p>As you can see, six of the pods have been scheduled on the new node <code>ip-10-0-1-233.eu-west-1.compute.internal</code>, whereas the other two have been scheduled to nodes in the same AZ (<code>eu-west-1b</code>) as per the <code>topologyKey</code> of the <code>podAffinity</code> rule. Since these nodes are all in the same AZ, they are part of the same topology, thus meeting the scheduling requirements.</p>\n<h5><a id=\"Pod_antiaffinity_example_with_Karpenter_167\"></a><strong>Pod anti-affinity example with Karpenter</strong></h5>\n<p>In the second example, you’ll apply a <code>podAntiAffinity</code> rule to preferably schedule the pods across different nodes in the cluster, based on their AZs. 
As before, <a href=\"https://karpenter.sh/\" target=\"_blank\">Karpenter</a> reads the pod requirements and launches nodes that support the <code>podAntiAffinity</code> configurations.</p>\n<p>Similar to the previous example, start by fetching all the nodes in the cluster:</p>\n<pre><code class=\"lang-\">NAME STATUS ROLES AGE VERSION\nip-10-0-0-126.eu-west-1.compute.internal Ready <none> 2d2h v1.21.12-eks-5308cf7\nip-10-0-0-193.eu-west-1.compute.internal Ready <none> 2d2h v1.21.12-eks-5308cf7\nip-10-0-0-35.eu-west-1.compute.internal Ready <none> 47h v1.21.12-eks-5308cf7\nip-10-0-1-12.eu-west-1.compute.internal Ready <none> 2d2h v1.21.12-eks-5308cf7\nip-10-0-1-73.eu-west-1.compute.internal Ready <none> 2d2h v1.21.12-eks-5308cf7\nip-10-0-3-30.eu-west-1.compute.internal Ready <none> 2d2h v1.21.12-eks-5308cf7\n</code></pre>\n<p>After that, you can proceed to create a deployment resource.</p>\n<pre><code class=\"lang-\">apiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: inflate\nspec:\n  replicas: 8\n  selector:\n    matchLabels:\n      app: inflate\n  template:\n    metadata:\n      labels:\n        app: inflate\n    spec:\n      affinity:\n        podAntiAffinity:\n          preferredDuringSchedulingIgnoredDuringExecution:\n            - weight: 50\n              podAffinityTerm:\n                labelSelector:\n                  matchExpressions:\n                    - key: app\n                      operator: In\n                      values:\n                        - inflate\n                topologyKey: "topology.kubernetes.io/zone"\n      terminationGracePeriodSeconds: 0\n      containers:\n        - name: inflate\n          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2\n          resources:\n            requests:\n              cpu: 1\n</code></pre>\n<p>In response to the pod requirements, <a href=\"https://karpenter.sh/\" target=\"_blank\">Karpenter</a> will provision a new node:</p>\n<pre><code class=\"lang-\">NAME STATUS ROLES AGE VERSION\nip-10-0-0-126.eu-west-1.compute.internal Ready <none> 2d3h v1.21.12-eks-5308cf7\nip-10-0-0-193.eu-west-1.compute.internal Ready <none> 2d3h v1.21.12-eks-5308cf7\nip-10-0-0-35.eu-west-1.compute.internal Ready <none> 2d 
v1.21.12-eks-5308cf7\nip-10-0-1-12.eu-west-1.compute.internal Ready <none> 2d3h v1.21.12-eks-5308cf7\nip-10-0-1-69.eu-west-1.compute.internal Ready <none> 73s v1.21.12-eks-5308cf7\nip-10-0-1-73.eu-west-1.compute.internal Ready <none> 2d3h v1.21.12-eks-5308cf7\nip-10-0-3-30.eu-west-1.compute.internal Ready <none> 2d3h v1.21.12-eks-5308cf7\n</code></pre>\n<p>The newly created node has the hostname <code>ip-10-0-1-69.eu-west-1.compute.internal</code>.</p>\n<p>Here is a partial description of the new node:</p>\n<pre><code class=\"lang-\">Name: ip-10-0-1-69.eu-west-1.compute.internal\nRoles: <none>\nLabels: beta.kubernetes.io/arch=amd64\n beta.kubernetes.io/instance-type=c5.xlarge\n beta.kubernetes.io/os=linux\n failure-domain.beta.kubernetes.io/region=eu-west-1\n failure-domain.beta.kubernetes.io/zone=eu-west-1b\n karpenter.sh/capacity-type=spot\n karpenter.sh/provisioner-name=default\n kubernetes.io/arch=amd64\n kubernetes.io/hostname=ip-10-0-1-69.eu-west-1.compute.internal\n kubernetes.io/os=linux\n node.kubernetes.io/instance-type=c5.xlarge\n topology.ebs.csi.aws.com/zone=eu-west-1b\n topology.kubernetes.io/region=eu-west-1\n topology.kubernetes.io/zone=eu-west-1b\nAnnotations: csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0617fbc688949e367"}\n node.alpha.kubernetes.io/ttl: 0\n volumes.kubernetes.io/controller-managed-attach-detach: true\n...\n</code></pre>\n<p>As before, fetch the pods with the relevant label:</p>\n<pre><code class=\"lang-\">kubectl get pods -l app=inflate -o wide\nNAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES\ninflate-54cd576f79-88rt4 1/1 Running 0 101s 10.0.0.128 ip-10-0-0-35.eu-west-1.compute.internal <none> <none>\ninflate-54cd576f79-bpcc7 1/1 Running 0 101s 10.0.0.49 ip-10-0-0-126.eu-west-1.compute.internal <none> <none>\ninflate-54cd576f79-jdcfc 1/1 Running 0 101s 10.0.1.247 ip-10-0-1-12.eu-west-1.compute.internal <none> <none>\ninflate-54cd576f79-nh8zg 1/1 Running 0 101s 10.0.1.120 
ip-10-0-1-69.eu-west-1.compute.internal <none> <none>\ninflate-54cd576f79-qm8dc 1/1 Running 0 101s 10.0.1.236 ip-10-0-1-69.eu-west-1.compute.internal <none> <none>\ninflate-54cd576f79-vvgkg 1/1 Running 0 101s 10.0.1.18 ip-10-0-1-73.eu-west-1.compute.internal <none> <none>\ninflate-54cd576f79-vzzjx 1/1 Running 0 101s 10.0.0.123 ip-10-0-0-193.eu-west-1.compute.internal <none> <none>\ninflate-54cd576f79-xtwkr 1/1 Running 0 101s 10.0.1.147 ip-10-0-1-69.eu-west-1.compute.internal <none> <none>\n</code></pre>\n<p>Except for the three pods on the new node, the pods are dispersed to other nodes across the different AZs of the selected region (<code>eu-west-1</code>) as per the anti-affinity rule specification.</p>\n<p>Next, we’ll further explore <a href=\"https://karpenter.sh/\" target=\"_blank\">Karpenter</a>’s use with another advanced scheduling technique, namely volume topology awareness.</p>\n<h4><a id=\"Volume_topology_aware_scheduling_280\"></a><strong>Volume topology aware scheduling</strong></h4>\n<p>Before volume topology awareness, the processes of scheduling pods to nodes and dynamically provisioning volumes were independent. As you can imagine, this introduced the challenge of unpredictable outcomes for your workloads. For example, you might create a persistent volume claim that triggers the dynamic creation of a volume in a certain AZ (for example, <code>eu-west-1a</code>), whereas the pod that needs to make use of the volume gets placed on a node in a separate AZ (<code>eu-west-1b</code>). As a result, the pod will fail to start.</p>\n<p>This is especially problematic for your stateful workloads that rely on storage volumes to provide persistence of data. It would be inefficient, and counter to the purpose of dynamic provisioning, to manually provision the storage volumes in the appropriate AZs. 
That’s where topology awareness comes in.</p>\n<p>Topology awareness complements dynamic provisioning by ensuring that pods are placed on the nodes that meet their topology requirements, in this case, storage volumes. The goal of topology-aware scheduling is to provide alignment between topology resources and your workloads. Thus, it gives you a more reliable and predictable outcome. This is handled by the Kubernetes scheduler together with the volume provisioner: the scheduler takes a pod’s volume requirements into account when it selects a node, and the volume is then provisioned to match that placement. This means that your stateful workloads and the dynamically created persistent volumes are placed in the same AZs.</p>\n<p>To use volume topology awareness, ensure that you set the <code>volumeBindingMode</code> of your storage class to <code>WaitForFirstConsumer</code>. This setting delays the binding and provisioning of a persistent volume until a pod that uses the persistent volume claim is created.</p>\n<p>In scaling events, <a href=\"https://karpenter.sh/\" target=\"_blank\">Karpenter</a>, the scheduler, and the volume provisioner work well together. In combination, they optimize the process of provisioning the right compute resources and align scheduled workloads with their dynamically created persistent volumes.</p>\n<p>These technologies enable you to run and reliably scale stateful workloads in multiple AZs. You can thus spread your applications or databases in your cluster across zones to prevent a single point of failure in the case that an AZ is impacted. 
Considering that Amazon Elastic Block Store (Amazon EBS) volumes are AZ-specific, your workloads should be configured with <code>nodeAffinity</code> to ensure that they are provisioned in the same AZ where they were first scheduled for a successful reattachment.</p>\n<h5><a id=\"Volume_topology_aware_example_with_Karpenter_293\"></a><strong>Volume topology aware example with Karpenter</strong></h5>\n<p>In this example, you’ll create a stateful set for an application with 20 replicas, each with a persistent volume claim using a storage class with <code>volumeBindingMode</code> already set to <code>WaitForFirstConsumer</code>. In addition, you’ll use a <code>nodeAffinity</code> rule to require that the workloads are scheduled in the <code>eu-west-1a</code> AZ.</p>\n<p>To review the default storage class, run the <code>kubectl get storageclass</code> command:</p>\n<pre><code class=\"lang-\">NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE\ngp2 (default) kubernetes.io/aws-ebs Delete WaitForFirstConsumer false 3d11h\n</code></pre>\n<p>Next, you’ll fetch the nodes in the respective AZs. Running <code>kubectl get nodes -l topology.kubernetes.io/zone=eu-west-1a</code> returns the following:</p>\n<pre><code class=\"lang-\">NAME STATUS ROLES AGE VERSION\nip-10-0-0-126.eu-west-1.compute.internal Ready <none> 4d23h v1.21.12-eks-5308cf7\nip-10-0-0-193.eu-west-1.compute.internal Ready <none> 4d23h v1.21.12-eks-5308cf7\nip-10-0-0-35.eu-west-1.compute.internal Ready <none> 4d21h v1.21.12-eks-5308cf7\n</code></pre>\n<p>Whereas, running <code>kubectl get nodes -l topology.kubernetes.io/zone=eu-west-1b</code> returns the following:</p>\n<pre><code class=\"lang-\">NAME STATUS ROLES AGE VERSION\nip-10-0-1-12.eu-west-1.compute.internal Ready <none> 4d23h v1.21.12-eks-5308cf7\nip-10-0-1-73.eu-west-1.compute.internal Ready <none> 4d23h v1.21.12-eks-5308cf7\nip-10-0-3-30.eu-west-1.compute.internal Ready <none> 4d23h v1.21.12-eks-5308cf7\n</code></pre>\n<p>Once you have the layout 
for your nodes, you can proceed to create the stateful set and an accompanying load balancer service for the application.</p>\n<pre><code class=\"lang-\">apiVersion: v1\nkind: Service\nmetadata:\n  name: express-nodejs-svc\nspec:\n  selector:\n    app: express-nodejs\n  type: LoadBalancer\n  ports:\n    - protocol: TCP\n      port: 8080\n      targetPort: 8080\n---\napiVersion: apps/v1\nkind: StatefulSet\nmetadata:\n  name: express-nodejs\nspec:\n  serviceName: express-nodejs-svc\n  replicas: 20\n  selector:\n    matchLabels:\n      app: express-nodejs\n  template:\n    metadata:\n      labels:\n        app: express-nodejs\n    spec:\n      affinity:\n        nodeAffinity:\n          requiredDuringSchedulingIgnoredDuringExecution:\n            nodeSelectorTerms:\n              - matchExpressions:\n                  - key: topology.kubernetes.io/zone\n                    operator: In\n                    values:\n                      - eu-west-1a\n      containers:\n        - name: express-nodejs\n          image: lukondefmwila/express-test:1.1.4\n          resources:\n            requests:\n              memory: "512Mi"\n              cpu: "500m"\n            limits:\n              memory: "512Mi"\n              cpu: "500m"\n          ports:\n            - containerPort: 8080\n              name: express-nodejs\n          volumeMounts:\n            - name: express-nodejs\n              mountPath: /data\n  volumeClaimTemplates:\n    - metadata:\n        name: express-nodejs\n      spec:\n        accessModes: [ "ReadWriteOnce" ]\n        storageClassName: gp2\n        resources:\n          requests:\n            storage: 10Gi\n</code></pre>\n<p>After deploying the application, persistent volumes are dynamically created in response to each claim from the stateful set replicas. Each one is created in the appropriate AZ, resulting in the successful creation of each pod replica. 
In conjunction, <a href=\"https://karpenter.sh/\" target=\"_blank\">Karpenter</a> provisions new nodes in <code>eu-west-1a</code> to meet the compute requirements of the stateful set.</p>\n<p>Now when we run <code>kubectl get nodes -l topology.kubernetes.io/zone=eu-west-1a</code>, we get the following response:</p>\n<pre><code class=\"lang-\">NAME STATUS ROLES AGE VERSION\nip-10-0-0-126.eu-west-1.compute.internal Ready <none> 5d v1.21.12-eks-5308cf7\nip-10-0-0-176.eu-west-1.compute.internal Ready <none> 4m18s v1.21.12-eks-5308cf7\nip-10-0-0-193.eu-west-1.compute.internal Ready <none> 5d v1.21.12-eks-5308cf7\nip-10-0-0-35.eu-west-1.compute.internal Ready <none> 4d21h v1.21.12-eks-5308cf7\nip-10-0-0-4.eu-west-1.compute.internal Ready <none> 6m12s v1.21.12-eks-5308cf7\nip-10-0-0-53.eu-west-1.compute.internal Ready <none> 8m8s v1.21.12-eks-5308cf7\nip-10-0-0-96.eu-west-1.compute.internal Ready <none> 2m28s v1.21.12-eks-5308cf7\n</code></pre>\n<p>As you can see, <a href=\"https://karpenter.sh/\" target=\"_blank\">Karpenter</a> has launched the following additional nodes:</p>\n<ul>\n<li>ip-10-0-0-176.eu-west-1.compute.internal</li>\n<li>ip-10-0-0-4.eu-west-1.compute.internal</li>\n<li>ip-10-0-0-53.eu-west-1.compute.internal</li>\n<li>ip-10-0-0-96.eu-west-1.compute.internal</li>\n</ul>\n<p>Furthermore, we reviewed the created persistent volume claims, persistent volumes, and pods by running the appropriate commands as shown in the following code:</p>\n<pre><code class=\"lang-\">kubectl get pv\nNAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE\npvc-07ad8f9f-e14f-4760-a908-58d2c48c49ac 10Gi RWO Delete Bound default/express-nodejs-express-nodejs-11 gp2 9m53s\npvc-0c3a949a-aa2f-4244-9988-d650b409698a 10Gi RWO Delete Bound default/express-nodejs-express-nodejs-3 gp2 13m\npvc-32dfc65e-9d10-42ea-a1c1-946f25500766 10Gi RWO Delete Bound default/express-nodejs-express-nodejs-7 gp2 11m\n\npvc-3cf7afb9-8bf4-4a0c-8064-cc97f0cdcbd5 10Gi RWO Delete Bound 
default/express-nodejs-express-nodejs-9 gp2 11m\npvc-3d8d0cb8-c5f4-43ad-a3a4-a95922a13c53 10Gi RWO Delete Bound default/express-nodejs-express-nodejs-0 gp2\n...\nkubectl get pvc\nNAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE\nexpress-nodejs-express-nodejs-0 Bound pvc-3d8d0cb8-c5f4-43ad-a3a4-a95922a13c53 10Gi RWO gp2 14m\nexpress-nodejs-express-nodejs-1 Bound pvc-af841cf3-634a-4314-bfd1-a1d6da9d4101 10Gi RWO gp2 13m\nexpress-nodejs-express-nodejs-10 Bound pvc-741c0d72-7311-4063-9d9b-15942f71a9a9 10Gi RWO gp2 10m\nexpress-nodejs-express-nodejs-11 Bound pvc-07ad8f9f-e14f-4760-a908-58d2c48c49ac 10Gi RWO gp2 10m\nexpress-nodejs-express-nodejs-12 Bound pvc-4a91fb0a-778b-42d2-b4b7-33d5d3dc8a87 10Gi RWO gp2 9m45s\n...\nkubectl get pods -l app=express-nodejs -o wide\nNAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES\nexpress-nodejs-0 1/1 Running 0 16m 10.0.0.174 ip-10-0-0-193.eu-west-1.compute.internal <none> <none>\nexpress-nodejs-1 1/1 Running 0 16m 10.0.0.30 ip-10-0-0-35.eu-west-1.compute.internal <none> <none>\nexpress-nodejs-10 1/1 Running 0 12m 10.0.0.235 ip-10-0-0-53.eu-west-1.compute.internal <none> <none>\nexpress-nodejs-11 1/1 Running 0 12m 10.0.0.137 ip-10-0-0-53.eu-west-1.compute.internal <none> <none>\nexpress-nodejs-12 1/1 Running 0 12m 10.0.0.161 ip-10-0-0-4.eu-west-1.compute.internal <none> <none>\n...\n</code></pre>\n<h3><a id=\"Cleanup_447\"></a><strong>Cleanup</strong></h3>\n<p>To avoid incurring any additional costs, make sure you destroy all the infrastructure that you provisioned in relation to the examples detailed in this post.</p>\n<h3><a id=\"Conclusion_450\"></a><strong>Conclusion</strong></h3>\n<p>In this post, we covered a hands-on approach to scaling Kubernetes with <a href=\"https://karpenter.sh/\" target=\"_blank\">Karpenter</a>, specifically for supporting advanced scheduling techniques with inter-pod affinity and volume topology awareness.</p>\n<p>To learn more about <a href=\"https://karpenter.sh/\" 
target=\"_blank\">Karpenter</a>, you can read the <a href=\"https://karpenter.sh/v0.10.1/\" target=\"_blank\">documentation</a> and join the community channel, #karpenter, in the <a href=\"https://slack.k8s.io/\" target=\"_blank\">Kubernetes Slack workspace</a>. Also, if you like the project, you can star it on <a href=\"https://github.com/aws/karpenter\" target=\"_blank\">GitHub here</a>.</p>\n<p><img src=\"\" alt=\"59c9055aae4d4332a28e2bd482f2265e_image1.png\" rel=\"1\" /></p>\n<p><strong>Lukonde Mwila, Principal Technical Evangelist, SUSE</strong></p>\n<p>Lukonde is a Principal Technical Evangelist at SUSE, an AWS Container Hero, and a HashiCorp Ambassador. He has years of experience in application development, solution architecture, cloud engineering, and DevOps workflows. He is a life-long learner and is passionate about sharing knowledge through various mediums. Nowadays, Lukonde spends the majority of his time providing content, training, and support in the Kubernetes ecosystem and SUSE’s open-source container management stack.</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/a2efff38cea14753875974d603e10bd9_image.png\" alt=\"image.png\" /></p>\n<p><strong>Jeremy Cowan</strong></p>\n<p>Jeremy Cowan is a Specialist Solutions Architect for containers at AWS, although his family thinks he sells “cloud space”. Prior to joining AWS, Jeremy worked for several large software vendors, including VMware, Microsoft, and IBM. When he’s not working, you can usually find him on a trail in the wilderness, far away from technology.</p>\n"}