Cost optimization and resource optimization are two subjects that go hand in hand. That’s why this article focuses on how we utilize resources in a better way to achieve a significant cost reduction.
In the Delivery Hero Search Domain, we run our applications on GKE (Google Kubernetes Engine). Yet, our approach can be applied to any other environment running K8s. There are already great resources available to steer developers in the right direction to get the most out of GKE. One of them is Best practices for running cost-optimized Kubernetes applications on GKE, which explains how to optimize resources and avoid unnecessary costs. These best practices shaped the foundation of our methodology.
Resource optimization is best done at the design stage, so that maximum utilization can be achieved before things get complicated. Yet, it can also be applied to more mature systems. Keep in mind that, like any other optimization problem, the process can be repeated until the desired outcome is reached.
Even though there are many different aspects when it comes to optimizing resources, in our particular case there were three main efforts which helped us greatly: Pod Right-Sizing, Bin Packing and Pod & Node Autoscaling.
Pod Right-Sizing
In Kubernetes, pods are the smallest deployable units which can contain one or more containers. Depending on the needs of the application, resources are specified for each container. Subsequently, those pods are scheduled in Kubernetes Nodes.
Each application has specific needs depending on the programming language, lifecycle, concurrency etc. Some applications have start-up operations which cause high CPU consumption at the beginning, while some others have CPU fluctuations based on the time of the day. One thing is certain though: CPU and memory consumption don’t usually stay steady.
To balance out the fluctuations, it is not uncommon to request resources generously. Yet, there is a better and cheaper way to achieve similar behavior: right-sizing the pod and absorbing fluctuations with autoscaling, which I will elaborate on later in this article.
A good way to approach pod right-sizing is to understand the limits of the application. In most cases, applications can only consume CPU up to some limit, simply because there are other limiting factors such as the number of threads, I/O operations etc. Unless there is a memory leak or a GC issue, memory consumption should also have an upper limit.
Upper limits by themselves are not enough. The average consumption and the level of fluctuation are the main deciding factors for the resource requests. For example, if the application constantly consumes exactly one CPU core, it is enough to request one core with a slightly higher limit. On the other hand, if the application consumes one core on average, yet the consumption fluctuates between 200m and 5 cores, then a CPU request between 1 and 5 cores should be ideal. The exact value can be the 75th, 90th or a higher percentile of the consumption, depending on application needs.
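To make the percentile idea concrete, here is a minimal Python sketch. It is not the tooling we used. The sample values, the function names and the choice of a nearest-rank percentile are all assumptions made for illustration, as is the rule of never going below the average.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def suggest_cpu_request(usage_millicores, pct=90):
    """Suggest a CPU request in millicores.

    Assumption for this sketch: take the chosen percentile of observed
    usage, but never go below the average consumption.
    """
    average = math.ceil(sum(usage_millicores) / len(usage_millicores))
    return max(percentile(usage_millicores, pct), average)


# Hypothetical usage samples (millicores): one core on average,
# fluctuating between 200m and 5 cores.
samples = [200, 200, 300, 400, 500, 600, 700, 900, 1200, 5000]

for pct in (75, 90, 100):
    print(f"p{pct}: request {suggest_cpu_request(samples, pct)}m")
```

With these samples the 75th percentile falls below the average, so the average is used instead, while higher percentiles move the request toward the peak. Which percentile fits depends on how often the application spikes and how much throttling it tolerates.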
Finally, the last step is to treat resource types differently. CPU is a compressible resource: when a container hits its limit, it is throttled. Memory, on the other hand, is an incompressible resource: when a container reaches its limit, it is terminated with an OOM (Out of Memory) error.
This is why it is good practice to set the “CPU request” between the average and the maximum CPU consumption. The exact value depends on the frequency of CPU usage spikes (high percentiles). But what happens when the CPU request is exceeded? That’s where the second parameter, the “CPU limit”, plays an important role: it lets the container use CPU from the shared area within a Kubernetes node. The CPU limit should be set higher than the highest value ever observed. It is important to keep in mind that if there is not enough CPU in the shared area, applications bursting above their requests will be slowed down. We don’t need to worry about this so much if we are planning to use HPA though. Also, keep in mind that setting requests and limits to different values means the pod loses the Guaranteed QoS class and becomes Burstable.
Since memory is an incompressible resource, we don’t want to depend on shared memory, which is much less reliable than requested memory. For this reason, it is advisable to set the “memory request” and “memory limit” to the same value, which should be higher than the maximum observed memory usage.
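The rules above can be sketched in a few lines of Python. This is only an illustration: the function names, the 20% headroom and the input numbers are assumptions, and the QoS check is a simplified single-container version of the Kubernetes rule that compares quantities as plain strings.

```python
import math


def build_resources(cpu_request_m, cpu_max_m, mem_max_mib, headroom_pct=20):
    """Build a container `resources` mapping following the rules above.

    - CPU limit: above the highest observed CPU usage.
    - Memory request == memory limit: above the highest observed usage.
    The 20% headroom is an assumption for this sketch, not a recommendation.
    """
    cpu_limit_m = math.ceil(cpu_max_m * (100 + headroom_pct) / 100)
    mem_mib = math.ceil(mem_max_mib * (100 + headroom_pct) / 100)
    return {
        "requests": {"cpu": f"{cpu_request_m}m", "memory": f"{mem_mib}Mi"},
        "limits": {"cpu": f"{cpu_limit_m}m", "memory": f"{mem_mib}Mi"},
    }


def qos_class(resources):
    """Simplified QoS class for a single-container pod.

    Quantities are compared as strings, so "1" and "1000m" would count as
    different here even though Kubernetes treats them as equal.
    """
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    if not requests and not limits:
        return "BestEffort"
    # Kubernetes defaults a missing request to the limit of that resource.
    effective_requests = {**limits, **requests}
    guaranteed = all(
        res in limits and effective_requests[res] == limits[res]
        for res in ("cpu", "memory")
    )
    return "Guaranteed" if guaranteed else "Burstable"


# Hypothetical observations: request 1200m, peak 5000m CPU, peak 800 MiB memory.
resources = build_resources(cpu_request_m=1200, cpu_max_m=5000, mem_max_mib=800)
print(resources)
print(qos_class(resources))
```

Because the CPU request and limit differ, the resulting pod is Burstable. Setting the CPU request equal to the CPU limit would keep the Guaranteed class at the price of a larger request.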
By using the methodology above, we were able to decrease “CPU request” of some applications by ~90%, and “Memory request” by more than 50%. Our pods became much smaller, and resource utilization (100*used/requested) increased drastically without affecting stability.
Bin Packing
Nodes are the Kubernetes representation of VMs, and they make up the Compute Engine cost of GKE. That’s why it is crucial to maximize their utilization. By placing pods onto nodes compactly, the final cost can be reduced without any performance loss. This deliberate way of placing pods is called bin packing.
If you are already using GKE Autopilot, you might not need to be concerned with bin packing. As some Autopilot limitations didn’t allow us to use it, we implemented bin packing ourselves. To achieve the best compaction, one or more node pools can be defined where the nodes are tailored to fit pods snugly, yet with some buffer for the shared CPU. Bin packing is easy to plan when there aren’t many pods, or when the pods have similar sizes. Yet, in a more complex environment, you might find yourself making a lot of calculations to find the optimum placement strategy. A good starting point is to calculate the CPU-to-memory ratio of the pods involved, so that a matching machine family can be picked. In our case, switching from N1 Standard machines to N2D High Memory gave us a much better price-performance ratio. On top of that, we used preemptible nodes more often to further decrease costs.
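The kind of calculation involved can be sketched as follows. This is a planning aid under stated assumptions, not how the Kubernetes scheduler places pods and not our exact procedure: it uses the first-fit decreasing heuristic, and the pod names, pod sizes and node capacity are hypothetical. It also ignores affinity rules, system-reserved resources and the buffer for shared CPU.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    cpu_free_m: int      # remaining CPU in millicores
    mem_free_mib: int    # remaining memory in MiB
    pods: list = field(default_factory=list)


def pack_pods(pods, node_cpu_m, node_mem_mib):
    """First-fit decreasing: place (name, cpu_m, mem_mib) pods onto nodes."""
    nodes = []
    # Largest pods first, so small pods can fill the remaining gaps.
    for name, cpu_m, mem_mib in sorted(pods, key=lambda p: (p[1], p[2]), reverse=True):
        if cpu_m > node_cpu_m or mem_mib > node_mem_mib:
            raise ValueError(f"pod {name} does not fit on an empty node")
        for node in nodes:
            if node.cpu_free_m >= cpu_m and node.mem_free_mib >= mem_mib:
                break  # first node with enough room
        else:
            node = Node(node_cpu_m, node_mem_mib)  # open a new node
            nodes.append(node)
        node.cpu_free_m -= cpu_m
        node.mem_free_mib -= mem_mib
        node.pods.append(name)
    return nodes


def memory_per_core_gib(pods):
    """Total requested memory (GiB) divided by total requested CPU (cores)."""
    total_cores = sum(p[1] for p in pods) / 1000
    total_gib = sum(p[2] for p in pods) / 1024
    return total_gib / total_cores


# Hypothetical pod requests: (name, CPU in millicores, memory in MiB).
pods = [
    ("search-api", 1500, 12288),
    ("indexer", 1000, 8192),
    ("suggest", 500, 4096),
    ("ranker", 2000, 14336),
    ("cache", 500, 6144),
    ("gateway", 1000, 4096),
]

nodes = pack_pods(pods, node_cpu_m=3500, node_mem_mib=28000)
print(f"{memory_per_core_gib(pods):.2f} GiB per core")
for i, node in enumerate(nodes):
    print(i, node.pods, node.cpu_free_m, node.mem_free_mib)
```

In this example the pods request roughly 7.4 GiB of memory per core and fit on two nodes. A ratio that high suggests a memory-heavy machine family rather than a general-purpose one. Running the same calculation against several candidate node sizes is a cheap way to compare options before touching the cluster.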
Furthermore, there are some considerations to keep deployments healthy. Affinity and Anti-Affinity settings decide how pods will attract or repel each other. From a cost perspective, it might sound more appealing to run as many pods as possible in the same node. Yet, whenever higher availability is desired, anti-affinity settings can be used to place pods, at least the ones belonging to the same deployment, in different nodes and/or availability zones.
We used to have too many different node pools for different deployment types, which was causing under-utilized nodes in general. By merging those node pools and performing bin packing, we were able to fit our pods into nodes in a much more efficient way, such that we were able to decrease the total number of nodes by ~60% in our squad.
Pod & Node Autoscaling
Autoscaling is the final ingredient supporting our effort to request optimized resources. Its significance increases if the resource consumption is fluctuating over time. Without autoscaling, resources must be requested based on the peak usage – which results in unused resources most of the time. There are four different autoscaling mechanisms on GKE, yet we will focus on two of them: HPA (Horizontal Pod Autoscaler) and Cluster Autoscaler.
HPA ensures that the desired resource utilization is sustained for Deployments or StatefulSets over time. This is achieved by scaling the workloads up and down automatically. You can specify one or more metrics with a target value, where the metrics can be CPU usage, memory usage or custom metrics, depending on the application needs. Lower and upper bounds for the replica count are specified with “minReplicas” and “maxReplicas”. Keep in mind that HPA cannot scale below 1 replica, even though scaling to zero would be useful when the workload is not used at all, such as staging environments outside working hours.
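The core of the HPA decision is a simple ratio, documented by Kubernetes as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The Python sketch below reproduces that formula for a single metric. The function name and the example numbers are assumptions, and the real controller does more (multiple metrics, stabilization windows, pods that are not ready yet).

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Single-metric sketch of the HPA scaling rule.

    Utilization values are percentages of the requested resource, which is
    why right-sized requests matter for autoscaling as well.
    """
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to the target: no change
    else:
        desired = math.ceil(current_replicas * current_utilization / target_utilization)
    # Clamp to the configured minReplicas / maxReplicas bounds.
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical situations for a deployment targeting 60% CPU utilization.
print(desired_replicas(4, 90, 60))   # load above target: scale up
print(desired_replicas(4, 63, 60))   # within tolerance: keep replicas
print(desired_replicas(4, 30, 60))   # load below target: scale down
print(desired_replicas(8, 120, 60))  # capped by max_replicas
```

The 10% tolerance mirrors the controller's default and keeps the replica count from flapping on small deviations from the target.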
Having a varying number of pods necessitates a varying number of nodes. Whenever there is a need for a new node, the cluster autoscaler kicks in to create one. Similarly, whenever all the conditions to shrink a node pool are satisfied, the cluster autoscaler creates a plan to move pods around and drain a node. If the conditions are not satisfied, the autoscaler logs this event with the reason and retries after an interval. This is done for each node pool separately. Similar to HPA, it is also possible to specify a minimum and maximum node count for each node pool.
Even though at Delivery Hero Search we already had autoscaling capabilities in general, revisiting their configuration and enabling them for more applications helped us to create more stable, yet cheaper, deployments.
- It is easier to optimize resources at the design stage, before things get complicated
- Like any other optimization problem, resource optimization can be repeated in iterations until the desired outcome is obtained
- Pod Right-Sizing ensures pod level optimization
- Bin Packing ensures node level optimization
- Use preemptible nodes whenever availability requirements allow it
- Autoscaling ensures fluctuations are handled