Dynamically overscaling a Kubernetes cluster with cluster-autoscaler and Pod Priority

25.04.19 by Miguel Angel Mingorance Fernandez

My name is Miguel Angel Mingorance Fernandez and I’m a Systems Engineer at Delivery Hero, working to ensure high availability in our foodora/foodpanda platform and providing solutions to improve the performance, capacity and resilience of our infrastructure.

Kubernetes is a very flexible system, but it is not perfect. When it comes to cluster scaling, the delay in spinning up new nodes adds latency in handling the very increase in requests that triggered the scaling. This creates a vicious circle, because we can’t handle the traffic quickly enough.

We in the DevOps team at Pandora (foodora, foodpanda, onlinepizza and pizza-online) have been running Kubernetes for some years now and have found it a very efficient way to scale our applications quickly as the number of requests increases. This allowed us to be reliable for our customers at the most critical times of the day: when people are hungry. However, as our platform grew, and the number of simultaneous orders with it, we started to face scaling issues. Not because of our applications or Kubernetes itself, but because of AWS. Eventually our applications were growing faster than our Kubernetes cluster could, due to the time it takes to spin up new EC2 instances and have them ready to join the cluster. As a result, we started to lose orders because our applications were stuck waiting to scale up. This was really bad, and we had to find a solution.

When considering how to keep our cluster ready to scale our apps quickly at all times, we decided to keep a fixed number of extra nodes in the cluster, ready and waiting to start new pods. This way, our apps would always be able to scale up without having to wait for AWS to create a new EC2 instance and join it to the Kubernetes cluster. Problem solved! The apps could always scale up without their pods getting stuck in the Pending state. That was exactly what we wanted.

However, this was not exactly the solution we wanted to have long term. Even though it was good enough for our apps to always be able to scale up, it was a very expensive solution.

Having 20 extra EC2 instances running all day, just waiting to start pods during peak times, made our costs much higher. Not a great deal in the end.

In search of excellence

The solution we implemented worked perfectly but, as said before, it was expensive. That’s why we started to search for excellence: a solution that would address both aspects, efficiency and cost.

In our research we found that the addon we already use to autoscale our Kubernetes cluster, the cluster-autoscaler, could also be used to overscale the cluster by taking advantage of a specific feature. This seemed to be the way to go for an excellent long-term solution.

How to overscale Kubernetes with the cluster-autoscaler

The way this is done is a bit tricky. Kubernetes has no native way to make the cluster run a few spare nodes or keep a specific amount of spare resources. However, we can use Kubernetes’ own scaling model to achieve this. A Kubernetes cluster grows in computing capacity based on pod resource requests: cluster-autoscaler adds nodes when pods cannot be scheduled because no node has enough free resources for them. Knowing this, we can create a “dummy” or “paused” deployment that runs as many pods as we want, each requesting a chosen amount of computing resources and therefore occupying space in the cluster. That’s how we can make our Kubernetes cluster grow and be overscaled. Having these pods occupy extra space gives us spare resources that are already provisioned and ready to be used at any time.
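
To make the idea concrete, here is a minimal Python sketch of how placeholder pods push the node count up. It is only a model: it packs pods first-fit, which is a simplification of what the real scheduler does, and the node size and pod requests are made-up numbers chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu_m: int   # CPU in millicores
    mem_mi: int  # memory in MiB


def nodes_needed(pods, node):
    """Count the nodes required to place every pod, using first-fit packing."""
    free = []  # remaining capacity of each node opened so far
    for pod in pods:
        for i, slot in enumerate(free):
            if pod.cpu_m <= slot.cpu_m and pod.mem_mi <= slot.mem_mi:
                free[i] = Resources(slot.cpu_m - pod.cpu_m, slot.mem_mi - pod.mem_mi)
                break
        else:
            # No existing node has room: this is where a new node would be added.
            if pod.cpu_m > node.cpu_m or pod.mem_mi > node.mem_mi:
                raise ValueError("pod does not fit on an empty node")
            free.append(Resources(node.cpu_m - pod.cpu_m, node.mem_mi - pod.mem_mi))
    return len(free)


NODE = Resources(cpu_m=4000, mem_mi=16000)  # hypothetical node size
app_pods = [Resources(1000, 2000)] * 8      # hypothetical application pods
placeholders = [Resources(3000, 7000)] * 2  # paused pods that only reserve space

print(nodes_needed(app_pods, NODE))                 # 2
print(nodes_needed(app_pods + placeholders, NODE))  # 4
```

In this toy setup the applications alone fit on two nodes, and the two placeholder pods force two more nodes into existence. Those two nodes are the spare capacity.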

However, how can we make use of that “spare” space if those pods are already using it? That’s where Pod Priority comes into play.

Using Pod Priority to manage our spare resources

Pod Priority has been available in Kubernetes as an alpha feature since version 1.8 and was promoted to beta in 1.11. It allows us to create priority classes for pods.

By using this feature we can create spare capacity in our cluster by running a deployment of paused pods with a lower priority than our apps. Whenever one of our applications needs to create a new pod and there is no space available in the cluster, the Kubernetes scheduler looks for a node where evicting lower-priority pods would free enough room, and the paused pods are the first candidates. Provided the eviction frees enough space, the application pod starts immediately. The deployment then recreates the evicted paused pods, and since there is no space left for them in the cluster, they stay pending, which triggers cluster-autoscaler to create a new node. Once the new instance joins the cluster, the paused pods run again and we have spare capacity available once more.
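
The cycle described above can be walked through with a small Python simulation. This is a toy model under simplifying assumptions, not the real scheduler: it only tracks CPU, picks victims by lowest priority first, and all names and sizes are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    priority: int
    cpu_m: int  # requested CPU in millicores


@dataclass
class Node:
    name: str
    cpu_m: int
    pods: list = field(default_factory=list)

    def free(self):
        return self.cpu_m - sum(p.cpu_m for p in self.pods)


def schedule(pod, nodes):
    """Place the pod, preempting lower-priority pods if necessary.

    Returns the list of preempted pods, or None if the pod stays pending.
    """
    for node in nodes:  # first try: a node with enough free room
        if node.free() >= pod.cpu_m:
            node.pods.append(pod)
            return []
    for node in nodes:  # second try: free room by evicting lower priorities
        victims = sorted((p for p in node.pods if p.priority < pod.priority),
                         key=lambda p: p.priority)
        freed, chosen = node.free(), []
        for victim in victims:
            if freed >= pod.cpu_m:
                break
            chosen.append(victim)
            freed += victim.cpu_m
        if freed >= pod.cpu_m:
            for victim in chosen:
                node.pods.remove(victim)
            node.pods.append(pod)
            return chosen
    return None


placeholder = Pod("placeholder", priority=-1, cpu_m=3000)
nodes = [Node("node-1", 4000, [Pod("app-1", 0, 1000), placeholder])]

# 1. A new application pod arrives and the node is full: the placeholder is evicted.
evicted = schedule(Pod("app-2", 0, 2000), nodes)
print([p.name for p in evicted])  # ['placeholder']

# 2. The placeholder no longer fits anywhere, so it stays pending.
pending = schedule(placeholder, nodes)
print(pending)  # None

# 3. The pending pod is what makes the autoscaler add a node; then it fits again.
nodes.append(Node("node-2", 4000))
placed = schedule(placeholder, nodes)
print(placed, [p.name for p in nodes[1].pods])  # [] ['placeholder']
```

The application pod never waits for a new instance. Only the placeholder does, and nobody cares how long it stays pending.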

How we implemented it at foodora/foodpanda

To implement this solution in our Kubernetes cluster, we had to make a couple of changes, because in Kubernetes 1.10 Pod Priority is an alpha feature and is not enabled by default.

Information about how to enable Pod Priority in Kubernetes 1.10 or 1.9 can be found here.

Note that Pod Priority has been enabled by default since Kubernetes 1.11.

After enabling Pod Priority in our cluster we had to re-deploy cluster-autoscaler with the following flag, so that it does not ignore the lower-priority paused pods when deciding whether to scale up: --expendable-pods-priority-cutoff=-10
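
As a rough illustration of what this cutoff means, the sketch below models the rule from the flag's description: pods with a priority below the cutoff are treated as expendable and do not cause a scale-up, while pods without a priority are never expendable. The function name is made up for this example and is not part of cluster-autoscaler.

```python
def is_expendable(pod_priority, cutoff):
    """Simplified model: a pod is expendable if its priority is below the cutoff.

    Pods with no priority set (None) are never treated as expendable.
    """
    return pod_priority is not None and pod_priority < cutoff


CUTOFF = -10  # value passed via --expendable-pods-priority-cutoff

for priority in (0, -1, -10, -11, None):
    print(priority, is_expendable(priority, CUTOFF))
```

With a cutoff of -10, a paused pod with priority -1 is not expendable, so a pending paused pod still makes the autoscaler add a node. Had the cutoff been 0, the same pod would have been ignored.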

Now we are ready to deploy 5 paused, low-priority pods, each scheduled on a different node and each requesting 3000m CPU and 7000Mi of memory. This gives us enough spare capacity to scale up several of our applications at the same time.
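
For a feel of how much headroom that is, here is the arithmetic as a short Python sketch. The application pod size used at the end (500m CPU, 1024Mi memory) is a hypothetical example, not one of our real workloads.

```python
REPLICAS = 5    # paused pods, one per node
CPU_M = 3000    # CPU request per paused pod, in millicores
MEM_MI = 7000   # memory request per paused pod, in MiB

total_cpu_cores = REPLICAS * CPU_M / 1000
total_mem_gib = REPLICAS * MEM_MI / 1024


def app_pods_that_fit(app_cpu_m, app_mem_mi):
    """How many application pods fit into the space reserved by the paused pods.

    Each paused pod sits on its own node, so the space is counted per pod
    and not as one big pool.
    """
    per_slot = min(CPU_M // app_cpu_m, MEM_MI // app_mem_mi)
    return REPLICAS * per_slot


print(f"{total_cpu_cores} cores reserved")        # 15.0 cores reserved
print(f"{total_mem_gib:.1f} GiB reserved")        # 34.2 GiB reserved
print(app_pods_that_fit(500, 1024), "app pods")   # 30 app pods
```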

To deploy the paused pods that reserve extra space in our Kubernetes cluster, we created a Helm chart that includes the following Kubernetes resources:

A new namespace only for overscaling
A priority class for the paused pods with a priority value of -1
A default priority class for all other pods with a priority value of 0
A deployment running the paused pods
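
The resources listed above could look roughly like the following. This is not our actual Helm chart but a minimal Python sketch that builds equivalent manifests and prints them as JSON, which kubectl also accepts. The resource names, the labels, the pause image tag and the use of the scheduling.k8s.io/v1 API (found on current clusters; older ones used alpha and beta versions) are all assumptions made for this example. Pod anti-affinity is one way to get each paused pod onto a different node.

```python
import json

NAMESPACE = "overscaling"              # hypothetical namespace name
LABELS = {"app": "overscaling-pause"}  # hypothetical label


def priority_class(name, value, global_default=False):
    """Build a PriorityClass manifest (cluster-scoped, so no namespace)."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": global_default,
    }


deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "overscaling-pause", "namespace": NAMESPACE},
    "spec": {
        "replicas": 5,
        "selector": {"matchLabels": LABELS},
        "template": {
            "metadata": {"labels": LABELS},
            "spec": {
                "priorityClassName": "overscaling",
                # Keep two paused pods from landing on the same node.
                "affinity": {"podAntiAffinity": {
                    "requiredDuringSchedulingIgnoredDuringExecution": [{
                        "labelSelector": {"matchLabels": LABELS},
                        "topologyKey": "kubernetes.io/hostname",
                    }],
                }},
                "containers": [{
                    "name": "pause",
                    "image": "registry.k8s.io/pause:3.9",  # assumed image tag
                    "resources": {"requests": {"cpu": "3000m", "memory": "7000Mi"}},
                }],
            },
        },
    },
}

manifests = [
    {"apiVersion": "v1", "kind": "Namespace", "metadata": {"name": NAMESPACE}},
    priority_class("overscaling", -1),                  # paused pods
    priority_class("default", 0, global_default=True),  # every other pod
    deployment,
]

print(json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2))
```

Saved as, say, overscale.py, the output could be piped into kubectl apply -f - to try it on a test cluster.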

As you can see, by running this paused deployment together with Pod Priority and cluster-autoscaler, we get a cluster that is dynamically overscaled based on workload. This dramatically reduces the time it takes to scale up our applications.