27/05/20

Designing Your Kubernetes DNS EcoSystem

By Gabriel Ferreira

My name is Gabriel Ferreira. I’m a Senior Principal Engineer at Delivery Hero, and today I would like to share my experience working with the most critical and “tricky” component of k8s clusters: the DNS ecosystem.

Some time ago, after regular maintenance of one of our EKS K8S clusters, we started to receive alarms about the DNS resolution being slow, and after checking the CoreDNS logs, I could confirm the following errors:

As we can see, DNS is timing out when forwarding queries to our internal AWS DNS server, in this case IP 10.139.0.2.

A quick chat with AWS Support did not resolve the issue. From the AWS side, all EC2 limits looked fine (instance health and so on), but ever since we first faced this problem I had a gut feeling that it was related to the limit on DNS queries allowed to the AWS internal DNS server.

So I tried one change: I enabled the CoreDNS autopath plugin (around 6:00 Berlin time).

The latency went down.

Some errors were still present, but far fewer than before.

After this point, I could be sure that we were hitting some kind of internal AWS server limit.
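To illustrate why autopath can help, here is a minimal sketch of the client-side search-path expansion a pod typically performs. It assumes a common pod resolv.conf (a search list plus ndots:5); the namespace `default` and the node domain `eu-west-1.compute.internal` are made-up examples, not values from our cluster.

```python
from typing import List

# Assumed, typical pod resolver settings (hypothetical namespace and node domain).
SEARCH = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "eu-west-1.compute.internal",
]
NDOTS = 5


def candidate_queries(name: str, search: List[str] = SEARCH, ndots: int = NDOTS) -> List[str]:
    """Return the names a stub resolver may try, in order, until one resolves."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list is applied.
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then fall back to the search list.
        return [absolute] + expanded
    # Fewer dots than ndots: walk the search list first, then try the name as-is.
    return expanded + [absolute]


if __name__ == "__main__":
    for query in candidate_queries("api.example.com"):
        print(query)
```

With these assumed settings, a lookup for api.example.com can produce five candidate names on the client side, and resolvers often ask for both A and AAAA records for each. As I understand it, autopath lets CoreDNS complete the search path on the server side and answer on the first query, which cuts down this fan-out.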

After searching around for a bit, I finally found something useful.

It basically said:

BINGO, we found the problem.

Here’s how we solved it.

Basically, we have some really high-throughput applications, with hundreds or thousands of pods, that do not consume any internal Kubernetes services.

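For these applications, we were able to change the pod DNS policy (dnsPolicy) to “Default”, which essentially means: “The Pod inherits the name resolution configuration from the node that the pods run on.” These pods then send their queries from their own node instead of going through the CoreDNS pods, so in the end all DNS queries are spread across multiple ENIs.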

We made the change around 10:09 UTC (12:09 Berlin time).
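As a rough sketch of what such a change looks like, the snippet below builds a Deployment manifest with `dnsPolicy` set to `Default` and prints it as JSON, which kubectl accepts as well as YAML. The function name, the app name and the image are hypothetical placeholders, not our actual workloads.

```python
import json


def deployment_with_node_dns(name: str, image: str, replicas: int = 1) -> dict:
    """Build a minimal Deployment whose pods inherit the node's DNS configuration."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # "Default" means: use the node's resolv.conf, not the cluster DNS.
                    "dnsPolicy": "Default",
                    "containers": [{"name": name, "image": image}],
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical app name and image, for illustration only.
    manifest = deployment_with_node_dns("high-throughput-app", "example.com/app:1.0", 3)
    print(json.dumps(manifest, indent=2))
```

Keep in mind that, despite the name, “Default” is not what pods get when you leave the field unset (that is “ClusterFirst”), and pods running with “Default” can no longer resolve cluster-internal service names. That trade-off worked for us only because these applications do not consume internal Kubernetes services.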

To monitor this properly, consider using this Prometheus query:

max(rate(coredns_dns_request_duration_seconds_bucket{}[5m]))

That way, if DNS slows down for your pods, which can happen for any number of reasons, you will be aware of it.

The threshold is up to you. 🙂
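If you want to check the query from a script instead of a dashboard, a minimal sketch could look like the one below. It uses the Prometheus HTTP API instant-query endpoint (`/api/v1/query`); the base URL and the threshold in the usage comment are assumptions you would replace with your own values, and the helper names are mine.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# The query from this article.
QUERY = "max(rate(coredns_dns_request_duration_seconds_bucket{}[5m]))"


def query_url(base_url: str, query: str = QUERY) -> str:
    """Build the instant-query URL for a Prometheus server."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def max_sample(response: dict) -> float:
    """Largest sample value in an instant-query vector result (0.0 if empty)."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    samples = response["data"]["result"]
    # Each sample looks like {"metric": {...}, "value": [timestamp, "value-as-string"]}.
    return max((float(sample["value"][1]) for sample in samples), default=0.0)


def should_alert(response: dict, threshold: float) -> bool:
    """True when the queried value is above the threshold you picked."""
    return max_sample(response) > threshold


def check(base_url: str, threshold: float) -> bool:
    """Run the query against a live Prometheus server and compare to the threshold."""
    with urlopen(query_url(base_url), timeout=10) as resp:
        return should_alert(json.load(resp), threshold)


# Example (assumed address and threshold):
# check("http://prometheus.example.internal:9090", threshold=100.0)
```

In practice you would more likely put the same expression into an alerting rule, but a small script like this is handy for trying out thresholds.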

It’s really important that you think about the relation of your DNS and your apps in advance, so you can avoid these kinds of complications in the future.

I also strongly recommend running dnsmasq at the node level.

I hope that our experience has brought some clarity to those facing similar issues and that you found this article helpful!


As always, if you’re interested in pursuing a career at Delivery Hero, have a look at some of our open positions!