The work life of a principal engineer in a dynamic company like Delivery Hero is varied and interesting. In this article, we run through a typical week.
Heya! I am Max, a Senior Principal Engineer at Delivery Hero. I am a Systems Engineer, which means I generally work on topics like cloud infrastructure, automation and Kubernetes, but as you become more senior on the IC track the work becomes somewhat more abstract.
Delivery Hero has a hybrid office environment, meaning you choose 2 days per week when you come into the office. On Mondays, I like to be at home and today is no exception. It’s an easier start to the week when I can get some focus time and plan the coming days. The company has historically had a decentralized structure that allowed brands to have a localized focus in their respective markets, something that companies like Uber Eats could never do. But with this structure came a certain degree of inefficiency, with each part of the company solving the same problems in parallel. To create more efficiency, there have been efforts to centralize and create a platform available for all developers in the company to run applications. With a company of our size, the scale of this internal developer platform will be quite large, so it’s very important to get the foundational design correct. The main task for me this week will be working on an RFC about workload tenancy of this new platform; this document details the structure of cloud accounts and Kubernetes clusters. An important part of a principal engineer’s role is also reviewing our ever-changing architecture, which takes many forms, but today I read through a design document from a data engineering team proposing to replace some database read replicas with a change data capture system based on Debezium. A very interesting approach that could save significant costs!
On Tuesday I go to the office. It can get quite busy, so I book a desk in advance. Today we have an in-person stand-up meeting and it’s spent mainly discussing Kubernetes upgrades, an ongoing topic for most Foundation teams. The never-ending process of Kubernetes version upgrades creates a lot of toil if you are not careful, and today we are discussing tools to automate this process as lately it has been consuming too much of our time. The conversation bounces between writing our own tool, using Terraform + Atlantis, or looking for something else. I spend the rest of the morning on some technical debt topics and catching up on Slack messages. As lunchtime nears, people start trying to coordinate plans. Our team will go to Shiso Burger, one of my favourites. It was closed for a very long period during COVID but is now open again and it’s very popular with Delivery Hero employees. In the afternoon I return to my RFC from yesterday for an hour or two, until Deepak Lewis hosts a talk about ingress controllers. Deepak’s talks are always detailed and interesting, and this one is about the shortcomings of the famous nginx-ingress controller. The main issues are that gRPC support seems bolted on, it does not support double-GOAWAY, and it lacks detailed metrics or logging in some situations. He has tentatively decided to test out Contour in smaller production environments; so far it looks very promising!
Wednesdays always start at 09:30 sharp for our Weekly Operations Meeting. Each week a different team details an incident that affected their application. The mood is always fun and humorous, and this week the meeting takes on a Super Mario theme, led by Mario. Other topics today include the announcement of a company-wide Hackathon, details of our internal mentoring program and celebrating some work anniversaries. This week’s deep dive is about an incident where an application feature flag was enabled on an incorrect version. This manifested as a serious performance degradation of one of our main services used by riders. The meeting operates with a strict blameless culture, so although this one was serious and orders were lost, the Q&A about the incident is positive and friendly. The main goal is to share learnings from incidents across teams. Later in the day, I spend some time with a few other engineers explaining why an EKS managed node group rotation can fail and what can be done about it, a problem we see occasionally when applications have strict pod disruption budgets (more on that below). Within the company, there are a number of guilds where people can meet to discuss a topic of interest. These include guilds for popular languages like Golang or Java and for other tools like Kubernetes, which is a guild I run. This afternoon it is the Python Guild meeting, organized by Adam, and I’m excited as we are hosting PyBerlin in our office. Adlet is giving a talk about Property-Based Testing In Python.
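About those strict pod disruption budgets: when a PDB currently allows zero disruptions, the drain step of a node group rotation cannot evict the protected pods, so the rotation stalls until someone steps in. Here is a minimal sketch of how you might list the PDBs in that state, assuming the official kubernetes Python client and a working kubeconfig (illustrative only, not a tool we actually run):

```python
# Illustrative sketch: find PodDisruptionBudgets that currently allow zero
# disruptions, i.e. the ones that will block a node drain during a rotation.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in a pod

policy = client.PolicyV1Api()
for pdb in policy.list_pod_disruption_budget_for_all_namespaces().items:
    status = pdb.status
    if status and status.disruptions_allowed == 0:
        print(
            f"{pdb.metadata.namespace}/{pdb.metadata.name}: 0 disruptions allowed "
            f"(healthy {status.current_healthy}/{status.desired_healthy})"
        )
```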
I am back in the office for Thursday and we have a Principal Council meeting today. The Principal Council is relatively new here and is a place for principal engineers from all over the company to coordinate on specific global topics and to focus efforts. In this meeting, we discuss updating the Delivery Hero Reliability Manifesto, which was first published in 2021, as well as formalizing an architecture review process, both big topics. We are still working out the kinks, like how the council should function and how we should coordinate and organize ourselves, but it’s looking very positive. I am preparing a proposal for the council to unify the way we store and index technical documents like RFCs, incident reports, ADRs etc. Currently, across the company, the solutions are mixed and inconsistent. I spend the rest of the day catching up on Slack, emails and pull request reviews, writing 360 feedback and taking advantage of being in the office by meeting people face to face. Later in the evening, we have an emergency incident: AWS services in the us-east-1 region have increased error rates. This impacts some of our workloads that access AWS services using credentials provided via the AWS STS service. As a temporary workaround, we implement a different method for getting credentials and thus bypass the faulty AWS STS service. It is tense for a moment, but in the end the impact is relatively low for us.
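I won’t go into our exact change here, but as a generic illustration of the pattern, a client can fall back to long-lived credentials when STS is unhealthy. A minimal boto3 sketch, with a hypothetical role ARN and environment variable names, and not necessarily the workaround we shipped:

```python
# Illustrative sketch only: fall back to static credentials when STS fails.
# The role ARN and env var names are hypothetical.
import os
import boto3
from botocore.exceptions import BotoCoreError, ClientError

def get_session() -> boto3.Session:
    try:
        creds = boto3.client("sts").assume_role(
            RoleArn=os.environ["APP_ROLE_ARN"],  # hypothetical env var
            RoleSessionName="app-session",
        )["Credentials"]
        return boto3.Session(
            aws_access_key_id=creds["AccessKeyId"],
            aws_secret_access_key=creds["SecretAccessKey"],
            aws_session_token=creds["SessionToken"],
        )
    except (BotoCoreError, ClientError):
        # STS is failing: use long-lived keys injected from a secret instead.
        return boto3.Session(
            aws_access_key_id=os.environ["FALLBACK_ACCESS_KEY_ID"],
            aws_secret_access_key=os.environ["FALLBACK_SECRET_ACCESS_KEY"],
        )
```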
TGIF and it’s a special Friday today. It feels like we’ve been waiting waaayyyy too long for the Berlin summer to arrive (like every year!) but today it feels like it’s finally here AND we have a Delivery Hero summer party after work. I have no meetings scheduled and want to focus on some coding. I want to write a small CLI tool to quickly check the health of a Kubernetes cluster. I imagine it would check for problems with core components like cluster-autoscaler, coredns and the CNI, then check for node problems like bad Conditions, then check for wider problems like pending or crashing pods or HPAs at their maximum. We have dashboards and alerts for many of these already, but they are scattered in different places and I often find, especially during an incident, that it’s time-consuming to gather all this information. I’m not sure the idea has merit, but I mute my Slack notifications, spend some hours on the task and see how it looks (there’s a rough sketch just below). After this, I return to my RFC document to answer questions posted and then start writing the incident report for last night’s AWS issues. There is a large volleyball tournament at the summer party later, so there’s some discussion in the office about making teams that feels somewhat nostalgic, like being in high school again. We are leaving to go to the party now, so here is where I sign off. Later!
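P.S. For anyone curious about the health-check idea, here is a rough sketch of the direction I have in mind, using the official kubernetes Python client. The specific checks are illustrative, not a finished tool:

```python
# Rough sketch of a cluster health check (illustrative, not production code).
# Assumes the official `kubernetes` Python client and a working kubeconfig.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
autoscaling = client.AutoscalingV1Api()

# 1. Node problems: Ready should be True, every other condition should not be.
for node in core.list_node().items:
    for cond in node.status.conditions or []:
        healthy = (cond.type == "Ready") == (cond.status == "True")
        if not healthy:
            print(f"node {node.metadata.name}: {cond.type}={cond.status} ({cond.reason})")

# 2. Pod problems: pending or crash-looping pods anywhere in the cluster.
for pod in core.list_pod_for_all_namespaces().items:
    if pod.status.phase == "Pending":
        print(f"pod {pod.metadata.namespace}/{pod.metadata.name} is Pending")
    for cs in pod.status.container_statuses or []:
        waiting = cs.state.waiting
        if waiting and waiting.reason == "CrashLoopBackOff":
            print(f"pod {pod.metadata.namespace}/{pod.metadata.name} is crash-looping")

# 3. HPAs pinned at their maximum replica count.
for hpa in autoscaling.list_horizontal_pod_autoscaler_for_all_namespaces().items:
    if hpa.status.current_replicas == hpa.spec.max_replicas:
        print(f"hpa {hpa.metadata.namespace}/{hpa.metadata.name} is at max replicas")
```

A real version would also need to look at core components like cluster-autoscaler and coredns, but this shows the general shape.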
If you like what you’ve read and you’re someone who wants to work on open, interesting projects in a caring environment, check out our full list of open roles here – from Backend to Frontend and everything in between. We’d love to have you on board for an amazing journey ahead.