19/01/23

Creating an SRE Culture while preventing a 12 million order loss

By Wesley Costa

Back in 2019, we were in a race to constantly build new features while trying to juggle stability. During this phase, technical debt was piling up and the reliability of the platform was suffering. We had a “stability” meeting with all of the backend and infrastructure chapters EVERY morning to talk about the incidents we caused and what we were going to do next. I used to call this meeting “The ring of fire”.


Operation Hawk

We decided to call our Observability project ‘Operation Hawk’, because a hawk has better vision than a human. We had too many different observability tools, spread out among local squads. The goal of this project was to bring observability into one single place, while increasing ownership in local teams so that the data could be as trustworthy as possible.

The foundation of Operation Hawk was, and still is, the implementation of the Four Golden Signals mentioned in the Google SRE Book. However, before implementing it, we needed a new tool.
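For readers new to them, the four signals are latency, traffic, errors and saturation. The following is a minimal sketch of how they could be summarised for one observation window. It is not the tooling described in this post: the `Request` record, the `golden_signals` function, the choice of a 95th-percentile latency and the use of CPU utilisation as the saturation measure are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """One served request (hypothetical record shape, for illustration only)."""
    duration_ms: float
    status_code: int


def golden_signals(requests: List[Request], window_seconds: float,
                   cpu_utilisation: float) -> Dict[str, float]:
    """Summarise latency, traffic, errors and saturation for one window.

    Saturation is passed in directly because it comes from the host,
    not from the request log (assumed here to be CPU utilisation, 0..1).
    """
    if not requests:
        return {"latency_p95_ms": 0.0, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": cpu_utilisation}

    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile: smallest value with >= 95% of samples at or below it.
    rank = max(1, math.ceil(0.95 * len(durations)))
    latency_p95 = durations[rank - 1]

    # Count server-side failures (HTTP 5xx) as errors.
    errors = sum(1 for r in requests if r.status_code >= 500)

    return {
        "latency_p95_ms": latency_p95,
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "saturation": cpu_utilisation,
    }
```

Real setups usually let the monitoring backend compute these from exported metrics rather than from raw request lists, but the four quantities are the same.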

The Hunt

We wanted our observability data to be in one place, so we began the hunt for the right tool. At Delivery Hero, we only make architectural decisions through RFCs, so we ran a couple of RFCs and POCs until we found one that fit.

The Golden Path

Our mission as a team was to enable Heroes to achieve Operational Excellence by providing Best Practices, Observability and Governance throughout the application lifecycle. In practice, that meant lowering the adoption bar by offering a standard, self-service approach for every service or tool that we provide.

With that in mind, we created our SRE Framework.

The SRE Framework

At Delivery Hero, we invest time and effort into monitoring our services from day one.

We created the SRE Framework with various maturity levels, based on how far a squad has adopted SRE best practices. The SRE Framework creates a golden path to increase the reliability and stability of the platform while promoting the SRE culture in local teams and giving service owners the ownership and independence they need.

The SRE Framework is split into five Maturity Levels, numbered 0 to 4. Every squad starts at ‘Maturity Level 0’, where we provide an awesome list of resources for learning what SRE is. At Maturity Level 4, squads own the whole process of ‘how to SRE’ in their local teams.

“…And, as we all know, culture beats strategy every time”

One of Delivery Hero’s core values is “We always aim higher 🚀”.

We quickly learned that making it easy for our developers and stakeholders to do the right thing eases the path to adoption. We therefore spent time and effort making the golden signals and observability best practices the ‘easy option’ for our developers: instead of pointing them to resources they could use to create monitors themselves, we built monitoring directly into the modules used to create infrastructure. As a result, every service and its underlying dependencies had a fantastic observability stack ‘out of the box’, driving up the proportion of services covered by 100% and empowering engineers to own their own stack.

This is now the default approach for our solutions. We call it “Batteries Included”.

Batteries Included

Imagine you buy a toy for your child for Christmas. They rip the wrapping paper open excitedly to see what gift they have received. Their face lights up and they want to start playing immediately, but the toy needs two AA batteries. You go and find a packet (or take them from the TV remote). In that moment, the excitement of opening a new toy turns to frustration.

Toy manufacturers became aware of this and started to include batteries directly with their toys, resulting in happy kids and less frazzled parents. This is the ‘batteries included’ approach.

In product usability, mostly in software, ‘batteries included’ means that a product comes with all the parts required for full usability. For us, it means that local teams get all their observability out of the box when they onboard a service. Not only that: whenever a resource is created on AWS, it already has all the observability included.
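The idea can be sketched in a few lines. This is an illustration only, not the actual modules used at Delivery Hero: `Service`, `Monitor`, `create_service` and the monitor naming scheme are hypothetical names invented for this example. The point is that the caller never asks for monitoring; it arrives with the resource.

```python
from dataclasses import dataclass, field
from typing import List

# One default monitor per golden signal (assumed default set for this sketch).
DEFAULT_SIGNALS = ("latency", "traffic", "errors", "saturation")


@dataclass
class Monitor:
    """A monitor definition attached to a service (hypothetical shape)."""
    name: str
    signal: str


@dataclass
class Service:
    """A service together with the monitors that watch it."""
    name: str
    monitors: List[Monitor] = field(default_factory=list)


def create_service(name: str) -> Service:
    """Create a service with its golden-signal monitors already attached.

    Teams call this instead of building the service and its monitors
    separately, so coverage does not depend on anyone remembering to add it.
    """
    service = Service(name=name)
    for signal in DEFAULT_SIGNALS:
        service.monitors.append(Monitor(name=f"{name}-{signal}", signal=signal))
    return service


checkout = create_service("checkout")
print([m.name for m in checkout.monitors])
```

Because the defaults live in the creation path itself, a team can still add or tune monitors afterwards, but it cannot end up with none.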

Batteries included is now our approach at Delivery Hero.

Conclusion

With the right tools and data to create awareness about application performance, along with the underlying dependencies and costs, we were able to shift the Engineering Culture and improve our MTTD (Mean Time to Detect) and MTTR (Mean Time to Recovery) by 195% and 282% respectively. The overall reduction in minutes spent in incidents was around 327%.

To put that in perspective, Delivery Hero processes approximately 2 thousand orders per minute. If we multiply that rate by the incident minutes we avoided, we can see that the reduction of both MTTD and MTTR helped us prevent an order loss of more than twelve million orders in the last two years.
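Working backwards from the two figures above gives a feel for the scale. The sketch below assumes a constant order rate, which is a simplification since real traffic peaks around mealtimes, and it treats both inputs as exact even though they are stated as approximations. The function name is invented for this example.

```python
ORDERS_PER_MINUTE = 2_000       # approximate rate quoted above
ORDERS_PREVENTED = 12_000_000   # "more than twelve million", used as a lower bound


def incident_minutes_avoided(orders_prevented: int, orders_per_minute: int) -> float:
    """Incident minutes implied by a number of saved orders at a constant rate."""
    return orders_prevented / orders_per_minute


minutes = incident_minutes_avoided(ORDERS_PREVENTED, ORDERS_PER_MINUTE)
print(minutes)       # 6000.0 minutes
print(minutes / 60)  # 100.0 hours
```

Under those assumptions, twelve million orders corresponds to roughly 6,000 minutes, or about 100 hours, of incident time avoided over the two years.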

