Every six months we do “Architecture Review” sessions with all tribe leads and principal engineers at Delivery Hero. The idea is to keep track of technical debt, capture system design learnings that can be applied across tribes, and make sure our architecture & reliability guidelines contain all the latest best practices.
There are a number of nice side effects: people get a unique chance to see an end-to-end view of our ever-growing landscape of microservices, we keep our diagrams up to date, we make sure names are unique and meaningful, we detect conflicts that need resolution, and we see technology trends across the whole of Delivery Hero. Finally, it is a great way to meet colleagues who care deeply about system architecture.
Why Architecture Matters
For me, a great system architecture is like city planning. To achieve a beautiful city with a high standard of living, it is important to think ahead and create an agreed map of where to build roads, subways, and residential zones. Obviously, a lot is happening at the same time at Delivery Hero, with thousands of microservices in production, and it sometimes feels hard to navigate this landscape of buildings, storage spaces, and communication lines. It is impossible for anyone to know all the details, but at a minimum, I believe it is important and useful to define shared principles, keep track of our debt, and enforce the creation of a high-quality map that makes it easier to navigate and plan.
- I run the sessions in an interview style, which makes it more interactive and engaging.
- Each tribe lead presents four slides in 15 minutes:
  - An up-to-date architecture diagram following our common representation standard, with major past and planned changes highlighted.
  - Suggested updates to the global technical debt tracker.
  - Interesting learnings generated in the last six months.
  - Proposed changes to our reliability manifesto or architecture guideline.
- Everyone is welcome to join.
In total this year, we resolved 37 technical debt items and added 50 new ones. We have 54 older items (the oldest from Q1 2018) that need to be resolved eventually.
Given the growth of our team, this is probably an acceptable balance. Still, we need to be careful that we don’t accumulate too much tech debt. That means either dedicating more resources to cleanup projects or being more careful about what we declare as technical debt. Often, it is acceptable to keep a service running on a language, infrastructure, or framework that is out of fashion, as long as that service does its job as expected and does not change too frequently.
Some observations from the last round of review sessions:
- On the frontend, it is all React now, plus Swift, Kotlin, and of course the new kid on the block: Dart/Flutter.
- Golang seems to be the clear favorite for many backend services. Java/Kotlin and Python have their loyal followers. On our platform, PHP, Ruby, Scala, and Node.js should be avoided.
- Postgres, Redis, Dynamo, and BigQuery are our DB powerhouses. Some MySQL is still around (and not a real problem), Elasticsearch is fine for some use cases (when used with lots of care), and Mongo seems to work (again, when used with care). Redshift is gone, as are things like Aerospike, Memcached, and CockroachDB.
- SQS/SNS, Kinesis, and Google Pub/Sub are reliably processing billions of messages for us. Kafka is used in more and more places; however, because it is more complex, the decision to use it should not be taken lightly.
- Wherever we start applications on AWS Lambda, we tend to rewrite them as containers sooner or later. In my view, Lambda and Cloud Functions are the right choice only for very specific use cases.
- We tend to be eager to split things into pieces, and in some places a bit too eager. Splitting services brings overhead in infrastructure, data synchronization, and API design. Obviously, we should split when there are compelling reasons, but we have to be thoughtful, especially when we are early in the life cycle and the business requirements are still evolving.
- Almost everywhere we used a backend-for-frontend pattern (except for very simple API gateway services), it led to trouble down the road. These services very easily violate the single-responsibility principle.
- Rewrite and deprecation projects take far longer than expected in pretty much all cases. We almost never had a correct estimate for any technical debt item. Starting is easy, finishing is hard.
- Good names are key. Services that don’t have a clear and easy-to-understand name typically point to an underlying problem.
- Many teams reported deficiencies in monitoring and alerting. Every service we run has to come with proper instrumentation. At our scale, we need to build defensively (especially anything that is on the critical path). Unified dashboards, incident escalation rules, runbooks, and so on are required from the start.
- When we split services, it’s typically a good idea to split critical path functionality from non-critical functionality. Splitting one tier 1 service into two (in the worst case chained) services actually increases the probability of incidents.
- It is great to see widespread adoption of our Operations Portal. This is 100 times better than every microservice figuring out on its own how to do user authentication and role management.
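
The observation above about splitting one tier 1 service into two chained services can be illustrated with a quick calculation. The sketch below is only a simplified model: it assumes each service fails independently and that a request succeeds only if every service in the chain is up. The 99.9% availability figure and the function names are made up for illustration and are not taken from our actual services.

```python
from math import prod


def chained_availability(availabilities):
    """Availability of a request path that needs every service in the chain.

    Assumes failures are independent, so the probabilities multiply.
    """
    return prod(availabilities)


def expected_downtime_minutes(availability, period_minutes=30 * 24 * 60):
    """Expected unavailable minutes over a period (default: 30 days)."""
    return (1.0 - availability) * period_minutes


# Hypothetical numbers: one tier 1 service at 99.9% availability,
# versus the same functionality split into two chained 99.9% services.
single = chained_availability([0.999])
chained = chained_availability([0.999, 0.999])

print(f"single service:  {single:.6f}, "
      f"{expected_downtime_minutes(single):.1f} min down per 30 days")
print(f"chained service: {chained:.6f}, "
      f"{expected_downtime_minutes(chained):.1f} min down per 30 days")
```

Under these assumptions, the chained setup roughly doubles the expected downtime of the critical path, unless the split makes each piece noticeably more reliable than the original service was.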