Building a Global Experimentation Platform: The Technical Challenges

05.10.23 by Martin Luz

Fueling Data-Driven Decisions: Unveiling the Technical Marvel Behind Delivery Hero’s Global Experimentation Platform.

Much like other data-driven industry leaders, the prominent brands under the Delivery Hero umbrella (PedidosYa, Talabat, Foodpanda, Hungerstation) don’t make random decisions. Instead, all product decisions are rooted in empirical data to ensure that end-users have the desired experience within the application.

This is precisely where the need for an experimentation platform that facilitates the measurement of product feature impact becomes essential.

Constructing a platform like this brings forth numerous technical challenges.

Building a global experimentation platform

Let’s start by defining what A/B testing is.

A/B testing is a systematic and controlled experiment in which two or more versions of a webpage, app, email, or other content are presented to users or an audience. The primary goal of A/B testing is to assess which version performs better in terms of predefined metrics, such as user engagement, conversion rates, click-through rates, or any other relevant key performance indicators (KPIs).
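To make that concrete, here is a minimal sketch (in Python, with made-up numbers) of how the conversion rates of two variants could be compared with a two-proportion z-test. It illustrates the statistical idea behind evaluating a predefined metric; it is not FwF’s analysis engine.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Return the two-sided p-value for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)              # pooled conversion rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical numbers: control converts at 4.8%, variant at 5.3%.
p_value = two_proportion_z_test(conv_a=480, n_a=10_000, conv_b=530, n_b=10_000)
print(f"p-value = {p_value:.3f}")  # the smaller the p-value, the less likely the difference is chance
```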

Delivery Hero boasts an in-house experimentation platform called “Fun with Flags” (FwF). With over 1 billion requests per week, FwF is used by 90% of Delivery Hero’s brands (present in over 70 countries worldwide) to measure the impact of their changes, ensuring they have the expected effect on end-users.

FwF not only provides the platform for setting up these experiments but also offers over 10 SDKs in various programming languages, statistical analysis of experiments, result visualization, creation of custom metrics, and more.

Developing such a platform requires a diverse team of experts in Data Science, Software Engineering, Data Engineering, UI/UX, and Product Development.

In this blog post, we’ll delve into some of the technical challenges encountered while architecting this platform. If you want to learn how we started working on the Fun with Flags platform in the first place, this article might be helpful.

Scaling on Demand: Anything But Trivial

During peak hours, our service handles traffic surpassing 400k RPM (requests per minute) across its regions. This highlights the vital role that scalability plays in our service’s overall success.

To guarantee this scalability, we’ve deployed a distributed architecture across five strategic regions around the world. This strategy allows us to get closer to our global markets and provide a seamless user experience. However, achieving scalability in such a vast and diverse environment is a significant challenge. Let me show you how we do it.

Strategic Distribution

Our architecture is built on the idea of being near the users. This means we’ve strategically distributed our services across five regions worldwide. Each region is carefully selected to cater to specific markets, reducing latency and improving response times.

Global Load Balancer (GLB): An Intelligent Landing Strip

The Global Load Balancer (GLB) is our solution for directing incoming traffic. This intelligent system routes each request to the region that can respond fastest at that moment. While the fastest region is typically the closest one, the GLB also takes into account the current load and potential issues in each region.
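As a rough mental model, here is a simplified sketch of latency- and load-aware region selection. The region names, numbers, and scoring formula are hypothetical; the actual GLB is a managed routing layer, not application code like this.

```python
from dataclasses import dataclass

@dataclass
class Region:
    name: str
    latency_ms: float   # measured latency from the client's vantage point
    load: float         # current utilisation, 0.0 .. 1.0
    healthy: bool       # regions failing health checks are excluded entirely

def pick_region(regions: list[Region]) -> Region:
    """Prefer the lowest-latency healthy region, penalising heavily loaded ones."""
    candidates = [r for r in regions if r.healthy]
    # Simple score: latency inflated by load; real load balancers are far more sophisticated.
    return min(candidates, key=lambda r: r.latency_ms * (1 + r.load))

regions = [
    Region("eu-west", latency_ms=20, load=0.9, healthy=True),
    Region("us-east", latency_ms=35, load=0.3, healthy=True),
    Region("ap-south", latency_ms=120, load=0.1, healthy=True),
]
print(pick_region(regions).name)  # "eu-west" scores 38, "us-east" 45.5, "ap-south" 132
```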

Resilience and Fault Tolerance

Our architecture is resilient and fault-tolerant. Even if one region is overloaded or experiences issues, we can seamlessly redirect traffic to other regions. This failover capability ensures that our users always have access to our services.

AWS EKS for Enhanced Scalability

To further bolster our scalability efforts, we’ve harnessed the power of AWS Elastic Kubernetes Service (EKS). AWS EKS allows us to dynamically manage and scale our containerized applications with ease, ensuring that our system can seamlessly adapt to fluctuating demands. This integration adds a layer of efficiency and flexibility to our distributed architecture, ultimately contributing to our ability to handle peak-hour traffic and maintain a reliable user experience.

Every Millisecond Matters

In addition to the scalability of our service, response times play a pivotal role in providing timely decisions for user allocation within an A/B test. Our platform maintains 95th-percentile response times below 15 ms.

Stateless Service: Who Are You?

Our service is stateless with respect to users, meaning it doesn’t retain or store information about them. Each request is treated as if it came from a new user, which lets us avoid per-user database reads.

But how do we maintain the consistency of the responses we provide? How do we ensure that we allocate the same user to the same A/B test variation with each request?

Our allocation service utilizes a deterministic bucketing algorithm (same input, same output) based on the user ID (among other A/B test-related data) we receive in requests. With this algorithm, we guarantee that the same user ID will always generate the same bucket ID, ensuring allocation to the same experiment variation consistently.
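Here is a minimal sketch of the idea in Python. The hash function, bucket count, and split are illustrative rather than FwF’s actual implementation, but they show why the same user ID always resolves to the same variation without any state stored server-side.

```python
import hashlib

BUCKETS = 10_000  # granularity of the split

def bucket_for(user_id: str, experiment_key: str) -> int:
    """Deterministically map (user_id, experiment) to a bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % BUCKETS

def variation_for(user_id: str, experiment_key: str,
                  split: list[tuple[str, float]]) -> str:
    """Walk the cumulative split (e.g. 50/50) and return the matching variation."""
    bucket = bucket_for(user_id, experiment_key)
    threshold = 0.0
    for name, share in split:
        threshold += share * BUCKETS
        if bucket < threshold:
            return name
    return split[-1][0]  # guard against rounding

# The same user ID always lands in the same bucket, hence the same variation.
print(variation_for("user-42", "checkout-redesign", [("control", 0.5), ("treatment", 0.5)]))
```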

Dual Cache Layer: Boosting Performance and Efficiency

Managing the configurations of various experiments, including target rules, splitting rules, and segments, is a critical aspect of our system. In our setup, we store these configurations in durable storage within AWS RDS (Relational Database Service). This data is not only stored securely but also replicated across different availability regions for redundancy and disaster recovery.

The Cost of Database Reads

While having configurations in a reliable database is essential, retrieving data directly from a database can be relatively time-consuming. It’s often more efficient to access data from memory, and this is where our dual cache layer comes into play.

In-memory Cache (Pod)

The first layer of our caching strategy involves an in-memory cache, which resides within the application’s container or pod. This cache enables lightning-fast data reads, significantly reducing response times. Additionally, it offloads some of the demand from the second cache layer, preventing overloads and conserving valuable bandwidth.

AWS ElastiCache (Redis)

Our second cache layer relies on AWS ElastiCache, specifically using the Redis engine. Redis provides a distributed caching solution known for its high performance, with response times approximately 10x faster than direct database reads.
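Putting the two layers together, a simplified read-through flow might look like the sketch below. The Redis client usage is standard redis-py, but the key names, TTLs, and the RDS accessor are illustrative assumptions, not our production code.

```python
import json
import time
import redis  # assumes the redis-py client

local_cache: dict[str, tuple[float, dict]] = {}   # first layer: pod-local memory
LOCAL_TTL_SECONDS = 30

redis_client = redis.Redis(host="elasticache-endpoint", port=6379)  # second layer

def load_config_from_rds(experiment_key: str) -> dict:
    """Placeholder for the real RDS query; returns a dummy config here."""
    return {"key": experiment_key, "variations": ["control", "treatment"]}

def get_experiment_config(experiment_key: str) -> dict:
    """Read-through: in-memory cache -> ElastiCache (Redis) -> durable storage (RDS)."""
    now = time.time()

    # 1. Pod-local in-memory cache: fastest, and it shields Redis from hot keys.
    cached = local_cache.get(experiment_key)
    if cached and now - cached[0] < LOCAL_TTL_SECONDS:
        return cached[1]

    # 2. Distributed cache shared by all pods in the region.
    raw = redis_client.get(f"experiment:{experiment_key}")
    if raw is not None:
        config = json.loads(raw)
    else:
        # 3. Last resort: the relational database, then backfill the distributed cache.
        config = load_config_from_rds(experiment_key)
        redis_client.set(f"experiment:{experiment_key}", json.dumps(config), ex=300)

    local_cache[experiment_key] = (now, config)
    return config
```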

FwF Everywhere: Empowering A/B Testing Across the Board

A/B testing is the heartbeat of Delivery Hero’s product offerings, spanning mobile applications, web applications, backend services, and more.

That’s why we’ve gone the extra mile to provide Software Development Kits (SDKs) that simplify the integration process across more than 10 platforms and programming languages. These SDKs serve as the conduit between the platform where experiments take place and FwF, our feature flagging and experimentation service.

Within our SDKs, a world of capabilities unfolds:

Local Evaluation for Lightning-Fast Decisions

Some of our SDKs wear a dual hat: they can request experiment evaluations remotely from the FwF service and promptly deliver the responses, or they can request experiment configurations and perform evaluations locally. This dual approach slashes response times and minimizes network usage.
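From the caller’s perspective, the dual mode could be pictured roughly like this. The class, method names, and the `api` wrapper are hypothetical, not the actual FwF SDK API; the local path simply reuses the deterministic bucketing idea shown earlier.

```python
import hashlib

class FwfClientSketch:
    """Illustrative only: remote evaluation delegates each decision to the FwF
    service; local evaluation reuses a previously fetched configuration and
    decides in-process."""

    def __init__(self, api, local_config: dict | None = None):
        self.api = api                    # hypothetical thin HTTP wrapper
        self.local_config = local_config  # experiment configs fetched earlier

    def evaluate_remote(self, flag_key: str, user_id: str) -> str:
        # One network round trip per decision.
        return self.api.evaluate(flag_key, user_id)

    def evaluate_local(self, flag_key: str, user_id: str) -> str:
        # No round trip: bucket the user locally against the cached configuration.
        variations = self.local_config[flag_key]["variations"]
        digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
        return variations[int(digest, 16) % len(variations)]
```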

Attribute Change Subscription for Precise Control

User allocations in various experiment variations often hinge on user attributes like location, app version, and more. That’s where our SDKs shine, allowing you to subscribe to attribute changes. Whenever a relevant attribute transforms (say, a user changes location or app version), our SDKs automatically reevaluate related experiments.
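The subscription pattern itself can be sketched in a few lines; again, the names here are hypothetical rather than the real SDK interface.

```python
class AttributeSubscriptions:
    """Re-evaluate experiments whenever a relevant user attribute changes."""

    def __init__(self):
        self.attributes: dict[str, str] = {}
        self.listeners: dict[str, list] = {}   # attribute name -> callbacks

    def subscribe(self, attribute: str, callback) -> None:
        self.listeners.setdefault(attribute, []).append(callback)

    def set_attribute(self, attribute: str, value: str) -> None:
        changed = self.attributes.get(attribute) != value
        self.attributes[attribute] = value
        if changed:
            for callback in self.listeners.get(attribute, []):
                callback(self.attributes)      # e.g. trigger a fresh evaluation

subs = AttributeSubscriptions()
subs.subscribe("app_version", lambda attrs: print("re-evaluating with", attrs))
subs.set_attribute("app_version", "5.2.0")     # fires the re-evaluation callback
```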

Local Caching for Speed and Efficiency

For SDKs equipped with remote evaluation capabilities, local caching is the name of the game. They give you the option to store evaluation responses for multiple experiments right in the SDK’s local cache. This nifty feature not only conserves network requests but also shaves off those annoying response time delays. Plus, our SDKs are always on the lookout for any changes in experiment configurations, ensuring that the evaluations are always up-to-date.
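A simplified version of such a cache, invalidated both by a TTL and by configuration changes, might look like the following sketch; the field names and TTL are illustrative.

```python
import time

class EvaluationCache:
    """Cache evaluation responses per (flag, user); invalidate on config change."""

    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self.config_version = 0
        self.entries: dict[tuple[str, str], tuple[float, int, str]] = {}

    def get(self, flag_key: str, user_id: str) -> str | None:
        entry = self.entries.get((flag_key, user_id))
        if entry is None:
            return None
        stored_at, version, value = entry
        # Stale if the TTL elapsed or the experiment configuration has moved on.
        if time.time() - stored_at > self.ttl or version != self.config_version:
            return None
        return value

    def put(self, flag_key: str, user_id: str, value: str) -> None:
        self.entries[(flag_key, user_id)] = (time.time(), self.config_version, value)

    def on_config_change(self) -> None:
        self.config_version += 1   # existing entries become stale automatically
```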

Conclusion

In our journey through the technical landscape of Delivery Hero’s Global Experimentation Platform, we’ve encountered a world of challenges and solutions. This platform, “Fun with Flags” (FwF), empowers data-driven decision-making across Delivery Hero’s brands.

From A/B testing insights to handling peak-hour traffic and ensuring rapid response times, the platform showcases technical prowess. Scalability, strategic distribution, and AWS EKS form a robust foundation, while a dual cache layer optimizes database access.

The importance of response times and consistent user allocation is evident. SDKs democratize A/B testing, offering local evaluation, attribute change subscriptions, and local caching.

In essence, FwF illustrates technology’s power to enhance user experiences globally, conquering technical challenges for innovation.

The header image is generated by Stability AI. The prompt is “Experimentation platform infrastructure”.


If you like what you’ve read and you’re someone who wants to work on open, interesting projects in a caring environment, check out our full list of open roles here – from Backend to Frontend and everything in between. We’d love to have you on board for an amazing journey ahead.

Martin Luz
Software Engineering Manager