Discover our innovative approach to chaos testing: a journey from traditional methods to a dynamic, production-like environment. Learn how we enhance resilience, conduct comprehensive experiments, and prepare for any scenario.
In this article, we aim to guide you through our previous load and chaos testing setup (which you can read about in an old post here!). We will then highlight the numerous improvements we have implemented over the past year and introduce the new options now at our disposal to enhance the reliability and resilience of our services.
Started from the Bottom…
Well, not exactly the bottom, but you get the idea.
Consider this a TL;DR of our previous approach. For a more detailed description, please check out the previous post before continuing -> Link.
Our original chaos testing method involved repurposing our existing load testing setup. We incorporated Toxiproxy to trigger failure scenarios such as packet loss, added latency, and outright connection drops. Additionally, we used Python and Bash scripting to simulate other scenarios, such as database failovers in AWS. Everything was neatly integrated into execution pipelines, allowing teams to trigger tests manually or on a schedule.
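To give a flavour of that setup, here is a minimal sketch of how a latency toxic can be added through Toxiproxy's HTTP admin API from Python. The proxy name, ports, and latency values are hypothetical; in our pipelines, calls like these were wrapped into the load test scripts:

```python
import requests

# Toxiproxy's admin API listens on port 8474 by default.
TOXIPROXY = "http://localhost:8474"

# Create a proxy in front of a (hypothetical) downstream dependency.
requests.post(f"{TOXIPROXY}/proxies", json={
    "name": "orders_db",                   # hypothetical proxy name
    "listen": "0.0.0.0:5433",              # where the application connects
    "upstream": "orders-db.internal:5432"  # the real dependency behind the proxy
}).raise_for_status()

# Inject 1s of latency (with 250ms jitter) on traffic flowing to the upstream.
requests.post(f"{TOXIPROXY}/proxies/orders_db/toxics", json={
    "type": "latency",
    "stream": "downstream",
    "attributes": {"latency": 1000, "jitter": 250},
}).raise_for_status()

# Removing the proxy (or just the toxic) restores normal behaviour after the test.
requests.delete(f"{TOXIPROXY}/proxies/orders_db").raise_for_status()
```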
While this approach was effective for a time and still has its uses, it also had clear limitations that needed addressing:
- The applications were tested in isolation with all dependencies mocked.
- We couldn’t assess the impact of a failure in one application on others, as everything was isolated.
- There was no centralized repository for experiments. Although we provided Terraform files for teams to create the necessary pipelines, this led to redundancy and made maintenance more challenging than necessary.
Now We (are) Here
In December 2022, our team initiated plans for a new and innovative approach to chaos testing. This strategy was designed to overcome some of the limitations we previously faced and to expand our testing capabilities.
Limitations and Approach
One significant limitation we faced was the inability to run chaos tests in production. Understandably, there’s a cautious approach to this – as confident as we are in our system’s robustness (knocking on wood here), the last thing we want is to unintentionally disrupt services, like preventing our families from ordering dinner.
So, we opted for the next best solution. We created an environment so closely resembling our production system that it’s almost indistinguishable – I could probably order a burger from it, and it would taste just as good.
Key features of this environment include:
- Integration of all critical applications within our tribe, encompassing everything necessary for restaurant creation and operation, order delivery (of all types), and front-end applications across Android, web, and Windows clients.
- Minimal use of mocks, ensuring applications interact as they would in production. This extends to databases, caches, and gateways. For applications beyond our tribe’s control, we use Wiremock to simulate their behaviour (see the sketch after this list).
- Traffic levels and patterns are carefully calibrated to mirror those in the production environment, including request handling latency.
- The same alerting system is used. If a scenario like a pod crash or high error rate triggers a Slack alert in production, it will do the same in our test environment.
- Cost-effectiveness is prioritized by operating this environment only during business hours as needed.
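As a rough illustration of the “minimal mocks” point above, external dependencies can be stubbed by posting a mapping to Wiremock’s admin API. The host, endpoint path, payload, and delay below are made-up values, not our real configuration:

```python
import requests

# Wiremock exposes its admin API under /__admin on the instance serving the stubs.
WIREMOCK = "http://wiremock.chaos.internal:8080"  # hypothetical host

# Stub a third-party endpoint our services call, including a realistic response
# delay so that latency behaviour stays close to what we see in production.
stub = {
    "request": {"method": "GET", "urlPath": "/partners/v1/availability"},  # hypothetical path
    "response": {
        "status": 200,
        "jsonBody": {"available": True},
        "fixedDelayMilliseconds": 150,
    },
}
requests.post(f"{WIREMOCK}/__admin/mappings", json=stub).raise_for_status()
```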
Now, you might be wondering, “Is this really comparable to a production environment? Can insights from this chaos environment be effectively applied to production?”
And the answer to that is… Yes.
There are only two major distinctions between our production environment and our chaos environment:
- Operational Hours: The chaos environment is operational for only a limited number of hours each day.
- Number of Connected Devices (Restaurants): In the chaos environment, we don’t need the tens of thousands of connected restaurants that production has. Instead, we maintain a number sufficient to bypass all caching. These fewer restaurants generate a higher volume of orders per minute, ensuring that the overall traffic levels are comparable between both environments (see the quick example below).
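As a back-of-the-envelope illustration of that trade-off, with entirely made-up numbers (not our real traffic figures), the per-restaurant order rate simply scales up to keep total orders per minute in line with production:

```python
# Hypothetical figures purely for illustration - not our real traffic numbers.
prod_restaurants = 30_000
prod_orders_per_restaurant_per_min = 0.2
prod_orders_per_min = prod_restaurants * prod_orders_per_restaurant_per_min  # 6,000

chaos_restaurants = 300  # enough to bypass caching, far fewer than production

# Each chaos restaurant must order more often so total traffic matches production.
chaos_orders_per_restaurant_per_min = prod_orders_per_min / chaos_restaurants
print(chaos_orders_per_restaurant_per_min)  # 20 orders/min per restaurant
```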
Chaos Testing: The New Approach
Let’s move beyond the setup and focus on what our new approach enables us to do differently.
In essence, our experiments—and more crucially, our findings—now have a significantly broader scope.
Here’s a list of key things we can do now that simply weren’t possible before:
- Holistic Environment Analysis: We now observe the entire environment as a cohesive unit. This means we can introduce a fault in Application A and precisely observe its impact, not only on Applications B and C but also on the overall traffic flow within the environment.
- For example, if an application involved in order processing starts responding slowly, it may not directly cause errors in other applications, but it can result in a lower number of orders being processed, a change detectable in the traffic patterns.
- Enhanced and Diverse Chaos Experiments: We’ve upgraded our toolset to include a wider range of chaos experiments:
- Disruption of AWS resources such as databases, queues, and caches (see the sketch after this list).
- Breakdowns in applications, targeting single or multiple applications, individual or random pods.
- Network issues including packet loss, response/request delays, packet corruption, duplication, and limited bandwidth.
- CPU/Memory stress tests.
- DNS fault injections.
- [Coming Soon] Direct integration with the JVM inside applications to induce misbehaviour.
- Ease of Experimentation for Teams: Teams whose applications have been onboarded into this initiative can trigger experiments with a single click or regularly on a schedule.
- Scheduled Random Experiments: We run random experiments thrice weekly, with automated reports posted to Slack after each session.
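To make the AWS-resource experiments a little more concrete, here is a minimal sketch of the kind of call that forces a Multi-AZ database failover using boto3. The region and instance identifier are hypothetical, and in practice calls like this sit behind our experiment tooling, with safeguards and scheduling around them:

```python
import boto3

rds = boto3.client("rds", region_name="eu-west-1")  # hypothetical region

# Force a failover of a Multi-AZ RDS instance to its standby replica,
# so we can observe how dependent services cope with the switchover.
rds.reboot_db_instance(
    DBInstanceIdentifier="orders-db-chaos",  # hypothetical instance name
    ForceFailover=True,
)
```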
Not bad, right? Not bad at all.
But Wait, There’s More
A few months into this project, while observing our production-like chaos environment, we realized its potential extended beyond just chaos experiments.
- Testing Emergency Alerting Systems:
- We evaluate whether alerts are effectively reaching the right teams through the proper channels. For instance, if an experiment triggers a problem but no automated alert is raised, it indicates a need for a new alert in the production environment.
- Training On-Call Teams:
- We conduct regular training sessions to ensure that our on-call personnel are well-prepared to handle any potential issues.
- Pre-Production Deployment Testing:
- Our chaos environment serves as a testing ground for new application versions. Thanks to traffic levels matching those of our production environment, we can accurately gauge whether new versions perform as expected. This step is integrated into our deployment pipelines, and any failure here prevents potentially faulty versions from reaching production.
- Rollback Testing:
- We also test rollback processes to ensure that, in case of issues, we can revert to a previous version smoothly. If a rollback isn’t possible, we halt the release.
These are just a few examples of the additional capabilities of our chaos environment, showcasing its versatility and value.
Conclusion
Still here? Awesome!
As you’ve seen, our approach to chaos testing is unconventional and, admittedly, not flawless. It has required considerable effort and hard work to fine-tune our system, and we’re still working on simplifying certain aspects, particularly the onboarding of new applications (which can be the hardest step).
Despite these challenges, our efforts have been worthwhile. They have significantly bolstered our confidence in rolling out new changes and reinforced the resilience of our services to withstand potential issues.
Interested in learning more? Great! Stay tuned, as we plan to delve deeper into specific elements of our setup in upcoming articles.
If you like what you’ve read and you’re someone who wants to work on open, interesting projects in a caring environment, check out our full list of open roles here – from Backend to Frontend and everything in between. We’d love to have you on board for an amazing journey ahead.