28/09/21

Chaos testing with Toxiproxy

By Nicolas Reymundo

Picture this: you have your shiny new application already in production or just about ready to go live. You have gone through the usual testing stages and have pretty good unit, integration and end-to-end coverage (because we all have totally valid suites at each stage, right?). You even went the extra mile and added load testing to make sure the application could handle whatever traffic you expect it to face, and then some. All good, right?

Peer review and test automation suites take care of finding bugs and problems before they hit production. CI/CD and canary releases handle rolling back if anything slips through, minimizing damage. Load testing makes sure that things don’t catch fire on a Friday night because traffic has suddenly doubled. When all that is complete, it’s time to go home!

Wait… maybe not just yet. 

Sure, you have everything covered right until your application goes live, but what happens afterwards? What happens if one of the third-party services it uses starts having issues? Or if there’s a connectivity problem somewhere in the middle and it starts dropping packets? Do you know what would happen if you tried to do a deployment in the middle of your peak usage, when the service is maxed out and already using all of its resources?

Welcome to the really interesting, and more-often-than-not frustrating as hell, world of chaos testing.

What is chaos testing?

Chaos testing is an approach in which you verify and understand a system’s stability by introducing issues and adverse conditions, either in a controlled environment or in an organized manner (if doing it directly in production).

The idea is not just to know what can go wrong and assume how the application would behave in that situation, but rather to be proactive: cause those failures, and observe how each change or condition affects your application. This lets you know for certain the answer to questions like “What would happen if X third-party service went down?” or “How does a slow database affect my service?”, instead of finding out when it’s already too late. 

Netflix is arguably one of the most famous examples of chaos engineering, with its now-retired Simian Army and the open-sourced Chaos Monkey being great examples of how this can be done. 

Our own approach to chaos engineering is nowhere near as complex and polished as Netflix’s. We don’t chaos test in prod, for example; instead we use prod-like load testing environments. Still, in just a year of actively working on this, we have managed to make quite a lot of progress. 

From Zero to Hero 

Our starting point looked like this: 

  • We had a separate load test environment in which all the core applications were deployed with resources matching production. 
  • The same CI/CD pipeline that deploys our services to production also updated that environment, ensuring that our tests ran against the same codebase clients were using. 
  • If an application required third party services, then those were mocked using WireMock, with everything configured in such a way that the WireMock server would respond with payloads comparable in size and in response time to what we would see in production. 
  • We already had load testing in place to know what the baseline performance would be in a ‘happy case’ scenario. 

What we wanted to accomplish:

  • At the very least, we wanted the ability to cover scenarios in which the dependencies of a service are responding slower than usual, or are otherwise degraded. 
  • Ideally, this should include scenarios with high latency as well as outright timeouts. We would focus first on the issues we had already experienced in production, but expand to investigating how each service behaves when each and every single one of its dependencies has issues… regardless of how realistic that particular dependency failing may or may not be. 
  • We wanted to reduce, as much as possible, the overhead of deploying new scenarios for existing services and of starting coverage for new services. 
  • We wanted to accomplish all this without having our existing infrastructure balloon out of control – with different versions of the same things deployed to multiple places.

How we used to do it (the hacky way, with WireMock):

In the past, we used to have the application or service deployed in a separate environment, and we would replace as many dependencies as we could with a standalone WireMock service tailored to that application.

For example, if the application in production would query a Google API, then in the load test environment it would point to a WireMock installation that had the expected Google API endpoint stubbed and would return a suitable response. 

If we wanted to create a scenario with any kind of degradation or slow response times, we changed those mocked endpoints to take longer to respond or return an error code.
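To give an idea of what that looked like: a WireMock stub mapping is just JSON, so we can build one in Python. The endpoint path, payload and timings below are made up for illustration; `fixedDelayMilliseconds` is the WireMock response setting that holds a response back to simulate a slow dependency.

```python
def stub_mapping(url_path: str, status: int = 200,
                 body: str = "{}", delay_ms: int = 0) -> dict:
    """Build a WireMock stub mapping; delay_ms > 0 simulates a slow endpoint."""
    mapping = {
        "request": {"method": "GET", "urlPath": url_path},
        "response": {
            "status": status,
            "body": body,
            "headers": {"Content-Type": "application/json"},
        },
    }
    if delay_ms:
        # WireMock delays the stubbed response by this many milliseconds
        mapping["response"]["fixedDelayMilliseconds"] = delay_ms
    return mapping

# A slow (3 s) stub for a hypothetical Google API path, and a plain 500:
slow = stub_mapping("/maps/api/geocode/json", delay_ms=3000)
error = stub_mapping("/maps/api/geocode/json", status=500)
```

Each degradation meant a different set of mappings like these, which is exactly the duplication described below.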

While this worked decently well, we did run into several shortcomings – mainly due to using a tool for something that it wasn’t designed for.

Problems we encountered:

  • Changing these conditions, or the response to any endpoint, required restarting the mocking service, which in turn resulted in a temporary burst in error rate that tainted the results. 
  • Working around that issue required multiple instances of not only the mocking service (each with different configurations), but also of the application to test. 
  • Mocking some services was very cumbersome or downright impossible. Some examples were queues, databases and other non-HTTP services.

Our new and improved approach:

Enter Toxiproxy

The TL;DR is that Toxiproxy is a framework for simulating network conditions. It sits in the middle, between an application and its dependencies, and proxies any TCP traffic to its destination. 

By default, it is a transparent proxy, meaning it just forwards everything back and forth without altering the connection (and, in very extensive A/B testing on our side, without even adding any noticeable latency). However, it can also simulate issues: adding latency (on the upstream, on the downstream, or both), simulating a complete outage by toggling endpoints off, or mimicking bandwidth constraints. It is also highly extensible, so you can create your own custom conditions as needed. 

On top of that, it exposes a control API, so we can make changes on the fly in the middle of any testing pipeline (and script those changes to happen automatically!). This lets us see in real time what happens when the application encounters an issue and, just as importantly, how it recovers when the degradation is gone. 
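As a rough sketch of what driving that control API looks like, adding a latency toxic to a proxy is a single POST against Toxiproxy’s admin API (port 8474 is its default; the proxy name and address here are illustrative):

```python
import json
import urllib.request

TOXIPROXY_ADMIN = "http://localhost:8474"  # default Toxiproxy admin port

def latency_toxic(latency_ms: int, jitter_ms: int = 0,
                  stream: str = "downstream") -> dict:
    """Payload for Toxiproxy's built-in 'latency' toxic."""
    return {
        "name": f"latency_{stream}",
        "type": "latency",
        "stream": stream,         # "upstream" or "downstream"
        "toxicity": 1.0,          # apply to 100% of connections
        "attributes": {"latency": latency_ms, "jitter": jitter_ms},
    }

def apply_toxic(proxy_name: str, toxic: dict,
                admin: str = TOXIPROXY_ADMIN) -> None:
    """POST a toxic to /proxies/<name>/toxics on the admin API."""
    req = urllib.request.Request(
        f"{admin}/proxies/{proxy_name}/toxics",
        data=json.dumps(toxic).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)

# e.g. apply_toxic("google_api", latency_toxic(5000))  # +5 s per response
```

A `POST /reset` to the same admin API removes all toxics again, which is what we rely on between runs.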

In our own specific implementations, this meant that we were able to remove all the ‘purpose specific’ WireMock services that were imitating degradations, and instead leave just one per application, configured only for the happy scenario. We deployed Toxiproxy in between the application and the WireMock service and used its admin API to trigger specific network conditions on the fly. 

An example of a chaos testing scenario that this allows us to do is something like this:

  1. We slowly ramp up simulated traffic into our application until it reaches roughly the standard load we see in production. 
  2. We let the service stabilize for a few minutes to make sure everything is OK, and that we see the same behaviour as production. 
  3. After 15 minutes of warm up, we trigger the Toxiproxy API to create a spike in latency. From now on, one of the service’s dependencies takes 5 times longer to respond than expected. 
  4. We use Grafana, Datadog and other monitoring tools to record how the application behaves during this degraded period. 
  5. 20 minutes after that initial degradation, we send a request to the Toxiproxy API again, this time undoing the changes and going ‘back to normal’. 
  6. We record how the application recovers. 
  7. 10 or 20 minutes later the test finishes – everything gets cleaned up and downscaled and the pipeline is considered to have passed or failed based on the observed metrics.
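The timeline above can be sketched as a small script that runs alongside the load test. The admin address, proxy name and durations are placeholders; `POST /reset` is the Toxiproxy endpoint that removes all toxics and re-enables every proxy:

```python
import time
import urllib.request

def build_timeline(warmup_min=15, degraded_min=20, cooldown_min=20):
    """The chaos window as (minutes_to_wait, action) pairs."""
    return [
        (warmup_min, "apply_latency"),  # after warm-up, the 5x latency kicks in
        (degraded_min, "reset"),        # degradation ends, back to normal
        (cooldown_min, "finish"),       # record recovery, then clean up
    ]

def run(admin="http://localhost:8474", proxy="dependency"):
    """Drive the schedule against the Toxiproxy admin API (sketch)."""
    for minutes, action in build_timeline():
        time.sleep(minutes * 60)
        if action == "apply_latency":
            ...  # POST a latency toxic to f"{admin}/proxies/{proxy}/toxics"
        elif action == "reset":
            # POST /reset removes all toxics and re-enables all proxies
            urllib.request.urlopen(
                urllib.request.Request(f"{admin}/reset", data=b"",
                                       method="POST"))
```

Monitoring and the pass/fail verdict stay outside this script; it only choreographs the degradation.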

Practical examples

We use Spinnaker as the orchestrator for all our pipelines. Here are a couple of examples of what that looks like for us, and some idea of what we do in each one. 

Happy Path

  1. We scale everything up. 
    • The application to be tested. 
    • The mocking service that will replace all 3rd party dependencies that the application might need to function. 
    • A number of Toxiproxy pods that will sit in between the application and the mocking service.
    • Locust is what we use for our load testing framework. We scale up a single master pod that will coordinate the load test and a number of workers that will execute everything. 
  2. A Python script adds a new placeholder entry to a spreadsheet so we can go back to it and fill in the details at the end of the load test. This way we can run multiple load tests in parallel, with all of them logging data into the same spreadsheet without risking one overwriting another. 
  3. Once the Toxiproxy pods are up, we send a “Reset” command to make doubly sure everything is in a clean state and we don’t have leftovers from previous runs tainting our results. 
  4. “Start and monitor load test” does what it says on the tin. 
  5. After the load test is finished, we grab metrics from different sources (Datadog, Prometheus, etc.) and replace the placeholder created in step #2 with the actual values. 
  6. Everything gets downscaled back to 0 active replicas. That way we are not using resources (as in, not spending money) while nothing is running. 
  7. At the very end of the pipeline, we have the option to run a script that queries whatever metrics we might need to decide if the load test as a whole was successful or not, and passes/fails the pipeline accordingly. 
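That final pass/fail decision boils down to comparing the metrics we gathered against thresholds. A minimal sketch of such a gate (the metric names and limits below are hypothetical, not our real thresholds):

```python
def gate(metrics: dict, thresholds: dict) -> bool:
    """Pass the pipeline only if every observed metric is within its limit.

    A metric missing from the observations counts as a failure, so a broken
    metrics query can't silently pass the run.
    """
    return all(metrics.get(name, float("inf")) <= limit
               for name, limit in thresholds.items())

# Hypothetical numbers pulled from Datadog/Prometheus at the end of a run:
observed = {"error_rate_pct": 0.4, "p99_latency_ms": 820}
limits   = {"error_rate_pct": 1.0, "p99_latency_ms": 1000}
# gate(observed, limits) -> True: the run passes
```

The pipeline stage simply exits non-zero when the gate returns False, which Spinnaker then reports as a failed run.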

Not-so-happy Path

  1. Most of the pipeline is the same as the example above. All the stages present are still here and do the exact same thing. 
  2. In parallel to starting and monitoring the load test, we now also have a branching path that does the following: 
    • Waits an arbitrary amount of time (15 minutes for example). This is to give the load test time to ramp up and get to the traffic level we want. 
    • After the wait period, the next stage triggers a call to the Toxiproxy API to apply the specific network condition we want. This could be latency, timeouts, a bandwidth issue, a service suddenly failing, etc. 
    • Another wait period will mark how long this new degraded state will last. Let’s say 20 minutes for this example. 
    • After those 20 minutes, there is another reset stage that will remove all conditions – setting everything back to ‘normal’. 
  3. Everything in step #2 will happen in parallel with the load test so the end result will effectively look like this: 
    • 0 minutes: Load test starts -> Everything is OK. 
    • 15 minutes: Suddenly the application starts experiencing high latency when contacting third party services. 
    • 35 minutes: High latency resolves itself. Everything is back to normal. 
    • 60 minutes: Load test ends, results are gathered and final judgement is rendered. 

Pros & Cons

As you can see from the examples above, we can cover a wide variety of scenarios by just creating new pipelines and making small tweaks, adding/removing stages. We are reusing all the same infrastructure, while at the same time keeping complexity and costs relatively under control. 

These are some of the highlights we’ve gotten from switching to this setup: 

  • It allows us to use a single environment for every scenario. Since by default Toxiproxy acts as a fully transparent proxy, we can have it present at all times – even when we don’t want any adverse networking conditions in effect. 
  • Along the same lines, having it ‘available’ at all times means that creating a new scenario is just a matter of adding a new API call to our load testing pipeline so that it sets whatever networking conditions we want. Everything else stays the same. 
  • We can use it to proxy anything that uses standard TCP connections. So far we have done it with regular HTTP calls, MySQL databases, Redis, Google’s Firebase, AWS’ DynamoDB, and SNS/SQS.
  • It can work very nicely with Helm and Kubernetes (with some caveats). 

On the flip side, however, not everything has been sunshine and rainbows…

  • Handling HTTP traffic is super easy, but trying to handle HTTPS traffic is a nightmare we are not ready to face yet. 
  • We are pretty sure Toxiproxy wasn’t written to handle traffic above the 1 million requests/minute mark, as running that through a single instance (following the official recommendations) is just not feasible. It is easy to deploy it into a cluster with multiple pods, and that works great, but then you stumble into the next point… 
  • It is not designed to work in a distributed environment, and the pods don’t share any configuration or state. We’re working around this issue with great success, but it took some jury-rigging to get it to work.
    • If you run 5 pods and want to apply a network condition, you have to make sure you apply it to all 5 pods since there’s no state sharing between them. Otherwise you can end up in a scenario in which 1 pod has the expected network conditions, while the other 4 are still in the “everything is good” stage. 
    • In the same vein, if one of those 5 pods gets restarted in the middle of a run for whatever reason, it will come back up with the default “all is good” config, meaning you would have 4 pods with the desired settings and 1 just letting everything through. In practice, however, we have never seen this happen (yet). 
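Our workaround for the missing state sharing amounts to fanning every admin API call out to all pods, so no pod is left in the “everything is good” state. A minimal sketch (the pod URLs and toxic payload shape are as in Toxiproxy’s admin API; the hostnames are hypothetical):

```python
import json
import urllib.request

def toxic_endpoints(pod_admin_urls: list, proxy_name: str) -> list:
    """The same proxy's toxics endpoint on every Toxiproxy pod."""
    return [f"{url}/proxies/{proxy_name}/toxics" for url in pod_admin_urls]

def apply_everywhere(pod_admin_urls: list, proxy_name: str, toxic: dict):
    """Fan the same toxic out to every pod; applying it to only some pods
    leaves the rest passing traffic untouched, which taints the results."""
    for endpoint in toxic_endpoints(pod_admin_urls, proxy_name):
        req = urllib.request.Request(
            endpoint, data=json.dumps(toxic).encode(),
            headers={"Content-Type": "application/json"}, method="POST")
        urllib.request.urlopen(req)

# Hypothetical in-cluster admin addresses for a 5-pod deployment:
# pods = [f"http://toxiproxy-{i}.chaos.svc:8474" for i in range(5)]
```

The same fan-out is applied to “Reset”, so a fresh run always starts from a clean state on every pod.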

Closing thoughts

Over the last year or so, we’ve been seriously working on improving our load testing coverage and frameworks, which eventually led us into chaos testing territory, as we wanted to test beyond the happy path. 

As we went further down that rabbit hole, some of our code started to balloon out of control (duplication, lots of things to keep updated, restrictive limits on what errors we could trigger). Toxiproxy allowed us to remove a lot of those issues and re-use our infrastructure in a much smarter way.

Yeah sure, it had its issues and we had to patch things here and there, or come up with solutions to work around its own limitations (clustering being the biggest one so far), but nothing that some scripting couldn’t fix, and the benefits have by far outweighed the limitations. 

What started out as a proof of concept in a single team (using my own team’s load testing suite as guinea pig), has now become part of the standard practices almost every team in our department follows when load/stress/chaos testing their services. 

What comes next?

The world!… I mean… we’ll probably expand this to other teams in other areas of Delivery Hero. 

Then the world. 



Thank you, Nicolas, for sharing your learnings with us!

As always, we are still hiring, so check out some of our latest openings, or join our Talent Community to stay up to date with what’s going on at Delivery Hero and receive customized job alerts!