How Contract Testing Helped Our Organization to Prevent Production Incidents

28.11.24 by Alex Xandra Albertsim, Ramesh Kumar, Shrinivas Khawase

This article describes how we lowered production incidents by 14% and enabled our backend developers to test API contracts proactively during development, improving the stability and reliability of our services.

In our department at Delivery Hero, testing the integration between two microservices and their API contracts was challenging because of the complicated infrastructure setup. Testing an API integration required both services to be available and communicating synchronously, which was often not possible for reasons we cover later in this post.

Contract testing is a methodology for ensuring that two separate systems (such as two microservices, or a client and a server) are compatible with each other. It aims to document the expectations of the two parties in an integration, automate these expectations as test cases, and then establish the test runs as a quality gate in the release process.

How did we get here?

We were experiencing production incidents due to broken API contracts between our microservices. This resulted in order loss and degraded customer experience.

During the root cause analysis of some of our incidents, we found cases where no tests covered the API contracts between microservices. When we interviewed developers, we realized that the existing testing environment was not suitable for running such tests.

Developers could not use the staging environment to execute their tests because it hosts numerous backend services. It is prone to issues such as downtime of individual microservices, configuration mismatches, and data corruption, all of which complicate the testing process.

Teams were changing their own API specifications without communicating those changes to the teams whose microservices consumed the changed responses.

Our primary motivation for seeking a solution was to minimize reliance on the staging environment and enable microservice integrations to be tested independently. This was crucial in a large organization such as ours, where teams work in silos and in different time zones, and where changes to API contracts are sometimes unavoidable.

We knew we couldn’t stop contract changes, as they were necessary for business needs. We also knew that spinning up a new testing environment wasn’t an option because of the cost and the complexity of managing and maintaining environments. At that point, we started to look for alternatives.

Before we could find a solution, we had to be clinical and go through the incidents in the past and write down the exact use cases.

When we ran the numbers, we found that almost 5% of the production incidents that occurred in 2023 and 2024 (roughly 5 in every 100) were due to contract violations. In business terms, the company lost approximately 60,000 customer orders to these violations. Until then, we had not been able to detect contract violations before they reached production.

We observed that these failures were caused by contract changes between microservices. We decided to explore ways to test microservices independently of each other, without needing the other service to be available while testing the integration between the two.

Exploration

We began exploring ready-to-use solutions available in the market. Our major considerations during the tool exploration were:

Learning curve to understand the tool
Easy to set up Infrastructure
Effortless CI/CD Integration
Security and SSO Integration
Cost Efficiency

The most popular choice for contract testing turned out to be Pact.io. Pact is a code-first contract testing tool that offers libraries for writing contract tests in the various languages we use. It checked all the boxes we wanted: it is code-first, flexible in the types of integrations it supports, and reliable, because it introduces standard tools for contract creation, stubbing, mocking, and verification.

Pact also offers a Pact Broker, which acts as centralized storage for contracts. It handles version control, persists verification status across versions of the software, and integrates simply with CI/CD. The broker can be deployed on our own infrastructure. A cloud-based managed service called Pactflow is also offered and adds some extra features on top of the Pact Broker, such as smarter and more user-friendly visualization and fine-grained access control.

We first explored the Pact Broker by hosting it ourselves. However, this involved maintenance effort, infrastructure hosting costs, and a team to maintain the setup.

That’s when we decided to use Pactflow, the enterprise version of Pact.io, since it also provides bi-directional contract testing capabilities (which cross-check all consumer-driven contracts against the providers’ OpenAPI Specification documentation) and supports our enterprise SSO setup.

Our Experience

We conducted a proof of concept (PoC) with Pactflow. Our experience is listed below.

Easy to set up Infrastructure: We asked Pactflow for a test account to conduct our PoC. Pactflow gave us a managed, centralized Pact Broker that provides a single repository for storing and retrieving contracts between consumer and provider services. This helped a lot, as we could focus directly on writing pact tests without worrying about managing the infrastructure.
Effortless CI/CD Integration: We were able to seamlessly integrate Pactflow with CI/CD pipelines, allowing automated contract testing as part of the build and deployment process. All that we had to do was store the Pactflow secret token in our pipeline and use Pact SDKs to publish the Pact files.
Versioning and History: We were able to publish the pact contracts with different versions. Pactflow supports contract versioning, enabling teams to manage changes over time. It also kept a test matrix or history of pact interactions which can be viewed in Pactflow’s User Interface.
Support for Multiple Languages: We wrote tests in Golang, Java, and Node.js. The integration of pact libraries was smooth and we did not face any challenges.
Bi-Directional Tests: We also wrote a bi-directional test, and it worked well with the Pact PoC tests that were written in a repository outside Delivery Hero.
Report & Result: Results are captured well in Pactflow and it shows all the interaction reports in detail. The graphical representation of the result is also quite good.
Security and Access Control: Pactflow has SAML SSO support which reduces the burden of keeping the solution security compliant.

Solution

Consumer-Driven Contract Testing

Consumer-Driven Contract Testing (CDCT) in Pact is a testing methodology where the consumer of an API defines the expected behavior (contract) from the provider, ensuring that the provider adheres to these expectations. It helps in verifying that both sides (consumer and provider) agree on the structure and behavior of the API, reducing integration issues in distributed systems.

Key Concepts

Consumer: A consumer is any service or application that consumes data or functionality provided by another service (provider).
Provider: A provider is any service or application that provides data or functionality to be consumed by another service (consumer).
Pact: A Pact is a contract that defines the expectations of interactions between a consumer and a provider. It specifies the requests that the consumer will make and the responses that the provider should return.
Pactflow: Pactflow is a platform that facilitates the creation, management, and sharing of Pacts between consumers and providers. It’s a managed enterprise edition SaaS application that hosts the pact broker which stores the contracts.
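
To make these concepts concrete, here is a minimal sketch in plain Python of what a contract and its verification boil down to. It is not Pact’s API and not our production code: the service names, the endpoint, the field names, and the verify_interaction helper are all hypothetical, and a real Pact file carries richer matching rules than the simple type check used here.

```python
# Illustrative only: a hand-rolled stand-in for what a contract-testing tool automates.
# Service names, endpoint, and field names are hypothetical.

# The consumer records the request it will make and the response shape it relies on.
contract = {
    "consumer": "checkout-app",
    "provider": "order-service",
    "interactions": [
        {
            "description": "fetch an existing order",
            "request": {"method": "GET", "path": "/orders/42"},
            "response": {
                "status": 200,
                # Field name -> expected Python type of the value.
                "body": {"order_id": int, "status": str},
            },
        }
    ],
}


def verify_interaction(interaction, actual_status, actual_body):
    """Return a list of mismatches between the contract and a provider response."""
    expected = interaction["response"]
    problems = []
    if actual_status != expected["status"]:
        problems.append(f"status: expected {expected['status']}, got {actual_status}")
    for field, expected_type in expected["body"].items():
        if field not in actual_body:
            problems.append(f"missing field: {field}")
        elif not isinstance(actual_body[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    return problems


interaction = contract["interactions"][0]
# Extra provider fields are fine; the consumer only pins down what it actually uses.
ok = verify_interaction(interaction, 200, {"order_id": 42, "status": "PLACED", "note": "x"})
broken = verify_interaction(interaction, 200, {"order_id": "42"})
print(ok)      # []
print(broken)  # ['wrong type for field: order_id', 'missing field: status']
```

The point of the sketch is the direction of the check: the consumer states only what it needs, and the provider is verified against that statement without both services having to run at the same time.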

Deep Dive

Let us walk through the CI pipeline we implemented at our organization.

1. Simple approach to running contract tests on the CI pipeline

We began with a straightforward approach in which the consumer creates and publishes contract files on every commit to a pull request. After the contracts are published, the consumer’s pipeline triggers the verification process on the provider’s side and waits for the results to be published to Pactflow.

Once the verification results are available, the pipeline executes the “Can I Deploy” command to determine if the consumer’s feature branch is compatible with the provider’s main branch. If they are found compatible the consumer merges its code into its main branch. However, if compatibility fails, it is concluded that the contract test has failed, requiring developers to debug and resolve the issue on the consumer’s side.

On the provider side, the process begins by pulling the consumer’s contract published from its main branch. Provider tests are then executed against these contracts on its feature branch, and the verification results are published to Pactflow. Finally, the provider pipeline runs the “Can I Deploy” command to check if the provider’s feature branch is compatible with the consumer’s main branch. If they are found compatible the provider merges its code into its main branch.

However, if compatibility fails, it is concluded that the contract test has failed, requiring developers to debug and resolve the issue on the provider’s side.
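
The gate at the end of both pipelines can be pictured with the following minimal sketch. It is an in-memory stand-in for the idea behind the “Can I Deploy” check, not the real command or Pactflow’s data model: the verification_matrix dictionary, the version labels, and the can_i_deploy function are assumptions made for illustration.

```python
# Hypothetical in-memory stand-in for the verification results a broker stores.
# Keys are (consumer version, provider version); values are the verification outcome.
verification_matrix = {
    ("consumer-feature-abc", "provider-main-1"): True,
    ("consumer-main-7", "provider-feature-xyz"): False,
}


def can_i_deploy(consumer_version, provider_version, matrix):
    """Allow a merge only if this exact pair has a successful verification.

    A missing entry means the pair was never verified, which is treated as unsafe.
    """
    return matrix.get((consumer_version, provider_version), False)


# Consumer pipeline: feature branch checked against the provider's main branch.
consumer_gate = can_i_deploy("consumer-feature-abc", "provider-main-1", verification_matrix)
# Provider pipeline: feature branch checked against the consumer's main branch.
provider_gate = can_i_deploy("consumer-main-7", "provider-feature-xyz", verification_matrix)
# A pair that was never verified must not pass the gate.
unverified = can_i_deploy("consumer-feature-new", "provider-main-1", verification_matrix)

print(consumer_gate, provider_gate, unverified)  # True False False
```

Treating an unverified pair as a failure is the design choice that makes the gate useful: a missing result blocks the merge instead of silently letting it through.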

This approach comes with its own drawbacks. In this pipeline, the consumer triggers the verification every time new code is committed to a pull request on the consumer’s side. This takes additional time even when the contract has not changed since the last commit. It consumes CI resources and leaves developers waiting each time for the provider to verify the same contracts. The next section shows how we solved this problem.

2. Optimised approach to running contract tests on the CI pipeline

In this approach, we follow similar steps with one additional check. The consumer publishes its contract to Pactflow’s broker. Upon publishing, Pactflow generates an event called “contract_requiring_verification_published”, which we monitor. If this event is present, the provider’s pipeline is triggered to verify the newly published consumer contracts from the feature branch. Conversely, if the event is absent but the contract was published successfully, the verification step is skipped and the process proceeds directly to the “Can I Deploy” step. From there on, the process is the same as in the previous approach.

The steps on the provider’s side also remain the same as the previous approach.

This approach allows us to save significant amounts of time for our developers as we trigger verification only when the contract has changed. We also save our CI resources and the cost associated with running the verification every time.
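
The additional check can be sketched as a small decision function. This is a simplified illustration under stated assumptions, not our actual pipeline code: the plan_consumer_pipeline function, the step labels, and the representation of events as a list of strings are hypothetical. Only the event name “contract_requiring_verification_published” comes from the flow described above.

```python
VERIFICATION_EVENT = "contract_requiring_verification_published"


def plan_consumer_pipeline(published, events):
    """Return the ordered steps for the consumer pipeline after a publish attempt.

    published: whether the contract was published successfully.
    events: names of the events observed after publishing (hypothetical representation).
    """
    if not published:
        return ["fail: contract could not be published"]
    steps = []
    if VERIFICATION_EVENT in events:
        # The contract content is new to the provider, so verification is required.
        steps.append("trigger provider verification")
        steps.append("wait for verification results")
    # With or without a fresh verification, the compatibility gate always runs.
    steps.append("can-i-deploy")
    return steps


changed = plan_consumer_pipeline(True, [VERIFICATION_EVENT])
unchanged = plan_consumer_pipeline(True, [])
print(changed)    # ['trigger provider verification', 'wait for verification results', 'can-i-deploy']
print(unchanged)  # ['can-i-deploy']
```

When the contract is unchanged, the pipeline collapses to the gate alone, which is where the saved time and CI resources come from.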

Results

We were able to prevent most of the contract violations attempted from the consumer or provider side on API endpoints that were covered by contract tests.

One example we want to mention is the estimated delivery time that we show to our customers. The provider application had implemented a new way to provide the estimated delivery time to our clients. Assuming that clients were already using the new fields to read the estimated time, the provider team created a pull request against its main branch to remove the old fields. Because the clients had a contract test written for this use case, the provider’s CI pipeline reported an error when the provider tried to remove the old fields.

Before (Missing Estimated Delivery Time)

After (Correct Estimated Delivery Time)

[Example of what contract test prevented]
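
The estimated delivery time case can be sketched as a check of every consumer’s expectations against the fields the provider intends to return. This is a simplified illustration rather than the actual contract test: the field names, the client names, and the broken_consumers helper are hypothetical, since the real field names are not part of this article.

```python
# Hypothetical field names; the actual fields are not named in this article.
OLD_FIELD = "estimated_delivery_time"
NEW_FIELD = "delivery_estimate"

# Fields each consumer's contract says it reads from the provider response.
consumer_contracts = {
    "mobile-client": {"order_id", OLD_FIELD},
    "web-client": {"order_id", NEW_FIELD},
}


def broken_consumers(provider_fields, contracts):
    """Return {consumer: sorted missing fields} for each contract the response no longer satisfies."""
    return {
        name: sorted(required - provider_fields)
        for name, required in contracts.items()
        if not required <= provider_fields
    }


before_change = {"order_id", OLD_FIELD, NEW_FIELD}
after_change = {"order_id", NEW_FIELD}  # the pull request removes the old field

print(broken_consumers(before_change, consumer_contracts))  # {}
print(broken_consumers(after_change, consumer_contracts))   # {'mobile-client': ['estimated_delivery_time']}
```

In this sketch the provider’s assumption that every client had migrated is wrong for one client, and the check surfaces that on the pull request instead of in production.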

For example, in the image below you can see that in July a total of 5.2% of feature builds on CI tried to update an existing contract and failed because the contract tests did not pass. We also saw that 3.57% of the builds from the main branch were prevented from deploying to production because they tried to update a contract in a way that wasn’t compatible with the provider’s consumer.

In October, contract tests stopped 14.29% of the contract violations on the main branch before they could be deployed to production. Running contract tests only on pull requests is therefore not enough; they must also be executed before each deployment. They are lightweight and take up hardly any of your CI pipeline’s bandwidth. Put differently, if we had 100 incidents this year, roughly 14 of them would have been related to contract violations had there been no contract tests in place.

[*Percentage of CI builds failed on main branch and feature branch because of contract violations*]

In the image below, you can see the month-on-month number of contract violations prevented by contract tests.

[*Number of times we prevented contract violations*]

What’s next for us?

This lightweight approach to testing will benefit the organization in the long run and reduce the friction and chaos that is caused by dependent testing environments. We plan to extend this to our mobile clients and improve contract test adoption in the future.

Conclusion

In conclusion, we strongly believe that customer satisfaction and experience are paramount. When customers don’t see a data point they need to make a decision while placing an order, their order placement experience suffers significantly.

Contract testing has enabled us to ship faster with confidence and prevent production incidents. We plan to continue our investment and advocate for the importance of contract tests.

If you like what you’ve read and you’re someone who wants to work on open, interesting projects in a caring environment, check out our full list of open roles here – from Backend to Frontend and everything in between. We’d love to have you on board for an amazing journey ahead.