DevOps for Machine Learning – What does this mean, and why do you need it?

By Yann Landrin-Schweitzer

At DeliveryHero, we build for scale. With an objective of 10M orders a day in more than 30 countries worldwide, being able to scale our operations in a sustainable and cost-effective way is the primary driver of our engineering practices. DevOps is one of the methods and behaviors that allow us to achieve this scalability.

This is the first instalment of a series of blog posts looking in detail at the relationship between data, AI and DevOps.

What is DevOps?

“DevOps is the combination of cultural philosophies, practices, and tools that increases an organization’s ability to deliver applications and services at high velocity: evolving and improving products at a faster pace than organizations using traditional software development and infrastructure management processes.” (AWS: What is DevOps?)

In short, DevOps is a practice (ideally, not a team or a role!). Software engineers who practice DevOps not only write code, but are also responsible for maintaining and supporting their applications.

This practice emerged to enable high-velocity, high-quality and highly predictable delivery of software applications.

DevOps principles

DevOps principles assume that the majority of costs, failures and inefficiencies in the design, delivery and operation of software are due to its human component.

For this reason, DevOps tries to minimize direct human intervention by automating and optimizing all possible deployment, testing, monitoring, configuration and repair tasks.

To do this, it incorporates a number of more targeted practices that focus on automation, such as Continuous Integration and Deployment, Test Automation, Infrastructure as Code, and Site Reliability Engineering.
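To make Test Automation concrete, here is a minimal sketch of the kind of check a CI pipeline runs automatically on every commit, instead of relying on a human to review a deployment by hand. The function name `validate_config` and the configuration keys are hypothetical, not from any real tool.

```python
# Hypothetical automated check: validate a service deployment config before
# it is ever applied. In a CI pipeline, a failing check blocks the release.

def validate_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    if config.get("replicas", 0) < 1:
        problems.append("replicas must be at least 1")
    if not config.get("image"):
        problems.append("a container image must be specified")
    if "memory_mb" in config and config["memory_mb"] < 128:
        problems.append("memory_mb is below the 128 MB minimum")
    return problems

if __name__ == "__main__":
    good = {"image": "orders-api:1.4.2", "replicas": 3, "memory_mb": 512}
    bad = {"replicas": 0}
    print(validate_config(good))  # [] -> safe to deploy
    print(validate_config(bad))   # two problems -> the pipeline stops here
```

Because the check is code, it runs identically every time, which is exactly the point: removing the variability of the human component.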

Where is Data in DevOps?

The “typical” DevOps perspective often assumes that the two components of interest are software and infrastructure. Its scope covers infrastructure provisioning, scaling, failure management, application and infrastructure monitoring, as well as the deployment and orchestration of the software applications that run on that infrastructure.

In that context, data is often regarded as an ephemeral characteristic of the application. It is assumed to evolve in sync with the application, through the same set of processes, and to require no special treatment. Specific focus on data as a distinct entity with its own management is relegated to very “waterfall-ish” traditional database administration approaches.

So what changed?

Applications rely on and manipulate increasingly large and frequently changing datasets. These datasets evolve with markedly different rhythms and drivers from application logic, and have business impacts that far outlast their production, transformation and use in a single application version.

In that context, the same velocity and quality limitations that existed for software start becoming an issue for data, with amplified consequences due to the cost of acquiring valuable data.

Models of different datasets get out of sync; releases cannot be completely tested or rolled back because their data has moved on; the complexity of testing and validating all data cases makes frequent iterations impossible. In the end, the gains of applying a DevOps model to the software are negated by the effort of keeping data valid and correct in such a rapidly changing environment.

So teams are tempted to switch back to a more planned, waterfall-ish model, where ensuring data health is easier. But surely, there is a better way?

Can DevOps principles apply to data?

Extending DevOps principles to data brings a major increase in complexity. It requires investment and focus in a number of “new” topics that have generally received no attention up to that point: data generation, data testing, data versioning and data automation. “Old” topics also need renewed focus, and a much higher standard of accuracy and quality: the ability to test in production-replica environments, the accurate specification and documentation of business processes, the tracking and lineage of data flows in the organization, and the discoverability and visibility of business logic in and to all parts of the organization.
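Two of the “new” topics above, data testing and data versioning, can be sketched in a few lines. This is an illustrative toy, not a real data-quality framework: the function names, the row schema and the check rules are all assumptions made for the example.

```python
import hashlib
import json

def check_rows(rows):
    """Data test: every order row must have a positive amount and a country code."""
    errors = []
    for i, row in enumerate(rows):
        if row.get("amount", 0) <= 0:
            errors.append("row %d: non-positive amount" % i)
        if not row.get("country"):
            errors.append("row %d: missing country" % i)
    return errors

def dataset_version(rows):
    """Data versioning: a content hash that changes whenever the data changes,
    so a release can record exactly which data it was tested against."""
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]

rows = [{"amount": 19.9, "country": "DE"}, {"amount": 7.5, "country": "TR"}]
print(check_rows(rows))        # [] -> the data passes its tests
print(dataset_version(rows))   # a short fingerprint usable as a "data version"
```

The point of the fingerprint is that “the data has moved on” (the rollback problem above) becomes detectable: if the recorded version no longer matches, the release and its data are out of sync.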

In all these areas, most organizations that have not focused on data agility before have low maturity, which makes “DevOps for data” a large and scary investment.

What about DevOps and Machine Learning?

Machine Learning, and more generally Artificial Intelligence, is increasingly used to provide “smart” features (i.e. features that adapt to the context and the needs of the user) in software applications. Simply put, these techniques process historical data to detect patterns, generalize them into models, and apply those models to new data to predict future situations.
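The “process historical data, generalize, predict” loop can be illustrated with a deliberately tiny model: a hand-rolled least-squares line fit, kept dependency-free for clarity. Real systems would use an ML library; the data points here are made up.

```python
# Learn a pattern (slope and intercept) from historical (x, y) pairs,
# then apply it to a new, unseen x. This is the whole ML loop in miniature.

def fit(xs, ys):
    """Generalize historical observations into a model via least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def predict(model, x):
    """Apply the learned model to new data."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical historical data: order volume (in thousands) by week number.
model = fit([1, 2, 3, 4], [2.1, 3.9, 6.0, 8.0])
print(predict(model, 5))  # the learned pattern applied to a future week
```

Note that the model is entirely a product of its training data: change the history, and `fit` silently produces different behavior. That is the lifecycle problem the next paragraphs describe.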

AI and ML techniques therefore contain many “gray” or “black” box components: the patterns that have been learned from data. Adding them to the system landscape adds several extra dimensions of complexity to the application lifecycle. Not only is it even more critical than before to know what data has been used in each of the many stages of learning, but it is also more difficult to control exactly what logic is running in an application.

In addition to configuration, code, and now data, models also need to be taken into account. And models generated from data and training algorithms are particularly complex to manage. They are:

  • costly to generate (requiring massive data movement and processing), therefore changing and re-training models needs to be done deliberately and in a controlled fashion;
  • usually at least partially opaque (difficult to understand by “reading” them), therefore difficult to test in an isolated manner;
  • and critically dependent for meaningfulness on both source data and feature generation code, therefore much more complex to reproduce exactly.

This challenges the original DevOps assumption that delivery is essentially limited by its human component.

In other words, the AI component itself becomes a major source of cost and unpredictability.
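One common way to regain some control over that unpredictability is to record a fingerprint of everything a model depends on, so that any change in its lineage (source data, feature code, hyperparameters) is at least detectable and auditable. The sketch below is an assumption-laden illustration, not the API of any specific ML platform.

```python
import hashlib

def model_fingerprint(data_sha256, feature_code, hyperparams):
    """Combine a model's dependencies into one stable fingerprint.
    Identical inputs -> identical fingerprint; any change -> a new one."""
    h = hashlib.sha256()
    h.update(data_sha256.encode("utf-8"))          # hash of the training data
    h.update(feature_code.encode("utf-8"))         # feature generation source
    for key in sorted(hyperparams):                # sorted for determinism
        h.update(("%s=%r" % (key, hyperparams[key])).encode("utf-8"))
    return h.hexdigest()[:16]

fp1 = model_fingerprint("ab12cd34", "def features(x): return x / 100", {"lr": 0.01})
fp2 = model_fingerprint("ab12cd34", "def features(x): return x / 100", {"lr": 0.02})
print(fp1 != fp2)  # True: changing any dependency changes the fingerprint
```

This does not make the model less opaque, but it turns “can we reproduce this model exactly?” from a guess into a comparison of two strings.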

In our next post, we will talk about some options to address these challenges.