Integrating Machine Learning Systems

Issues arise from putting your machine learning model into a real life system. To actually make use of your model, you need a way to interact with it, which creates complications you can’t address simply by testing the model in isolation.

Machine learning is all about data. Each model in production will have its own specific ETL process: extracting and transforming the relevant data, then loading it in to get the right answers. How we make use of the results, on the other hand, depends on the purpose of the system.

Most of the time when we talk about the training phase and the using phase, we’re technically talking about offline models. But sometimes, you want the learning to keep going, and we call this online learning or continuous learning. In this case, the operational data gets fed into your model and answers pop out. The answer could be validated and be fed back into the model, and the learning algorithm updates the model accordingly. Integrating continuous learning systems into production requires special care, especially around monitoring the performance and parameters of the model.
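To make the feedback loop concrete, here is a minimal sketch of online learning, assuming a tiny linear model updated by one gradient step per validated answer. The class name `OnlineLinearModel`, the learning rate, and the simulated stream are all made up for illustration; a real system would use whatever incremental learner fits the problem.

```python
from dataclasses import dataclass, field


@dataclass
class OnlineLinearModel:
    """Tiny linear regressor updated one example at a time (hypothetical)."""
    n_features: int
    learning_rate: float = 0.1
    weights: list = field(default_factory=list)
    bias: float = 0.0

    def __post_init__(self):
        if not self.weights:
            self.weights = [0.0] * self.n_features

    def predict(self, x):
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def update(self, x, y_true):
        """One gradient step on squared error; returns the error before the update."""
        error = self.predict(x) - y_true
        self.weights = [w - self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]
        self.bias -= self.learning_rate * error
        return error


model = OnlineLinearModel(n_features=1)
# Simulated operational stream whose validated answer is y = 2x + 1.
stream = [([x / 10.0], 2 * (x / 10.0) + 1) for x in range(10)] * 200
for features, validated_answer in stream:
    model.update(features, validated_answer)  # learning keeps going in production
```

The value returned by `update` is the kind of signal you would want to monitor: if the errors start growing, something about the incoming data has probably changed.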

The process for building, deploying, and continuously improving a machine learning application is more complex than a traditional software solution that doesn’t include a learning component. Having a clear integration plan is crucial:

  • Data people build pipelines to make their data accessible.
  • ML scientists build and improve the ML model.
  • ML developers take care of integrating that model and releasing it to production.

Impacts from Users

Machine learning is all about the quality of data, and production data often come straight from users. This can lead to adversarial data. We call data adversarial when it’s designed specifically to confuse our models. We can use this to our advantage in training, but once our model is out there, we want to watch out for it.

Users will reliably behave in unpredictable ways. It often takes a human to conduct regular sanity checks and deal with exceptional situations. You might as well plan for it. When your model isn’t confident in its predictions, or when a user is getting frustrated, decide how you’ll deal with the hand-off. Have a human backup ready to look at the user queries or inputs and respond accordingly.
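The hand-off can be as simple as a confidence threshold. The sketch below assumes the model exposes a confidence score between 0 and 1; the function name `route_prediction` and the 0.8 threshold are placeholders to tune for your own system.

```python
def route_prediction(label, confidence, threshold=0.8):
    """Return the model's answer, or flag the query for a human backup."""
    if confidence < threshold:
        # The model is unsure: withhold its answer and hand off.
        return {"handled_by": "human", "label": None}
    return {"handled_by": "model", "label": label}


confident = route_prediction("refund_request", confidence=0.95)
uncertain = route_prediction("refund_request", confidence=0.40)
```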

Talk to people who have implemented similar systems to find out what they learned once their models started interacting with real humans. Try to anticipate adversarial behavior and have a plan for how to respond when the unexpected occurs. For your long-term success, communicating your limitations can be at least as important as aiming for perfection.

Time and Space Complexity of the Operational Environment

Nowadays machine learning systems are deployed not only on personal computers, but also on custom hardware: smart mobile devices, robots of various sizes, smart cars, and home electronic appliances. This difference in available resources between development and production environments is a major challenge we need to address before deploying any machine learning model to the real world. The main goal here is to make sure that your production machine learning models have reasonable performance. A few ways to get there:

  1. weigh the tradeoff between complex models (like neural networks) and simple models (like linear regression)
  2. write more efficient code, with fewer variables and data structures
  3. get rid of useless features early on
  4. choose between an interpreted language (Python) and a compiled language (C++)

The size of the model created by the machine learning algorithm can be tweaked as well. One major approach to this is called model compression, which aims at taking large machine learning models and converting them into something more efficient, without hindering the performance of the original.

One approach to model compression is known as the Student-Teacher approach.

Teacher: A large and complex neural network is trained as usual. You aim to find the model that gives the best performance that you need.
Student: A smaller and simpler structure is enforced; the student trains solely on the output of the teacher.

Surprisingly enough, under this approach you can end up with similar model accuracy without the intensive resource requirements demanded by the Teacher model.
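Here is a toy sketch of the idea. To keep it short and dependency-free it swaps the neural networks for stand-ins: the "teacher" is a nearest-neighbour averager that has to keep all the training data around, and the "student" is a two-parameter line. All names and numbers are invented for illustration; the point is only that the student is fitted on the teacher's outputs, never on the true labels.

```python
import random

rng = random.Random(0)
# Hypothetical training data: y is roughly 3x + 2 plus noise.
train = [(i / 50, 3 * (i / 50) + 2 + rng.gauss(0, 0.1)) for i in range(100)]


def teacher_predict(x, k=5):
    """'Large' teacher: averages the k nearest stored training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


# The student only ever sees the teacher's outputs on fresh inputs.
xs = [rng.uniform(0, 2) for _ in range(200)]
teacher_outputs = [teacher_predict(x) for x in xs]

# Student: a straight line fitted by ordinary least squares.
mean_x = sum(xs) / len(xs)
mean_t = sum(teacher_outputs) / len(teacher_outputs)
covariance = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, teacher_outputs))
variance = sum((x - mean_x) ** 2 for x in xs)
slope = covariance / variance
intercept = mean_t - slope * mean_x


def student_predict(x):
    """Small student: two numbers instead of the whole training set."""
    return slope * x + intercept
```

The student needs only `slope` and `intercept` at prediction time, while the teacher has to search the stored data on every call.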

Retrain and Update Models

We usually deploy a machine learning model to the production environment when we’re comfortable with its performance, hoping that operational data will come from a distribution similar to the data we trained with. But in the real world, it’s complicated.

The data we interact with are drawn at random from some probability distribution that we usually don’t know. Often the production data distribution drifts over time away from the training data distribution. That means we have to retrain the model on updated data if we want it to stay relevant.

The most direct way to measure this drift is to measure the performance of the model in production and compare it with the performance during training. If it’s getting significantly worse, it’s time to retrain. But in reality, there are a few complications, most frequently the delay in getting a true performance measure. It can take a while to collect the true answers needed to measure the change in performance precisely.
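A minimal sketch of that comparison might look like this. It assumes accuracy is the metric, that true answers trickle in late, and that a drop of more than 5 points counts as "significantly worse". The function name, the threshold, and the minimum sample size are all assumptions to adjust.

```python
def should_retrain(training_accuracy, production_outcomes,
                   max_drop=0.05, min_samples=100):
    """Decide whether production accuracy has fallen too far below training accuracy.

    production_outcomes holds one boolean per prediction whose true answer
    has arrived: True if the model was right, False otherwise.
    """
    if len(production_outcomes) < min_samples:
        return False  # too few delayed labels to judge yet
    production_accuracy = sum(production_outcomes) / len(production_outcomes)
    return (training_accuracy - production_accuracy) > max_drop


degraded = should_retrain(0.92, [True] * 80 + [False] * 20)   # 0.80 in production
healthy = should_retrain(0.92, [True] * 90 + [False] * 10)    # 0.90 in production
too_early = should_retrain(0.92, [False] * 10)                # not enough labels
```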

For some problems, you might just want a regular schedule for retraining your model. This can be a judgment call based on your understanding of the problem domain. The extreme version of this is online learning. Online learning approaches tightly integrate the building and use of the model. Although they still have a period of training and testing before deployment, they are designed to respond to each new data point and adjust the model at each point in time.

Retraining the model may simply mean training on the new data with the same model configuration and hyperparameters. When you know there are significant shifts, you might want to go through the whole model selection and hyperparameter tuning stage as well.

Version Control

Machine learning by its nature involves a lot of experimentation around the choice of algorithm, data sets, parameters, metrics, and so on.

Version control tools help us to monitor experimentation. We don’t yet have consistent standards for saving machine learning models, so it is hard to manage experiments in machine learning through standard versioning tools. How is versioning in machine learning different from software engineering?

Software engineering: tracking different versions of the source code
Machine learning: tracking a lot of other artifacts like big data files, trained model files, labels, coding, model parameters, etc.

In machine learning, source code versioning is not the end of your worries. You also have to be concerned about versioning large data files. Having a versioned machine learning system enables your team to iterate quickly without having to manually maintain logs or worry about the reproducibility of your experiments. It also helps provide some transparency about the different data sets that were used to train the model, and about things like feedback loops, privacy concerns, and many more.
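One lightweight way to tie an experiment to its exact data and parameters is to fingerprint them together. This is only a sketch with an invented helper, `fingerprint`, using Python's standard `hashlib` and `json` modules; dedicated data-versioning tools do much more than this.

```python
import hashlib
import json


def fingerprint(data_bytes, params):
    """Return a short, reproducible ID for a (data set, parameters) pair."""
    digest = hashlib.sha256()
    digest.update(data_bytes)
    # sort_keys makes the JSON, and hence the hash, independent of key order.
    digest.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()[:12]


# Hypothetical runs: same data and parameters, then a one-character data change.
run_a = fingerprint(b"age,income\n34,52000\n", {"model": "logreg", "C": 1.0})
run_b = fingerprint(b"age,income\n34,52000\n", {"C": 1.0, "model": "logreg"})
run_c = fingerprint(b"age,income\n35,52000\n", {"model": "logreg", "C": 1.0})
```

Here `run_a` and `run_b` get the same ID, while `run_c` gets a different one because the data changed. Storing this ID next to each trained model file makes it possible to tell later which data set produced which model.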

Knowledge Transfer and Reporting Performance

Communication is especially important at different stages with different individuals within the organization. Reports are a great way to keep everyone informed about the ongoing status of a project and to set a baseline of knowledge within your organization.

A well-structured report should feature the following:

  1. One-paragraph summary of the project that covers:
    • key stakeholders
    • project goal
    • methodology
  2. One-page executive summary of the goings-on in the project for this reporting period:
    • Successes and failures (challenges)
    • Upcoming month
    • Interesting items that might be beyond the project scope
  3. Leave room for details, plans, and shifts to your methodology
    • Graphs or summaries of data
  4. Have a section documenting challenges in the report
    • Set expectations as your project progresses

There are three major categories of people in the organization to whom you need to transfer knowledge and report:

  1. Your peers on the data team
    • A healthy data team will communicate regularly throughout the course of the project
    • Review code, share reports or dashboards.
    • Present findings, highlight lessons learned.
    • Getting agreement across the team about the results is the first step towards widespread adoption
  2. Non-technical stakeholders
    • the people who have major influence over the success of your project
    • don’t assume this audience understands or cares about the details.
    • show the value of the solution, think in terms of three different aspects:
      • cost-savings
      • efficiencies
      • development of new products
  3. The management team
    • Show how well their decision to invest in your machine learning project has helped the organization.

Be ready to take the opportunity to talk about your machine learning solution at any point over the course of the project. Tailor your communication to the needs of your audience.

Machine Learning Process Lifecycle (Recap)

The ML Process Lifecycle is a framework that captures the iterative process of developing a machine learning solution for a specific problem, from problem formulation to handing over the project to the client.

ML solution development is an exploratory and experimental process where different learning algorithms and methods are tried before arriving at a satisfactory solution. As you advance to different stages of the process and uncover more information, you may need to go back to previous stages of the MLPL to make changes or start over completely.

Remember, the process is split into four stages. The stages are iterative, but you can’t skip ahead:

1. business understanding and problem discovery
[ Objectives, problem definition, stakeholders communication,
data sources, resources and constraints, development environment,
existing practices, milestones ]
Next: 2
2. data acquisition and understanding
[ Data acquisition, cleaning, processing,
pipeline, EDA, feature engineering]
Next: 1, or 3
3. machine learning modeling and evaluation
[ Feature engineering, model training,
evaluation, selection, reporting]
Next: 1, 2, or 4
4. delivery and acceptance
[ ML solution, documentation, knowledge transfer, handoff]
Next: 1, 2, or 3
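The "Next" entries above amount to a small transition table. As a sketch, with the names `NEXT_STAGES` and `is_valid_path` invented here, you could encode it and check whether a project history respects the "no skipping ahead" rule:

```python
# Allowed moves between the four MLPL stages, copied from the "Next" lines.
NEXT_STAGES = {
    1: {2},        # business understanding and problem discovery
    2: {1, 3},     # data acquisition and understanding
    3: {1, 2, 4},  # machine learning modeling and evaluation
    4: {1, 2, 3},  # delivery and acceptance
}


def is_valid_path(path):
    """True if every consecutive pair of stages is an allowed transition."""
    return all(later in NEXT_STAGES[earlier]
               for earlier, later in zip(path, path[1:]))


looped_back = is_valid_path([1, 2, 3, 2, 3, 4])  # revisits stage 2, then delivers
skipped = is_valid_path([1, 3])                  # jumps over data understanding
```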

Lots of things cause a lifecycle reset and not all of them are possible to anticipate. Monitoring the nature of lifecycle switches can give you a measure of how your business is growing in the machine learning adoption stages. Understanding these stages along with the iterative and sometimes unpredictable nature is key to setting the right expectations among the stakeholders.

Post Deployment Challenges

Simply deploying a model in a production environment requires a lot of decisions about tools:

  • development stack
  • integration environment
  • model management tools
  • logging systems
  • tools to monitor the health of the live model

Deploying a machine learning model is just the start of a whole other process, in which model maintenance and monitoring are just as important as development and deployment. Any change to the data, or to the question being answered, will affect the model, usually to the detriment of performance, and in the long run can render it invalid.

All of this creates a series of unique considerations when you’re deploying machine learned systems:

  1. Machine learning systems are not straightforward software systems
    • In a traditional software development process the results are mostly deterministic.
    • Machine learning is all about letting the learning algorithm find a path between input and output. There are lots of potential paths, so you can’t guarantee exactly how the question will be answered.
  2. Who maintains the machine learning model?
  3. How do you integrate the software development stack and the machine learning stack?
Software development stack | Machine learning stack
User Interface | User Interface
Server | Data processing
Data communication | Data extraction
Raw Data | Raw Data

Data keep changing. Monitoring is key, so that as much as possible you are alerted to changes in the data being fed into your model. But even with the best alerts in the world, more subtle changes mean your model can get stale:

  • Data drift is when there’s a subtle change in the distribution of the data over time. In extreme cases it can lead to operational data coming from an area of the feature space that your model never saw in the training data.
  • Concept drift is when the concept associated with the data has evolved.
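For data drift on a single numeric feature, one common choice is a histogram comparison such as the population stability index (PSI). The sketch below is a plain-Python version under simplifying assumptions: equal-width bins taken from the training data, out-of-range values clamped into the edge bins, and a small floor to avoid taking the log of zero. It says nothing about concept drift, which needs true answers to detect.

```python
import math


def population_stability_index(reference, current, n_bins=10):
    """Compare two samples of one feature; 0 means identical histograms."""
    low, high = min(reference), max(reference)
    width = (high - low) / n_bins or 1.0  # guard against a constant feature

    def bin_fractions(values):
        counts = [0] * n_bins
        for value in values:
            # Clamp values outside the reference range into the edge bins.
            index = min(max(int((value - low) / width), 0), n_bins - 1)
            counts[index] += 1
        # Floor each fraction so empty bins do not produce log(0).
        return [max(count / len(values), 1e-6) for count in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


training_feature = [i / 100 for i in range(1000)]      # made-up training values
shifted_feature = [x + 3 for x in training_feature]    # made-up drifted values
psi_same = population_stability_index(training_feature, training_feature)
psi_shifted = population_stability_index(training_feature, shifted_feature)
```

A larger value means the production histogram has moved further from the training one; where to set the alert threshold is a judgment call for your own data.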

Yet another unavoidable source of change is your organization itself. Some organizational changes affect priorities, and shifts in priorities will affect decisions that are data driven.

Logging and Monitoring Your Models

When you’re building your model, you’re necessarily using only a small portion of all the possible relevant data. But when those models are deployed, they enter a world that’s chaotic. Often the data used to build the model is different from real-world data. You have to check the model’s performance in the real-life production environment. One of the ways to track performance is by logging and monitoring.

Logging is about storing data for later analysis. It means keeping a record of the data seen and generated by your model: storing the inputs and outputs. This helps in debugging. It also helps you identify the source of changes. Logging at the most granular level may also help you understand and interpret why some specific decisions were made.
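As a sketch, each prediction can be written as one JSON line with Python's standard `logging` module. The logger name, the field names, and the `log_prediction` helper are assumptions; where the lines end up (file, log service) depends on how you configure handlers.

```python
import json
import logging
import time

logger = logging.getLogger("model_audit")  # hypothetical logger name


def log_prediction(model_version, features, prediction):
    """Record one model call (inputs and output) as a JSON line."""
    record = {
        "timestamp": time.time(),
        "model_version": model_version,
        "inputs": features,
        "output": prediction,
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)  # goes wherever the configured handlers send it
    return line


logged_line = log_prediction("v1", {"age": 34, "income": 52000}, 0.87)
```

Keeping the model version in every record is what lets you later separate changes caused by a new model from changes caused by new data.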

Monitoring is setting up alerts. It helps you know how well the model is performing. It can identify issues related to adversarial or problematic interactions or physical limitations. It should reflect changes made to the system at:

  1. the strategic level – business decisions, or at
  2. the data level – changes in the skewness of the data

There are three main types of metrics that you would want to track between model training and real-world use:

Model metrics
  • Accuracy of model
  • Distribution of the input to model
  • Distribution of the output from model
User assistant metrics
  • How users / systems are responding to the predictions from model
Business metrics
  • Likely to care about how much money it’s generating

You can also think of metrics in terms of short, medium, and long-term. Remember it’s never a good idea to optimize for a single metric or to emphasize short-term over long-term metrics. Try to optimize for a set of metrics.

Model Testing

Model testing is a set of procedures to confirm that the model behaves as expected and to report its performance. To test which models are best on live data, we’re going to look at two different approaches: A/B Testing and Multi-armed Bandit Testing.

In A/B Testing:

  1. choose the best model given your learning data: model_a
  2. have another candidate: model_b
  3. define multiple metrics to compare the two models
  4. do a proportional (say 80% – 20%) split on the incoming operational data
  5. deploy both models such that model_a gets 80% of data, and model_b gets 20%
  6. check the performance of both models, using the metrics
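Steps 4 and 5 need a way to decide which model sees which request. One possible sketch, assuming each request carries some stable ID, is to hash the ID into a number between 0 and 1 so the same user always lands on the same model. The function name `assign_model` and the `user-N` IDs are made up.

```python
import hashlib


def assign_model(request_id, share_a=0.8):
    """Deterministically map a request ID to 'model_a' or 'model_b'."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    return "model_a" if bucket < share_a else "model_b"


# Simulated traffic: about 80% of requests should reach model_a.
assignments = [assign_model(f"user-{i}") for i in range(10_000)]
share_seen_by_a = assignments.count("model_a") / len(assignments)
```

Hashing rather than drawing a random number per request keeps a given user's experience consistent for the duration of the test.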

You could do sequential A/B Testing for multiple models, with the winner of each comparison running against the next in line. You could also concurrently compare more than two models, splitting the operational data among all of them.

Similar to A/B testing, there’s an approach called A/A testing, which deals with only one model. In this case, you divide the operational data into two halves and compare the performance of your model on each half separately. This can help you infer whether your model is consistent across multiple data sets. Continuous A/A testing is a good way to catch problems with incoming data or in the data collection process.
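A bare-bones A/A check could look like the sketch below. It assumes you already have one correctness flag (1 or 0) per prediction, splits them into two random halves, and reports the accuracy on each. The helper name `aa_test` and the sample numbers are invented.

```python
import random


def aa_test(outcomes, seed=0):
    """Split one model's outcomes into two random halves; return both accuracies."""
    shuffled = outcomes[:]
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    first, second = shuffled[:half], shuffled[half:]
    return sum(first) / len(first), sum(second) / len(second)


# Made-up outcomes: the model was right on 900 of 1000 predictions.
accuracy_first, accuracy_second = aa_test([1] * 900 + [0] * 100)
gap = abs(accuracy_first - accuracy_second)
```

With a random split the gap should stay small. A large or growing gap, for example when splitting by data source instead of at random, hints at a problem upstream of the model.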

The second approach to model testing is Multi-armed Bandit Testing, which is particularly useful when we want the selection of models to be adaptive. In the model testing context, the arms are our different models, and we are selecting the best model. How frequently do we explore versus exploit? We can estimate which model is best, but the confidence in the accuracy of that estimate changes as we collect more data.
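Epsilon-greedy is one of the simplest ways to balance exploring and exploiting, and it makes a reasonable sketch here. Everything below is simulated: `true_success_rates` stands in for how often each model's prediction turns out well, which you would not know in practice, and the 10% exploration rate is an arbitrary choice.

```python
import random


def epsilon_greedy(true_success_rates, n_rounds=5000, epsilon=0.1, seed=0):
    """Simulate routing requests among models; return how often each was used."""
    rng = random.Random(seed)
    n_arms = len(true_success_rates)
    pulls = [0] * n_arms
    successes = [0] * n_arms
    for _ in range(n_rounds):
        if rng.random() < epsilon or 0 in pulls:
            # Explore: try a random model (also used until all have been tried).
            arm = rng.randrange(n_arms)
        else:
            # Exploit: use the model with the best observed success rate.
            arm = max(range(n_arms), key=lambda a: successes[a] / pulls[a])
        pulls[arm] += 1
        # Simulated feedback: was this prediction a good one?
        successes[arm] += rng.random() < true_success_rates[arm]
    return pulls


# Three hypothetical models; the third is clearly the best.
pulls_per_model = epsilon_greedy([0.5, 0.6, 0.9])
```

Unlike a fixed 80/20 split, the traffic share here shifts toward whichever model is doing best so far, while the exploration rate keeps the others from being written off too early.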

Both methods are common in production environments. Choose the type of testing that suits the nature of your data and your tolerance for running multiple models.

For more on Post-Deployment of Machine Learning Models, please refer to the wonderful course here

I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at
