# Index

Trusting is Good Not Trusting is Better
Monitoring the Behavior
Monitoring the Usage
Conclusions

# Trusting is Good Not Trusting is Better

Application monitoring is a key part of running software in production.

Without it, the only way of finding an issue is through sheer luck or because a client has reported it.

Both of which are less than ideal, to say the least!

You wouldn't deploy an application without monitoring, so why do it for Machine Learning models?

So let's start!

The first resource you should go through is the following:

How to Monitor Machine Learning Models

It's a very detailed and comprehensive blog post and it addresses these topics:

What makes ML systems monitoring hard?
How can we monitor the usage and behavior of the model?
What are the key metrics to track and alert on?
What are the key principles for monitoring the ML system?

Once you are done with the blog post, you can refer to the related chapter of the book Machine Learning Engineering for a more detailed guide (buying it is strongly recommended, but you can also read it for free online).

Chapter 8 - Model Serving and Monitoring

# Monitoring the Behavior

When monitoring the behavior of the ML model in production, you should consider many aspects:

Setting a baseline: Establish a baseline model before you start monitoring or measuring. If you are monitoring for the first time, the current model is your baseline. Once the baseline is established, keep it static and make all comparisons against it, which lets you ask: "How has the system been behaving since [important milestone/change]?"

See:
- What does “baseline” mean in the context of machine learning?
- How To Get Baseline Results And Why They Matter
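
As a rough illustration of keeping a static baseline and comparing later runs against it, here is a minimal sketch. The `Baseline` schema, the function names, and the toy numbers are assumptions made for this example; they do not come from the linked resources.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Baseline:
    """Reference metrics frozen at a milestone (hypothetical schema)."""
    label: str
    accuracy: float
    mean_confidence: float


def capture_baseline(label, y_true, y_pred, confidences):
    """Summarise one evaluation run so that later runs can be compared to it."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return Baseline(
        label=label,
        accuracy=correct / len(y_true),
        mean_confidence=mean(confidences),
    )


def compare_to_baseline(baseline, current):
    """Return how far each metric has moved away from the static baseline."""
    return {
        "since": baseline.label,
        "accuracy_delta": current.accuracy - baseline.accuracy,
        "mean_confidence_delta": current.mean_confidence - baseline.mean_confidence,
    }


# Toy data: the baseline is captured once, then kept unchanged.
baseline = capture_baseline(
    "v1 release", [1, 0, 1, 1], [1, 0, 1, 0], [0.9, 0.8, 0.95, 0.6]
)
this_week = capture_baseline(
    "this week", [1, 0, 1, 1], [1, 1, 0, 0], [0.7, 0.55, 0.6, 0.5]
)
report = compare_to_baseline(baseline, this_week)
print(report)
```

Because the baseline object is frozen, every later report answers the same question: how the system has behaved since that milestone.
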
Data in:
- Monitor whether the data you're processing looks like the data you trained on (the data drift challenge). For example, use simple, comparatively interpretable distributional models to track whether the data looks "sufficiently" similar.
See:
- Calculate Similarity — the most relevant Metrics in a Nutshell
- How to measure similarity or dissimilarity between two data set?
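
One simple and fairly interpretable way to check whether live data looks like the training data is to compare binned histograms of a single feature. The sketch below uses the Population Stability Index (PSI), which is one common choice and not a method prescribed by the resources above. The bin count, the `eps` smoothing value, and the synthetic data are assumptions.

```python
import math
import random


def histogram(values, edges):
    """Fraction of values per bin; values outside the edges fall in the end bins."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    return [c / len(values) for c in counts]


def population_stability_index(reference, live, n_bins=10, eps=1e-6):
    """PSI = sum over bins of (live - ref) * ln(live / ref); 0 means identical."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]  # bins come from training data
    ref_frac = histogram(reference, edges)
    live_frac = histogram(live, edges)
    psi = 0.0
    for r, l in zip(ref_frac, live_frac):
        r, l = max(r, eps), max(l, eps)  # avoid log(0) on empty bins
        psi += (l - r) * math.log(l / r)
    return psi


random.seed(0)
training = [random.gauss(0.0, 1.0) for _ in range(5000)]
live_similar = [random.gauss(0.0, 1.0) for _ in range(5000)]
live_shifted = [random.gauss(1.5, 1.0) for _ in range(5000)]

psi_similar = population_stability_index(training, live_similar)
psi_shifted = population_stability_index(training, live_shifted)
print(f"similar: {psi_similar:.4f}  shifted: {psi_shifted:.4f}")
```

What counts as "sufficiently" similar is a per-feature judgment call, so any alert threshold on the PSI value should be tuned on your own data.
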
Addressing data drift over time:
- Is your data distribution non-stationary? That is, do you expect your model to degrade because the data changes over time?
- If so, you can run anomaly detection on the stream and track the fraction of anomalous data points over time.
- You can also check whether the distribution of the live or evaluation data matches that of the training set (or even the held-out test/evaluation set), for example with the Kolmogorov-Smirnov test.
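
To make the Kolmogorov-Smirnov check concrete, here is a minimal, dependency-free sketch of the two-sample statistic for one numeric feature. In practice you would more likely call an existing implementation such as `scipy.stats.ks_2samp`. The critical value uses the standard large-sample approximation, and the sample sizes, `alpha`, and synthetic data are assumptions.

```python
import bisect
import math
import random


def ks_statistic(sample_a, sample_b):
    """Largest vertical gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:  # the maximum gap is always reached at an observed point
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_critical_value(n, m, alpha=0.05):
    """Large-sample approximation: c(alpha) * sqrt((n + m) / (n * m))."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))


random.seed(42)
training = [random.gauss(0.0, 1.0) for _ in range(1000)]
live = [random.gauss(0.5, 1.0) for _ in range(1000)]  # mean shifted on purpose

d = ks_statistic(training, live)
threshold = ks_critical_value(len(training), len(live))
drift_suspected = d > threshold
print(f"D={d:.3f} threshold={threshold:.3f} drift_suspected={drift_suspected}")
```

With large samples even tiny, harmless differences become statistically significant, so it is worth tracking the statistic `d` over time rather than relying only on the pass/fail result.
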
Runtime Performance: When running inference, consider the specific requirements of the application at hand. Some applications need fast inference; others benefit more from high accuracy, in which case you can average the predictions of several models (ensemble methods) at the cost of computation speed.

Consider both platform and model-specific performance:
- Platform performance
  - Hardware specific
  - Environment-specific (OS or software installation, or configuration, Cloud provider)
- Model-specific performance
  - Input data specific
  - Model algorithm-type specific (a model built with scikit-learn versus PyTorch versus TensorFlow)
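
The speed versus accuracy trade-off is easier to reason about with numbers. The sketch below times any `predict` callable and reports latency percentiles. The two toy predictors (`single_model` and `ensemble`) are stand-ins invented for this example, and the measured values depend entirely on the hardware and environment you run on.

```python
import statistics
import time


def measure_latency(predict, inputs, warmup=3):
    """Return per-call latencies in milliseconds for `predict` over `inputs`."""
    for x in inputs[:warmup]:
        predict(x)  # warm up caches and lazy initialisation before timing
    latencies_ms = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def summarize(latencies_ms):
    """Median and tail latencies; tails matter more than the mean for serving."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(latencies_ms),
    }


def single_model(x):
    """Hypothetical fast model."""
    return 0.5 * x


def ensemble(x):
    """Hypothetical ensemble: averages three members, so it does more work."""
    members = [lambda v: 0.4 * v, lambda v: 0.5 * v, lambda v: 0.6 * v]
    return sum(m(x) for m in members) / len(members)


inputs = [float(i) for i in range(200)]
for name, predict in [("single", single_model), ("ensemble", ensemble)]:
    print(name, summarize(measure_latency(predict, inputs)))
```

Running the same measurement on each target platform separates platform effects (hardware, OS, cloud provider) from model-specific ones (input size, algorithm type).
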
Data out:
- Distribution of predictions by label. If you see this shifting a lot, it is another flag that the inference data is shifting, which could be a point of concern.
- Distribution of predictions by confidence/probability. There is a pretty rich (and ongoing) field of research here but, as a starter, I'd also expect the distribution of raw outputs (logits or probabilities) to look similar to what you saw at evaluation time.
- As a corollary, look at the overall change in the confidence of the outputs. For example, a drop in confidence values is possibly a concerning sign. (Of course, vanilla NNs are also prone to being highly overconfident on new/OOD data, so you need to be cautious; if this is really critical, you could look at NN variants that produce better confidence estimates.)
- Consider whether training a model with an out-of-domain (OOD) detector makes sense or not.
- Such changes are also a sign that your latest test/evaluation dataset no longer reflects production data. You should update the test/evaluation datasets from production data and retrain the model, either fully or incrementally.
See:
- How to determine the confidence of a neural network prediction?
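
The first and third checks above (label distribution and overall confidence) can be sketched as follows. Total variation distance is used here as one possible way to compare label distributions, and the two alert thresholds are arbitrary assumptions that you would need to tune for your own model.

```python
from collections import Counter
from statistics import mean


def label_distribution(labels):
    """Fraction of predictions per predicted label."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


def total_variation_distance(p, q):
    """Half the sum of absolute differences; 0 = identical, 1 = disjoint."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)


def output_drift_report(ref_labels, ref_conf, live_labels, live_conf,
                        max_tvd=0.1, max_conf_drop=0.05):
    """Compare live outputs with reference outputs (thresholds are assumptions)."""
    tvd = total_variation_distance(
        label_distribution(ref_labels), label_distribution(live_labels)
    )
    conf_drop = mean(ref_conf) - mean(live_conf)
    return {
        "label_tvd": tvd,
        "confidence_drop": conf_drop,
        "alert": tvd > max_tvd or conf_drop > max_conf_drop,
    }


# Toy data: live predictions lean towards "cat" and are less confident.
ref_labels = ["cat"] * 50 + ["dog"] * 50
live_labels = ["cat"] * 80 + ["dog"] * 20
report = output_drift_report(ref_labels, [0.9] * 100, live_labels, [0.7] * 100)
print(report)
```

Keep in mind the overconfidence caveat above: a stable confidence level does not prove that nothing has changed, so this check complements the input checks rather than replacing them.
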
Human evaluation:
- Human evaluation is not really avoidable, but you can use many of the checks above to narrow down what people look at. Have people review the examples that are most alarming according to the checks you have in place. Over time, you can better calibrate which alerts are of concern.
- A related option is to use a calibrated model of confidence/estimated error rate to decide what people should look at. If you pick the 1% of predictions with the lowest confidence and the internally estimated error rate is 5%, then seeing an error rate above 5% when you explore that 1% could be a concern.
See:
- Comparing Automatic and Human Evaluation of Local Explanations for Text Classification
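
The confidence-based selection described above could look like the sketch below. The record fields (`id`, `confidence`, `human_says_correct`) and the toy data are hypothetical, and the 1% and 5% figures are simply the ones used in the example above.

```python
def select_for_review(predictions, fraction=0.01):
    """Return the lowest-confidence `fraction` of predictions (at least one)."""
    k = max(1, round(len(predictions) * fraction))
    return sorted(predictions, key=lambda p: p["confidence"])[:k]


def review_outcome(reviewed, expected_error_rate=0.05):
    """Compare the error rate found by reviewers with the estimated one."""
    errors = sum(1 for r in reviewed if not r["human_says_correct"])
    observed = errors / len(reviewed)
    return {
        "observed_error_rate": observed,
        "concerning": observed > expected_error_rate,
    }


# Toy batch of 200 predictions with increasing confidence.
predictions = [{"id": i, "confidence": 0.5 + i / 400} for i in range(200)]
to_review = select_for_review(predictions, fraction=0.01)

# Pretend the reviewers found that one of the selected items was wrong.
reviewed = [dict(p, human_says_correct=(p["id"] != 0)) for p in to_review]
outcome = review_outcome(reviewed, expected_error_rate=0.05)
print([p["id"] for p in to_review], outcome)
```

Recording the outcome of every review round also gives you the data needed to recalibrate which alerts deserve human attention.
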

# Monitoring the Usage

Monitoring the behavior of the model can be technically hard, but you should also make sure that your users are using the model in the correct way.

By "users" we mean anything that consumes the output of the ML model: a system, a human, or an ensemble of systems and humans.

TIP

If you're serving your model through an API (the recommended way), you can refer to general API monitoring best practices (not specific to Machine Learning).


The major issue you can encounter when dealing with people is that they choose to not use the ML model.

This can happen for a variety of reasons: maybe they don't have enough confidence in the system, or they don't understand how to use it.

Take a look at this awesome guide from Google's engineers:

The People + AI Guidebook was written to help user experience (UX) professionals and product managers follow a human-centered approach to AI.

People + AI Guidebook

This detailed resource can get you started on the following topics:

User Needs + Defining Success
Data Collection + Evaluation
Mental Models
Explainability + Trust
Feedback + Control
Errors + Graceful Failure

General Tips

None of the above techniques is a silver bullet
Use only those things that work for you and are applicable to your use cases
Don't follow any of these ideas literally; try them out and see how they work for you

# Conclusions

After walking through the resources listed here, you should be comfortable with the challenges and caveats of monitoring your Machine Learning model in production.

As you've seen, there are both technical challenges (data drift, input data check, output data check) and "human-related" challenges. In particular, Google's People + AI Guidebook will show you most of the human-related ones.

In the next section, Now Go Build, we'll give you a list of tips, best practices, and suggestions on how to put into practice everything you've learned in the Purgatorio!
