How to detect drift and resolve issues in your Machine Learning models?
Summary
TLDR: This detailed presentation dives into the critical concept of data drift, also known as covariate shift, exploring its impact on machine learning models post-deployment. The speaker begins with a theoretical overview, followed by a discussion of the implications of data drift on model performance. Emphasizing practical applications, the talk introduces algorithms for detecting drift in data and concludes with a hands-on Jupyter notebook demonstration. This demonstration, accessible on GitHub, showcases analysis of a real dataset to identify and understand drift occurrences, providing invaluable insights into maintaining model accuracy and reliability in production environments.
Takeaways
- 📊 Detecting data drift, also known as covariate shift, is crucial for understanding model failures in production environments.
- 📝 The webinar covers a theoretical introduction, followed by practical demonstrations on identifying and addressing data drift.
- 🤖 Algorithms for detecting variation in data are discussed, showcasing methods to identify when and where drift occurs.
- 🛠️ A practical deep dive using a Jupyter notebook illustrates the process of detecting data drift in a real dataset.
- 💳 The necessity of monitoring models is highlighted by scenarios with delayed ground truth, such as predicting mortgage defaults.
- 📈 The concept of covariate shift is defined as changes in the joint distribution of model inputs, which can significantly impact model performance.
- ⚡ Univariate and multivariate detection methods are explored for identifying drift, with the strengths and weaknesses of each approach discussed.
- 🔍 The webinar emphasizes ML model monitoring, outlining steps from performance monitoring to issue resolution to protect business impact.
- 📱 Practical tips on using univariate and multivariate detection techniques provide insights into effectively identifying and analyzing data drift.
- 📚 Recommendations for further learning and engagement, such as accessing the webinar recording and exploring open-source libraries, are provided.
Q & A
What is data drift and how does it affect model performance?
-Data drift, also known as covariate shift, occurs when the distribution of model input data changes significantly after the model has been deployed. It can negatively impact the model's performance by making its predictions less accurate.
How can machine learning practitioners detect data drift?
-Practitioners can detect data drift by using specific algorithms designed to monitor and analyze changes in data distribution. These algorithms can assess whether there's a significant shift in the features' distribution or the relationship between features.
What is the importance of monitoring machine learning models in production?
-Monitoring is crucial for maintaining the business impact of the model, reducing its risk, and ensuring its predictions remain reliable over time. It helps identify when the model's performance degrades due to data drift or other factors.
Can you give an example of a proxy target used by banks to predict mortgage defaults?
-Banks might use a proxy target such as whether a person is delayed by six months or more in mortgage repayments after two years from the loan's start as an indicator of defaulting, rather than waiting the entire mortgage duration.
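As a rough illustration, the proxy-target rule described above could be computed like this (the table and column names are hypothetical, invented for this sketch, not taken from the webinar):

```python
import pandas as pd

# Hypothetical loan repayment records; column names are illustrative only.
loans = pd.DataFrame({
    "loan_id": [1, 2, 3],
    "months_since_start": [24, 24, 24],
    "months_in_arrears": [0, 7, 6],  # months behind on repayments
})

# Proxy target: flag a loan as "defaulting" if, two years after the loan
# started, the borrower is six or more months behind on repayments.
loans["default_proxy"] = (
    (loans["months_since_start"] >= 24) & (loans["months_in_arrears"] >= 6)
)
print(loans["default_proxy"].tolist())  # [False, True, True]
```

The point of the rule is that labels become available after two years rather than after the full 20-30 year mortgage term.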
Why is it challenging to directly evaluate the quality of predictions for certain models?
-For models predicting outcomes over long periods, like mortgage defaults, direct evaluation is impractical because it requires waiting years for the actual outcomes. Hence, proxy targets and performance estimation techniques become necessary.
What are the two main reasons machine learning models fail after deployment?
-The two main reasons are data drift (covariate shift) and changes in the relationship between model inputs and targets. Both can lead to significant drops in performance if the model cannot adapt to these changes.
What is univariate drift detection and its limitations?
-Univariate drift detection involves assessing changes in the distribution of individual features. Its limitations include the inability to detect changes in the correlations between features, and a tendency toward high false-positive alert rates when many features are monitored.
How does the Jensen-Shannon distance help in drift detection?
-The Jensen-Shannon distance is recommended for drift detection because it is robust against outliers and good at detecting significant shifts in data, though it can be overly sensitive to small drifts.
Why is multivariate drift detection important and what are its challenges?
-Multivariate drift detection is crucial for capturing changes in the relationships between features or across the entire data set, which univariate detection misses. However, it requires at least two features to work and can be less interpretable.
What practical steps were followed in the Jupyter notebook tutorial for detecting data drift?
-The tutorial involved training a simple model, simulating its deployment, and then using specific algorithms to analyze data drift in production data, highlighting how to identify and address issues affecting model performance.
Outlines
🔍 Introduction to Data Drift Detection in Machine Learning Models
This segment introduces the concept of data drift and its impact on machine learning models post-deployment. The speaker outlines the webinar structure, starting with theoretical aspects, moving to the definition and risks associated with data drift and covariate shift, and concluding with practical detection methods. The importance of understanding data drift is emphasized with a real-world example of predicting mortgage defaults in banking. The speaker explains the challenges in evaluating model performance over time due to delayed outcomes, underlining the necessity of estimating model performance indirectly. The initial steps toward ML model monitoring and the detection of covariate shift, a common reason for model failure, are discussed.
📊 Deep Dive into Root Cause Analysis and Univariate Drift Detection
This paragraph transitions to the importance of root cause analysis following the identification of performance issues in deployed models. The focus is on understanding data drift, its causes, and its effects on model performance. The discussion then shifts to univariate drift detection methods, highlighting their ability to capture distribution changes in individual features while acknowledging their limitations, such as the inability to detect changes in feature correlations and the high rate of false positives. Various methods available for univariate drift detection are briefly mentioned, with a promise of more detailed exploration in future webinars.
🤖 Exploring Multivariate Drift Detection Techniques
The narrative progresses to multivariate drift detection, addressing its ability to capture linear relationships and distribution changes among multiple features. The limitations of multivariate methods are acknowledged, including their reliance on at least two features and reduced interpretability. The procedure for applying multivariate detection using PCA (Principal Component Analysis) is outlined, emphasizing the significance of comparing reference and analysis data sets to identify structural changes in the data. The summary highlights how multivariate detection complements univariate methods by providing a broader analysis of data drift.
📝 Practical Application and Analysis Using a Jupyter Notebook
The speaker provides a hands-on demonstration of detecting data drift using a Jupyter notebook, walking through the process of training a simple machine learning model and identifying drift in a simulated production data set. The tutorial covers the initial model training, the preparation of data for drift detection, and the utilization of NannyML, an open-source library, for both univariate and multivariate drift analysis. This practical approach illustrates how to identify significant drifts and their impact on model performance, leading to actionable insights for model improvement.
🚀 Concluding Remarks and Q&A Session
The webinar concludes with a Q&A session, where the speaker addresses questions about the robustness of the Jensen-Shannon distance metric to outliers, the potential for using deep autoencoders for more nuanced drift detection, and the application of univariate drift tests to time series data. The speaker emphasizes the flexibility and future roadmap of NannyML for incorporating advanced features like variational autoencoders. The audience is encouraged to contribute to the open-source project and reminded of upcoming webinars for further learning.
Mindmap
Keywords
💡Data Drift
💡Covariate Shift
💡ML Monitoring
💡Root Cause Analysis
💡Univariate Detection
💡Multivariate Detection
💡Jensen-Shannon Distance
💡PCA (Principal Component Analysis)
💡Reconstruction Error
💡Model Degradation
Highlights
Introduction to detecting data drift, covariate shift, and their impacts on model performance after deployment.
Explaining the importance of mortgage default prediction for banks using proxy targets.
The challenge of evaluating model predictions quality due to the long wait for actual outcomes.
Overview of ML monitoring flow: maintaining business impact, reducing model risk, and increasing model visibility.
Defining covariate shift as a change in the joint model input distribution and its impact on model performance.
Introduction to univariate drift detection algorithms and their utility in pinpointing data drift.
Discussing the Jensen-Shannon distance method for detecting significant shifts in data.
Practical demonstration of drift detection using a Jupyter notebook and a real dataset.
The significance of handling year and month data correctly to avoid model performance issues.
The importance of preprocessing time series data for drift detection to get meaningful results.
Recommendations for using PCA in multivariate drift detection for robust and stable results.
The potential future inclusion of deep autoencoders in multivariate drift detection for capturing nonlinear relationships.
The role of domain knowledge in interpreting multivariate drift detection results.
The importance of monitoring and understanding data drift to ensure the sustained performance of ML models in production.
Invitation to contribute to the open source library for ML monitoring and drift detection.
Transcripts
Today we're going to talk about how to detect data drift, also known as covariate shift, and how we can use that knowledge to figure out what has gone wrong with our models after they have been deployed to production. I'm going to have a theoretical part first, then go over what data drift and covariate shift are and how they can impact performance, then we'll go over the algorithms that you can use within NannyML to detect data drift, and then we'll finish with a practical deep dive in a Jupyter notebook that I prepared, which you can also access on GitHub, going through a real data set where we'll see that there is some drift, and we'll try to spot how it happened and where it is.
So let's get started with setting the stage.
To give you a perspective on why it's important, imagine that you work at a bank and you're trying to predict mortgage defaults. To develop models that can predict whether somebody is going to default on a loan or not, you take their credit scores and their customer information, and hopefully the model you build actually predicts loan defaults reasonably well. To give you a bit of information about how it works in practice: normally you would want to wait the entire duration of the mortgage, say 20 or 30 years, to get your targets, but that is not really practical. For the vast majority of customers you can already know whether they're going to default or not, and in reality most banks use some kind of proxy target for that. As an example, we can say that if a person is delayed six months or more in the repayment of the mortgage, measured two years after the start of the loan, then this person is practically defaulting, so that can be the target. Of course it means that we still need to wait two years after we've deployed our model and made the prediction to really see whether the prediction was correct or not, which means we cannot easily evaluate the quality of the predictions, and that means we need to somehow estimate that performance. I already gave a webinar last week about that, so if you want to learn more then ping me after the webinar and I'll share the link to the recording. For now we're going to assume that we actually have access to targets, that somehow something has gone wrong, and that we need to figure out what went wrong, and we'll learn the steps. So the first step is going over
the ML monitoring flow, something I've already partly covered: what should we do first, what second, and so on, and how do we go from starting monitoring to resolving any issues and making sure that we can actually protect the business impact of our models. Then I'm going to define covariate shift, which is one of the two main reasons why machine learning models can fail, and the easiest one to spot. The vast majority of the webinar is going to be about delving deeper into the algorithms we have at our disposal to figure out where the data drift is, how strong it is, and whether it's potentially linked with the drop in performance. For that we have univariate detection, where we look at one feature at a time, and we have multivariate detection, where we look at a group of features or even the entire data set at once to try to figure out whether there is some significant drift in our data.
So let's get started with the first part, the monitoring flow, and I'll just quickly go over it. We have three goals of monitoring. First, we want to maintain the business impact of the model. This is kind of an obvious thing: you develop your machine learning models, they serve a purpose and hopefully drive business impact, and if the model is deteriorating in production, and the vast majority of them do deteriorate or degrade in production, we need to know what's going on and then take action to maintain the business impact of that model. The second goal is reducing the risk of the model: predictions are uncertain, and there is a certain risk that every machine learning model imparts on the organization. As long as the model is predicting reasonably well, in line with expectations, this risk is known; however, if the model degrades, this risk can really balloon out of proportion. So what we want to do with monitoring is to really know and quantify the risk of the model. The last goal, for you as a data scientist or as a data science team, is to increase the visibility of the model, to gain recognition and make sure that your work is well rewarded, by hopefully getting promotions, getting raises, or getting higher budget allocations for your work.
Now for the process: we start with performance monitoring, something that I covered in the previous webinar, where we want to make sure that we know the performance of our models at all times, whether ground truth is available or not. Then we go into root cause analysis: seemingly something has gone wrong, and only if something has gone wrong do we need to figure out what it was. Then we can go into issue resolution and try to resolve the problem. Today we're focusing mostly on the second part, the root cause analysis, and to actually do root cause analysis we'll have to look at data drift and what exactly changed in our data, and hopefully we can also find the actual causes of the drop in performance. So that is really the flow, and now we can start with the second part, which is covariate shift.
So let's start with defining what covariate shift is. We can quickly define covariate shift as a change in the joint model input distribution. Imagine that you have multiple features comprising your model inputs; if this joint distribution changes in any significant way, we can talk about covariate shift. To give you a simple example, imagine that we have just one feature, and that it is sampled from a population, so we have some kind of sampling function from the population to our sample. If that sampling function changes, we have covariate shift, and that means that not only will the model input distribution change, but potentially also the target distribution, because we might move from a region where, let's say, there are more positive class instances to a region where there are more negative class instances. In the example here we see that in total we have more negative class instances compared to before the shift, and that might also impact the performance of the model, if we move from regions where the model is supposed to perform well, maybe because it's very easy to separate the classes, to a region where the model is not going to perform so well, maybe because it is hard to separate the classes there, or maybe because the model did not have enough data to really learn the correct pattern. Of course the same applies to regression problems, but it's a bit easier to talk about and visualize binary classification problems, so let's stick with that.
Now, as I mentioned, we're talking about a joint probability distribution, so just to give you an example of why this joint part is important: there are kinds of covariate shift where, if you look at every single feature separately, you will not see any difference. Imagine that, due to some kind of error on the data preparation or data engineering side, or maybe somewhere upstream in your data pipelines, feature 1 and feature 2 get switched at some point. If you look at the distributions separately, they're going to look basically exactly the same, up to almost a rounding error, and the same holds after the switch. But of course the joint distribution is completely different, and the model is going to make terrible mistakes because it assumes that feature 1 is feature 2 and vice versa. So this is something that we absolutely need to be able to detect, and it's something that can actually let us identify the potential root cause of the performance drop.
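The swapped-features scenario is one instance of a more general failure mode: every marginal distribution can stay the same while the joint distribution changes. A minimal synthetic sketch of that pattern (all numbers invented for illustration, not from the webinar's data set):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Reference period: feature_2 moves with feature_1.
x_ref = rng.normal(0, 1, n)
y_ref = x_ref + rng.normal(0, 0.1, n)

# Analysis period: the relationship flips sign (e.g. a pipeline bug),
# but each feature's marginal distribution is unchanged.
x_ana = rng.normal(0, 1, n)
y_ana = -x_ana + rng.normal(0, 0.1, n)

# Marginals are statistically indistinguishable...
assert abs(y_ref.mean() - y_ana.mean()) < 0.05
assert abs(y_ref.std() - y_ana.std()) < 0.05

# ...but the joint structure has completely changed.
corr_ref = np.corrcoef(x_ref, y_ref)[0, 1]   # strongly positive
corr_ana = np.corrcoef(x_ana, y_ana)[0, 1]   # strongly negative
print(corr_ref, corr_ana)
```

Any per-feature histogram comparison passes here, while the correlation between the two features has flipped sign, which is exactly what multivariate detection is for.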
So now we know what the potential problems with machine learning models in production are: they deteriorate, one of the main causes is covariate shift, and we know what it looks like. Now let's talk about how to detect it. Let's assume that we have a model, there is a problem, and we need to figure out what exactly went wrong. The first avenue we can go down is univariate detection. What does it do? First of all, it captures the change in the distribution of a single feature, and because it looks at one feature at a time, within NannyML it will automatically loop over all features in a parallelized way, so you will see each feature separately and how the distributions of those features are changing. But there are of course a few things that it cannot do, and the most important one is that it cannot detect changes in the correlations between features, or really any change in the relationship between features, whether correlations or something a bit less linear. And imagine that you have a model with 70 or 100 features, which is pretty standard in models deployed to production: if you look at every single feature separately, it's quite likely that some of them will shift even though it doesn't impact performance or isn't the root cause of the problems we identified with our performance monitoring. That means univariate detection can suffer from high false-positive rates: if you have 100 features and you want to do your monitoring daily, you'll almost certainly get a lot of false positives, things that seem to be changing meaningfully but are not really a problem, so you will just drown in alerts. This is really one of the main reasons why univariate detection cannot be used as a standalone quality assurance tool: you need to do performance monitoring first, and instead use univariate detection to drill down on where the problem might be after you've identified issues using performance monitoring.
Now, we have six different methods for univariate detection in our open-source library, and at some point I will probably make another webinar going over all of them in depth and discussing the pros and cons, but that would take at least an hour, so this time I'm going to focus on just the one that we recommend and have as the default in our library, and that is the Jensen-Shannon distance. Why do we recommend it? First of all, it is a method that's quite good at detecting most significant shifts, so if you have a significant shift in your data or in a feature, it will be able to detect it most of the time, almost all the time. It is also quite robust against outliers, unlike some other methods, and one huge outlier or a few anomalies in the data will not necessarily trigger it, which already reduces the false-positive or false-alert rate. Of course no method is perfect, that's why we have six of them and not just this one, and one of the main issues with the Jensen-Shannon distance is that it is potentially a bit too sensitive to small drifts, so you're still going to get false alerts here and there. But that's fine as long as you use it only, or mainly, as a root cause analysis tool, a way to get a picture of what's going on, and you should be fine.
Now I'm going to go over this method in a bit more depth, talking about the inputs, the results, and the intuition of how this method actually works and what it does. Let's start with practical things: what do we need to have? The first thing is the features that we actually want to analyze, the ones we want to check for drift, so we just provide the column names for which we want to do our analysis. Then, in our univariate drift calculator, which is a high-level interface for any of the univariate detection methods, we specify which methods we want to use for drift detection for continuous features (you can use more than one, as a list), and the same for categorical features, because some methods work only for continuous data and some only for categorical data. One of the main strengths of Jensen-Shannon is that it actually works reasonably well for both categorical and continuous features, so we can use it for both.
Now let's move to how this thing actually works. Here I have a presentation that should build a bit of intuition about what's going on. We have here probability distribution functions; the reference one is, let's say, your test data, for which you know everything was fine and the distribution looked like it should look. We have a bimodal distribution here, comprised of two normal distributions, one centered around zero and another centered around five, and this is our reference distribution. Now let's say that we wait a few weeks or a few months and we detect that there is some issue with performance. Then we'll want to compare our reference distribution to, let's say, the week or the day for which we know there is a problem, and for that we look at our analysis distribution for a given feature. We see that it has significantly changed: there is one peak exactly at zero, and there is another peak, this time at ten. So maybe again some problem upstream in our data engineering pipelines, where something was not scaled, or scaled inversely, or just multiplied by two for some reason, and we see that there is a significant change. What the Jensen-Shannon distance actually gives us is an average distance between the distributions. We see here that the analysis distribution is completely flat where the reference distribution is quite high, so it takes that difference into account, and then it does the same thing in the other direction, between analysis and reference. In essence it's actually quite similar to the KL divergence, but the difference is that it doesn't only look from the perspective of reference versus analysis but also from the perspective of analysis versus reference, so it's kind of like a KL divergence computed from both sides. That is basically what the JS distance tells us.
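A sketch of this comparison using SciPy's `jensenshannon` (the bimodal example mirrors the talk's figure; the bin count and sample sizes are arbitrary choices for illustration):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(42)

# Reference: bimodal feature, peaks near 0 and 5 (as in the talk's example).
reference = np.concatenate([rng.normal(0, 1, 5000), rng.normal(5, 1, 5000)])
# Analysis: the second peak has shifted to 10 (e.g. an upstream scaling bug).
analysis = np.concatenate([rng.normal(0, 1, 5000), rng.normal(10, 1, 5000)])

# Bin both samples on a shared grid and compare the histograms.
bins = np.histogram_bin_edges(np.concatenate([reference, analysis]), bins=50)
p, _ = np.histogram(reference, bins=bins, density=True)
q, _ = np.histogram(analysis, bins=bins, density=True)

# jensenshannon returns the JS *distance* (square root of the divergence);
# with base=2 it is bounded between 0 and 1.
js = jensenshannon(p, q, base=2)
print(round(js, 3))  # substantial drift -> well above 0
```

Because only one of the two modes moved, the distance lands well inside the (0, 1) range rather than at either extreme; two identical distributions would give a value near 0.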
So now we have at least a cursory intuition of how the method works; let's look at the results. What we get is our distance metric: the Jensen-Shannon distance goes between 0 and 1, and we see here that it is quite low, so for this specific example everything is doing fine. We have our threshold, and if this threshold is exceeded we know that there is a significant increase in our drift. This threshold is generally defined automatically by NannyML, using the variance of the Jensen-Shannon distance, or any other distance, on our reference data set, let's say our test data set, and if there are any alerts, then of course we will also see the alerts.
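The variance-based thresholding idea can be sketched as follows (a simplified illustration of the concept; the 3-standard-deviations rule and the chunking scheme here are assumptions for the sketch, not necessarily the library's exact formula):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(7)

def js_to_reference(sample, reference, bins):
    # JS distance between a sample's histogram and the reference histogram.
    p, _ = np.histogram(reference, bins=bins, density=True)
    q, _ = np.histogram(sample, bins=bins, density=True)
    return jensenshannon(p, q, base=2)

reference = rng.normal(0, 1, 20_000)
bins = np.histogram_bin_edges(reference, bins=30)

# Metric behaviour on drift-free data: JS distance of each reference chunk,
# which reflects pure sampling noise.
chunks = np.array_split(reference, 20)
baseline = np.array([js_to_reference(c, reference, bins) for c in chunks])

# Variance-based threshold (assumed 3-sigma rule for illustration).
threshold = baseline.mean() + 3 * baseline.std()

drifted = rng.normal(1.5, 1, 1000)   # clearly shifted sample
print("alert!" if js_to_reference(drifted, reference, bins) > threshold else "ok")
```

The threshold adapts to how noisy the metric is on data known to be fine, so noisier features get a wider tolerance before alerting.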
And that's really an overview of what you can do with univariate drift detection methods and an example of one and how it works. Now let's go to multivariate detection.
So here, let's start again with what it actually does. The first thing it does is capture any linear change in the relationships between features. This is the answer to the problem of univariate detection, which couldn't detect any change in the correlations between features; the idea of multivariate detection is that it can capture that, so it can capture the change in my second example, where feature 1 and feature 2 switch places. It also captures changes in single-feature distributions, so it kind of does the same thing as univariate detection methods, but at a more global level. Of course there are some drawbacks, and here I want to mention two. One: it requires at least two features to work, because of the way the algorithm works, which I'll explain in a minute; we will be doing dimensionality reduction at some point, and to reduce dimensions we need at least two dimensions to reduce. That's not going to be a problem for almost any data set, because you always work with multiple features in machine learning, apart from maybe some kinds of time series analysis. Two: because we're looking at multiple features at once, it's not as easily interpretable. The good news is that we can select a subset of the features in our data to try to narrow down where the change in data structure has actually happened, so we still get a bit of interpretability, and at the end we can come back to univariate detection to see whether the change in data structure is due to a changing single-feature distribution or to correlations between features.
Again, let's go over the inputs. The first, and fortunately the only, thing here is that we need to provide the feature columns for our data reconstruction calculator, so we just provide the data for our features. We could also put in our model predictions, but that would mean we're no longer looking only at covariate shift but at the entire distribution of model inputs and predictions at the same time, which is kind of a different thing, so we recommend that you put in only your features for multivariate detection. The results are again quite similar to before: we have our reconstruction error, which is a measure of drift, and I will go into how we obtain it and walk you through the algorithm that we use step by step. Then again you see confidence bands, which show how confident we are that our drift metric is within a certain range; these are the ranges, and we have the threshold. The thresholds are again automatically tuned based on the behavior of the metric on the reference data set, and if there is drift in your data you will see these thresholds being exceeded over time, and then we have alerts, the small red diamonds that show that something has gone wrong.
Now let's start building the intuition of what's going on in this algorithm. It looks a bit complex, but actually the premise is quite simple. The main idea is that we're going to train, or learn, a compression part, so let's say that we're going to train something like an autoencoder, and this autoencoder is going to learn the actual structure of the data in order to minimize the reconstruction loss, the loss between the original data and the reconstructed data. The key thing here is that we rely on this compressor to correctly learn the structure of the data, and then we measure the loss of this autoencoder to gauge how strongly the structure of the data has changed. This is really the high-level intuition. So imagine here we have the compressing part of the autoencoder: we compress the data into a latent space and we decompress it, and then we compare the original data with the reconstructed data, like I do here, and for every place where they don't align we can compute the reconstruction error. So now that you have the general idea, that we're going to rely on this autoencoder to learn the structure of the data and then measure how good this autoencoder is on new data to see whether there is a significant change in the data structure, let's go into the actual step-by-step algorithm.
so how do we train it so first we'll
have to prepare the data I'm gonna walk
you through uh with the how we train the
algorithm in an email and we actually
don't use Auto encoders we use the
simplest possible thing which is the PCA
because we don't need to capture uh the
structured data fully we just need to
capture it well enough that if there is
change in the data the loss of the
encoder decoder changes so it doesn't
need to be low to start with so I'm
going to prepare the data to make sure
it can work with PCA we're going to
impute missing values we're going to
encode categorical features and at the
end we're going to scale the data then
we are going to train a fit a PCA on our
reference data so it could be the test
set it could be any part of your data
any period in the data for which you
know everything is fine and there is no
significant data drift and performance
is satisfactory
So what we're going to do is compress and decompress: we transform and inverse-transform this reference data using our trained PCA, and the trained PCA, again, was trained on the reference data. Then we're going to compute the distance between the original and the reconstructed points, the points that went through compression and decompression using the PCA, and we're just going to do the simplest thing and compute the Euclidean distance between every pair of points. Then we compute the so-called reconstruction error, which means we take all the Euclidean distances between those points and take the average, and with that we know the general loss, or the general reconstruction error, of our PCA on the reference data. It's not going to be zero, because it's lossy: we still need to reduce dimensionality, we're losing dimensions, which means we are most likely losing information. So we have some kind of reference reconstruction error.
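As a rough sketch of the training phase just described (an illustration with scikit-learn on made-up data, not the NannyML implementation itself): scale the reference data, fit a PCA, compress and decompress, and average the per-row Euclidean distances.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# toy "reference" data: three features, two of which are nearly redundant
latent = rng.normal(size=(500, 2))
reference = np.column_stack([
    latent[:, 0],
    latent[:, 1],
    latent[:, 0] + 0.1 * rng.normal(size=500),  # almost a copy of feature 0
])

# scale the data, then fit the PCA on the reference period only
scaler = StandardScaler().fit(reference)
ref_scaled = scaler.transform(reference)
pca = PCA(n_components=2).fit(ref_scaled)

# compress and decompress, then average the per-row Euclidean distances:
# this average is the reference reconstruction error
reconstructed = pca.inverse_transform(pca.transform(ref_scaled))
reference_error = np.linalg.norm(ref_scaled - reconstructed, axis=1).mean()
# non-zero, because a dimension (and some information) was dropped
```

The exact preprocessing (imputation, categorical encoding) is omitted here since the toy data is already numeric and complete.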
Then, when we go into drift detection mode, what we're going to do is look at our analysis data, the data for which we want to figure out whether there is drift, and we're going to compress and decompress this analysis data using the trained PCA. The important thing here is that we do not retrain the PCA: we use the PCA that we fitted on our reference data, and this PCA still "remembers", in quotes of course, the structure of our reference data. Then we do the same thing: we compute the distances between the original and reconstructed points and we compute the reconstruction error again, the average of those distances, and then we compare this reconstruction error with the reconstruction error we got from the reference data. In a moment I'll show how it actually looks in a notebook.
But the idea is that if this reconstruction error goes up, it means the model captured a data structure that is no longer suitable for compression and decompression, so the structure of the data has significantly changed in a way that the compression-decompression mechanism no longer works correctly. On the other hand, if the reconstruction error significantly goes down, it again means the structure of the data has changed, but in a way that the compression mechanism is better suited to than it was on our reference data. So that, again, means we have significant data drift, or covariate shift.
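The full detection step can be sketched the same way. In this toy example the analysis data keeps the same marginal distributions but breaks the correlation structure the PCA learned on reference, so the reconstruction error rises; again, this is illustrative, not the library's code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

def reconstruction_error(pca, scaler, data):
    """Average Euclidean distance between rows and their PCA reconstruction."""
    scaled = scaler.transform(data)
    restored = pca.inverse_transform(pca.transform(scaled))
    return np.linalg.norm(scaled - restored, axis=1).mean()

# reference period: the first two features are strongly correlated
x = rng.normal(size=1000)
reference = np.column_stack([x, x + 0.1 * rng.normal(size=1000),
                             rng.normal(size=1000)])
scaler = StandardScaler().fit(reference)
pca = PCA(n_components=2).fit(scaler.transform(reference))  # fit ONCE, on reference
ref_error = reconstruction_error(pca, scaler, reference)

# analysis period: same marginals, but the correlation is gone,
# so the structure the PCA "remembers" no longer fits
analysis = rng.normal(size=(1000, 3))
ana_error = reconstruction_error(pca, scaler, analysis)
assert ana_error > ref_error  # reconstruction error goes up under drift
```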
Now let's go to the tutorial. Just to walk you through again: if you go to the NannyML repository, go to the examples there, and then click on webinars, you will see the webinar we have today, "How to detect and resolve issues in your ML models"; just click on the drift detection notebook. I already have it open here, and you can easily follow along; you can also run it yourself.
So let's get started. I'll give it a minute or two for the people who actually want to follow live. Okay, let's cut those two minutes short and get going.
So first I'm going to walk you through the entire process, starting from training the model, just to train a simple model that actually runs on data and actually predicts something. Then we're going to simulate the deployment: we'll have a production data set that we have no access to, not during training, not during evaluation, not during testing; it just comes after, and we'll see what happens with it. We'll be able to see that there are some issues with its performance, and then we'll go into the main part of the tutorial, which is how to actually detect data drift. So let's start with importing NannyML and a few things to load the data, like pandas, and we load the data.
It is a data set that's available on OpenML, a very nice non-profit organization with a lot of very cool real-life data sets, so I highly recommend you give it a go. Then we're loading the data, and what we see here is that we have features about when certain things happen, and what we're actually trying to figure out is how many people shared a bike through a bike-sharing application during a given hour, and we're going to
just quickly pre-process the data here. So first we define our target, which is the count, and then we have to drop some features. First of all we have to drop the casual and registered features, because they are a bad target leak: they just add up to our count, so it wouldn't really make sense to build a model if these two features simply give us the answer. And because I'm really lazy, I'm also going to drop everything that is not a number already, which is weather and season, and additionally we drop the count itself, because that is just the target. So let's do that. Then we're going to split the data into our training, testing, and production sets. Because we're not going to optimize hyperparameters, there is no need for a validation set. I'm going to use a time-series split here, so we train on things that happened earlier, then we test on things that happen later, and the last part of the data set is reserved for production.
And now we simply train the model with default parameters, training a LightGBM regressor, the most standard thing you can come up with, on the train data, and we make predictions on training, testing, and production, simulating that we actually have a separate set we wouldn't see at all. Then we need to prepare the data for NannyML; this is where the NannyML part starts. First we need to define our reference data set, and as I already mentioned a few times, the perfect data set for that is the test set. So that is exactly what we're going to do: we take the test predictions, the test features, and the test ground truth, and define them as our reference data set. Then, for the analysis, we could analyze just the specific part of the data where something is going wrong, but since the data set is not that big, why not analyze everything to have a fuller view? So here, for the analysis,
we take our production data, the features and the predictions, assuming for the time being that we don't have easy access to ground truth. But right after we fit our performance calculator, let's just assume we do have access to it. We could also estimate our performance without access to ground truth, but just to show you that there really are issues with performance, that we're not just hallucinating issues but there are actual issues, we'll compute our performance using MSE and MAE, and we will actually see that there are issues with our model.
So we're going to add targets to our analysis data and then calculate the performance, taking the performance calculator and calculating the performance on the analysis data with targets. Let me just run it, and we can all see here that during testing everything was fine, so we felt quite confident deploying the model, even though the error kind of climbed up at the end. But probably, when we trained the model, we just looked at the total or average performance on our test data set; we didn't try to segment it in time or do other things like that, so the model looked fine. Once we deployed, though, we see that there are huge spikes in both MSE and MAE, so whichever metric we use, we see that there are some issues.
Now, what might have happened? Let's start with univariate detection. So again, I will just make sure here that our predictions have the right data type, so that they aren't treated as categorical and we get the right kind of plot. Then, for the columns we want to look at, I'm just going to take all the model features, all the columns the model has looked at, just to see what is going on step by step, feature by feature. And then, as I already mentioned, we take our column names as the column names and we compute our recommended method, which is the Jensen-Shannon distance.
We'll use our drift calculator to fit on reference, and the only thing fitting actually does is take the reference data and, for some methods, bin it, so you don't have to store the entire thing when you compare it with your analysis data. So if you want to fit your calculator on, let's say, a few terabytes or 200 terabytes of data, you can still fit the calculator, and then you'll be able to store just the histograms of the data, which makes storage easy, and then you can easily compare it with your analysis data, which will normally be much smaller.
So I will fit it, and then we'll calculate the results for the Jensen-Shannon distance on our analysis data, and we'll display the results by turning them into a data frame. One way of viewing results is to manually go over them as a data frame: you can take the data frame and do whatever you want with it, just as with a typical pandas data frame. What you see here is that we have our indices, and we split our data by index into our reference data and our analysis data, in other words test and production. Then we have a kind of funny multi-index: for every feature we have a potential list of methods, because you can select more than just one method, as you can see here, and we will see the threshold above which we consider that something has gone wrong, the value of the actual metric, potentially a lower threshold, since some methods may have a lower threshold as well, and whether we see it as something problematic, so whether the alert is true. You could just take this and plug it, for example, into your retraining pipeline: automatically retrain if you see that specific things are drifting too much, and see whether automatic retraining actually fixes the model. So there is some potential for nice automation here.
The other way of viewing the results is with plots, and the way we do it is we take the results and just call .plot. The thing we actually want to view is the drift, so we say that the kind here is "drift", and we show it just like we would show any normal matplotlib or plotly figure.
And again, exactly as we said before, there are a few weird things going on that hopefully help us build an understanding of why performance dropped. First of all, the year is always drifting: in every single chunk of our data, the year is drifting, which actually makes sense, because it's monotonically increasing, so it's never going to match the distribution of our entire test set. That already gives us some idea of the mistakes we might have made during the training and preparation of this model.
Then we see the same thing happens for month: it always drifts. But in that case it might make sense, because we're looking at a univariate distribution and the data lasts more than a year, yet for every period in our data, every chunk, we probably had only one or two months. So it's drifting, and it's fine, it's just cyclical data, but it's good to see that this is actually happening.
Then we have hour, which is equally distributed; there are no changes there. We have whether days are holidays or not, and we see that there are probably months where we see a lot of holidays, and it might actually mean that something goes wrong with our model when there are a lot of holidays, but we're not sure yet.
For weekday we see a very weird thing going on: the data for testing was probably ordered in some way, but for production the data is completely unordered, it's completely random, and it looks exactly like the data for the entire test set. But again, none of those actually exceeds our drift threshold, so we're fairly sure the distribution of our weekday feature has not changed significantly. It is something to look into when you develop your model, though; if you see something like that, you should probably think about whether it's actually something you're happy with. And for working day, no changes; it all looks very good and stays below the threshold.
For the temperature we see that there are significant changes, and especially a big jump here. So again we have cyclical data, and we're already building the intuition that we probably built our model to take cycles into account in an incorrect way, and something goes wrong when we're in different parts of the year-long cycles. Humidity is a similar story, and wind speed as well, so there is something going on with cycles: we've already managed to figure out that we probably didn't really think through how to treat temporal data when we developed this model.
So, temporal data. Let's continue our investigation: instead of looking only at univariate detection, let's do multivariate detection on all features to get a general view of the drift for the entire data set, and let's also try dropping the two things we already know are drifting at the univariate level and probably have issues, namely year and month. So I'm going to run that as well, and then we initialize our data reconstruction drift calculator, our multivariate drift calculator, passing in all the column names for the time being, so not the reduced set but the full list we defined earlier.
Then we fit it on reference, so we train our PCA model on the reference data and it learns what the data looks like, and then we calculate, running the analysis data through the trained PCA. What we see is a huge increase in the error, so there is definitely drift, and this drift is probably actually one of the causes of the dropping performance. But now let's drop year and month: let's do the same thing and fit it again without the year and the month.
And what happens? You see there is actually no drift left. What does that tell us? All the important drift, all the important covariate shift in the model inputs, is really captured by the month and year columns, and we shouldn't have trained the model on them. In particular, we should not have trained a gradient boosting model on the year, because maybe there is a trend in our application: maybe more and more people start sharing bikes using our application, which is something you would hope for as a business. But we did not take that into account, and our gradient boosting model just assumed that the higher the year, the more people on average are going to share bikes; it just took the count from the last data it saw and failed to extrapolate the trend further, because that's not something tree-based models can easily do. That drop in performance is due to us taking year and month into consideration here, so this is something we now have to think about as data scientists and try to redevelop the model in a way that doesn't use year and month in such a simplistic way.
And that's it: we know what happened, we know why it went wrong, and now we can start working on the resolution, which in this case would be full model redevelopment, because mere retraining won't help here; we know the model would fail later on, when we see the next year. And that is it, that is the end of the tutorial.
So, thanks for watching; we are slowly nearing the end of the webinar. We are still a young open source library, so I would very much appreciate it if you gave NannyML a try and gave us a star. Every star matters: stars give a bit more visibility to our library, which helps more people learn about NannyML, which we see as a good thing, because monitoring is important.
So that's that, and now, before we go to Q&A, just one more thing: we have yet another webinar next Thursday, and this time it's going to be my co-founder William, talking about what data you actually need to monitor models in production. It's going to be even more hands-on, and hopefully it will help you not only understand what's going on with monitoring but also easily get started with it in your job. And now let's get to the Q&A. I'm going to leave this up here, and you can just scan this QR code to be forwarded to a form, and we'll send you the recording of this webinar later on.
Thank you so much, Witek, for this awesome presentation; it was very informative. We have a few questions here, and just a reminder that you can drop your questions in the Q&A and we will answer all of them live. So the first question for you, Witek, is: what does it mean that the JS distance is robust to outliers? Is it that it does not detect outliers as drift?

I think exactly that, that is exactly what it means. If you have a few outliers in your data and they are not important to you from a business perspective or from a data science perspective, Jensen-Shannon is going to be robust to those and will not actually flag them as drift. On the other hand, if you know they are going to be important for your data, and even a few outliers can very strongly impact your business outcomes or your model metrics, you should use something else, such as the Wasserstein distance, also called the earth mover's distance, which is very sensitive to outliers. And that is one of the reasons we have multiple methods in our library: there is no method that really fits all use cases. We also have a quick summary of the strengths and weaknesses of our methods in our docs, so you can take a look there and pick the one that fits your use case best.
Awesome, thanks a lot. After the Q&A I will also share the link to the docs, so that you can access the documentation Witek just mentioned. The second question is: in multivariate drift detection, why not use a deep autoencoder's reconstruction error instead of the PCA reconstruction error, to also detect non-linear relationship changes?
So this is something that is on our roadmap and something we'll do sooner or later, and the reasons we didn't go with it for the time being are twofold. First, PCA is pretty robust, works out of the box, and always has very stable behavior, which is not true for variational autoencoders. If we go with plain autoencoders, there's no chance we're going to get anything meaningful, because even a very small change in the data distribution will result in a huge reconstruction error; variational autoencoders are significantly more stable, so it is possible to do it. But we need our algorithms to work on any kind of tabular data, no matter the size of the data or how it's distributed, and to do that we'd need quite advanced AutoML hidden inside NannyML, and it's much easier to make a PCA work with any kind of data distribution and any kind of tabular data than variational autoencoders. So we will have them, because, like you said, they are much better at detecting non-linear relationship changes; we don't have them now because we needed to start somewhere, and we started with something robust that we know will work reasonably well for everything. In the future we will most likely have variational autoencoders as part of our library, and of course if you're interested in that you are most welcome to contribute: feel free to join our Slack, feel free to contribute on GitHub, and if you want to go ahead and try to implement those, that would be great.
Awesome, cool answer, thanks a lot. Another question, well, not really a question but a comment, saying "very well explained how to use the PCA", so thanks a lot, Eileen, I hope I'm pronouncing your name correctly; we really appreciate the feedback. We also had a question that is a repetition, about what it means for the JS distance to be robust to outliers, and then another question: can you use the univariate drift tests for a time series?
Yes, if you detrend and decycle them first. So this is something, again, that is potentially on our roadmap: explicit support for time series. If your time series is already stationary, there are no trends and the cycles are removed, then you can use that kind of data, prepared in such a way that you don't have cycles and don't have trends, and you can put it directly into univariate detection and you will get reasonable results, as you saw here in the tutorial. If we don't do that, and we leave the trend in or we leave the cycles in, we'll get spurious results. So you have to do a bit of data pre-processing first, but it's absolutely possible.
Nice. Everything is possible, huh? Just kidding. Okay, we have one final question for you, Witek: how can you use multivariate drift detection to get interpretable results?
Yeah, so what I recommend is to first just run everything. You might want to run univariate detection first and see which features tend to behave weirdly, and then you can do the thing I did in the tutorial, which is to exclude those and only run multivariate detection on the things that don't change from the univariate perspective. If your multivariate drift detection doesn't detect anything there either, then you can be quite sure that the reason for your drop in performance is not there, because there are no changes in the relationships between features and there is no change in the actual distributions of the features. Then you are just left with the misbehaving features that you see at the univariate level, and what you should do then is use domain knowledge to try to bundle the ones where you know the correlations also matter. Maybe the correlation between age and income is something that very strongly influences the model, but the correlation between age and location doesn't matter that much; in that case, select only a few features, two, three, four features, where you know the correlation between them actually matters for the model. You can do that using things like SHAP and explainable AI, where you can look at combinations of features and their importance, then take the subset of features that are either important themselves or whose interactions are important, and run multivariate detection there. Basically, you should be able to narrow down whether univariate drift actually happened and it's just a change in distribution that influences performance, or whether you find a pair of features that don't drift by themselves but show drift when you put them into multivariate detection. Then you know there's an actual change in the correlation between those features, and that's how you get interpretable results that you would miss with univariate detection.
Awesome, very complete answer, thanks a lot. Great, then; I don't think we have any more questions to address today. So, Witek, thanks a lot for your presentation, I really appreciated it, and thanks a lot to everyone who joined us today. As Witek said, we are an open source library, so don't forget to star us on GitHub if you found this webinar useful; it really means a lot to have one more star there. And just a reminder that if you do want this recording, please scan this QR code and leave us your email, and we will send it by the end of this week. And if you'd like to join another webinar, you will have a chance.