Quantifying the Impact of Data Drift on Machine Learning Model Performance | Webinar
Summary
TLDR: This deep-dive presentation focuses on understanding and mitigating machine learning model failures caused by covariate shift and concept drift. It outlines the two main causes of model performance deterioration, with special emphasis on covariate shift, exploring how it can affect model outcomes both positively and negatively depending on the nature and direction of the shift. The presentation introduces two algorithms, Direct Loss Estimation (DLE) for regression and Confidence-Based Performance Estimation (CBPE) for classification, designed to quantify the impact of covariate shift on model performance. Through practical examples and detailed explanations, it shows how these algorithms can predict model failures and performance changes, enabling proactive adjustments before significant business impact occurs.
Takeaways
- 🙂 The presentation covers both theoretical and practical aspects of machine learning model performance, focusing on algorithms.
- 👇 Two main causes of machine learning model failure are discussed: covariate shift and concept drift, highlighting how they can lead to significant performance drops.
- 💁🏻 Covariate shift is defined as changes in the joint model input distribution, which can have both positive and negative impacts on model performance depending on where and how the shift occurs.
- 👨💻 Concept drift refers to changes in the underlying real-world pattern that the model tries to predict, necessitating updates to the model to maintain accuracy.
- 📖 Direct Loss Estimation (DLE) and Confidence-Based Performance Estimation are introduced as two main algorithms for quantifying the impact of covariate shift on model performance for regression and classification models, respectively.
- 📈 The presentation emphasizes the importance of catching model failure early, ideally before significant business impact, by estimating model performance without needing access to target data.
- 📚 A deep dive into DLE shows how it uses model predictions, features, and known targets from a reference dataset to estimate expected model performance under covariate shift.
- 🔧 Confidence-Based Performance Estimation uses model predictions and scores to estimate the expected confusion matrix for classification models, allowing for detailed performance metrics estimation.
- 👍 Emphasizes that covariate shift is not always detrimental; under certain conditions, it can actually improve model performance if the data shifts towards areas where the model is more confident.
- 🛠 Highlights the need to calibrate predicted probabilities, especially in the face of covariate shift, using calibration curves to diagnose miscalibration and a simple calibration model to correct it.
Q & A
What are the two main causes of potential model failure in machine learning?
-The two main causes of potential model failure in machine learning are covariate shift and concept drift.
How does covariate shift impact model performance?
-Covariate shift impacts model performance by changing the joint model input distribution. This can potentially lead to a significant drop or, in some cases, an improvement in model performance, depending on the type and location of the shift.
What is the difference between covariate shift and concept drift?
-Covariate shift refers to changes in the distribution of the model inputs, while concept drift involves changes in the relationship between the inputs and the target variable, affecting the underlying pattern the model has learned.
What are DLE and CBPE, and how do they relate to machine learning model performance?
-DLE (Direct Loss Estimation) and CBPE (Confidence-Based Performance Estimation) are algorithms used to quantify the impact of covariate shift on model performance. DLE is used for regression models, while CBPE is used for classification models; both estimate how shifts affect model performance without needing new target data.
Why is it important to detect model failure or deterioration in performance before there is business impact?
-It's important to detect model failure early to minimize or ideally eliminate business impact. Early detection allows for corrective actions to be taken before the model's inaccuracies can lead to significant losses or inefficiencies.
How can covariate shift result in both positive and negative impacts on model performance?
-Covariate shift can lead to positive impacts if the data drifts to regions where the model is very confident and accurate in its predictions. Conversely, it can negatively impact performance if the shift leads to regions where the model is less certain or has not seen enough data during training.
What is model calibration, and why is it important in the context of CBPE?
-Model calibration ensures that predicted probabilities accurately reflect the true likelihood of an event's occurrence. In CBPE, calibration is crucial because the algorithm uses model scores or predicted probabilities to estimate the uncertainty of each prediction and, consequently, the model's performance.
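A minimal sketch of how such a calibration check can be run with scikit-learn (the synthetic data and bin count below are illustrative assumptions, not the webinar's exact setup):

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Hypothetical reference (test) set: known labels and the model's predicted
# probabilities for the positive class.
y_true = np.random.RandomState(0).binomial(1, 0.3, size=5000)
y_prob = np.clip(y_true * 0.6 + np.random.RandomState(1).beta(2, 5, size=5000), 0, 1)

# Bucket the predictions and compare the mean predicted probability with the
# observed fraction of positives per bucket.
frac_positive, mean_predicted = calibration_curve(y_true, y_prob, n_bins=10)

for p_hat, p_obs in zip(mean_predicted, frac_positive):
    print(f"mean predicted={p_hat:.2f}  observed positives={p_obs:.2f}")
# A well-calibrated model keeps these two columns close to each other
# (a straight diagonal on a calibration plot).
```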
Can covariate shift detection alone reliably indicate model performance?
-No, covariate shift detection alone cannot reliably indicate model performance. While it shows changes in the input data distribution, it does not directly reflect how these changes affect the accuracy or effectiveness of the model.
How does the Direct Loss Estimation (DLE) algorithm work?
-The DLE algorithm works by calculating loss metrics (like mean squared error or mean absolute error) on reference data where targets are known. It then trains a model to estimate these losses using the features and predictions of the monitored model, allowing for performance estimation on new data without targets.
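A minimal from-scratch sketch of that idea (this is not NannyML's exact implementation; the column names, the gradient-boosting choice, and the chunking below are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def fit_dle(reference: pd.DataFrame, feature_cols: list[str]) -> GradientBoostingRegressor:
    """Fit a 'nanny' regressor that predicts the monitored model's absolute error."""
    X = reference[feature_cols + ["y_pred"]]                    # features + monitored model's prediction
    loss = (reference["y_true"] - reference["y_pred"]).abs()    # instance-level absolute error
    return GradientBoostingRegressor().fit(X, loss)

def estimate_mae_per_chunk(dle, production: pd.DataFrame,
                           feature_cols: list[str], chunk_size: int = 1000) -> pd.Series:
    """Estimate MAE on unlabeled production data, aggregated per chunk."""
    X = production[feature_cols + ["y_pred"]]
    est_abs_error = pd.Series(dle.predict(X), index=production.index)
    chunks = np.arange(len(production)) // chunk_size
    return est_abs_error.groupby(chunks).mean()                 # expected MAE per chunk

# Usage (hypothetical dataframes with columns f1..f3, y_pred, and y_true on reference only):
# dle = fit_dle(reference_df, feature_cols=["f1", "f2", "f3"])
# estimated_mae = estimate_mae_per_chunk(dle, production_df, feature_cols=["f1", "f2", "f3"])
```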
What role does concept drift play in model failure, and how is it different from the role of covariate shift?
-Concept drift plays a crucial role in model failure by altering the relationship between input features and the target variable, making the learned pattern obsolete. Unlike covariate shift, which changes the distribution of inputs, concept drift changes the underlying pattern itself, often requiring model retraining or updating.
Outlines
🔍 Introduction to Model Performance and Failure Causes
This segment introduces the focus of the presentation, which revolves around exploring both theoretical and practical aspects of machine learning algorithms, with a special emphasis on understanding model failures. The agenda outlined includes a discussion on two primary causes of potential machine learning model failure: covariate shift and concept drift, and their implications on model performance. The speaker plans to delve into how covariate shift can impact model performance in both positive and negative ways, depending on its nature and occurrence. Additionally, the introduction mentions two key algorithms designed to quantify the impact of covariate shift on model performance for regression and classification models, respectively named Direct Loss Estimation (DLE) and Confidence-Based Performance Estimation.
🎯 Detailed Look at Concept Drift
This paragraph elaborates on the concept of concept drift, defining it as a change in the relationship between the target variable and the model inputs. It highlights the challenge of quantifying concept drift without access to actual targets and emphasizes the importance of detecting changes in model performance early to avoid business impact. The speaker further illustrates concept drift with an example of a shift in class boundaries, demonstrating how such a shift can significantly degrade a model's predictive accuracy. The paragraph concludes with an emphasis on the necessity of addressing concept drift promptly to prevent or minimize negative business outcomes.
🔑 Understanding Covariate Shift and Its Implications
This section delves into the intricacies of covariate shift, explaining it as a change in the model input distribution and its potential effects on model performance. The discussion covers various scenarios of covariate shift, including changes in the sampling mechanism and alterations in the underlying distribution of data features. The narrative underscores that covariate shift does not uniformly result in performance degradation; in some cases, it may even improve performance if the shift moves data points to regions where the model has high confidence. The explanation also touches upon how unseen regions or insufficiently sampled areas in the feature space can lead to performance drops, establishing a nuanced view of covariate shift's impact on machine learning models.
📊 Direct Loss Estimation (DLE) for Regression Models
The paragraph introduces the Direct Loss Estimation (DLE) algorithm, a technique for quantifying the impact of covariate shift on the performance of regression models. It outlines the prerequisites for applying DLE, including the need for reference data where model performance is known and satisfactory. The discussion elaborates on the assumptions underlying DLE, specifically the absence of concept drift, and explains the inputs required for the algorithm to function, such as model predictions, features, and, for the reference dataset, actual targets. The segment aims to provide a foundational understanding of how DLE operates and its role in assessing and mitigating the effects of covariate shift on regression models.
🔄 Performance Estimation in Regression: How DLE Works
This part offers a deeper dive into how Direct Loss Estimation (DLE) operates, using a simple regression problem as an illustrative example. It shows how the algorithm utilizes the model's predictions and the dispersion of data points to estimate performance metrics like mean absolute error or root mean squared error. The narrative explains that the expected error varies depending on the data point's location in the input space, with regions of higher data dispersion indicating higher uncertainty and expected error. This approach allows DLE to estimate the performance of regression models under covariate shift by analyzing the distribution and characteristics of the input data relative to the model's predictions.
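A small synthetic illustration of that intuition (the data-generating process below is an assumption chosen to mimic the webinar's example, not the presenter's actual data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=20_000)
# Noise shrinks as x grows: points are dispersed near x=0 and almost noise-free near x=10.
y = 2.0 * x + rng.normal(0, 10 - x)
y_pred = 2.0 * x                      # the "perfect" linear model from the example

df = pd.DataFrame({"x": x, "abs_error": np.abs(y - y_pred)}).sort_values("x")
rolling_mae = df["abs_error"].rolling(100).mean()   # rolling MAE over 100 points, ordered by x

print(df["abs_error"][df["x"] < 1].mean())   # high expected error near x = 0
print(df["abs_error"][df["x"] > 9].mean())   # low expected error near x = 10
```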
📉 Estimating Model Performance Decline Due to Covariate Shift
This section emphasizes that covariate shift does not always lead to a decline in model performance. It presents a case where despite significant covariate shift, as measured by PCA reconstruction error, model performance may not necessarily worsen and may sometimes improve. The speaker introduces the concept of multivariate data drift detection and discusses its limitations in accurately predicting performance outcomes. The passage argues for the use of algorithms like DLE to directly assess the impact of covariate shift on performance, highlighting the importance of distinguishing between mere data structure changes and actual performance degradation.
🌟 Confidence-Based Performance Estimation for Classification Models
This paragraph shifts focus to classification models, introducing Confidence-Based Performance Estimation (CBPE) as a method for assessing the impact of covariate shift on these types of models. The speaker outlines what can be calculated using CBPE, including confusion matrices, precision, recall, and even business impact metrics. The narrative explains the assumption of no concept drift and details the inputs needed for CBPE. Special emphasis is placed on the need for model scores or predicted probabilities, highlighting the importance of model confidence in estimating performance and addressing covariate shift in classification contexts.
📚 Calibration and Performance Estimation in Classification Models
The final segment delves into the details of performing Confidence-Based Performance Estimation (CBPE) for classification models. It explains the process of calibrating model predicted probabilities to align with the frequency definition of probability, ensuring that they accurately represent the likelihood of positive outcomes. The speaker discusses how calibrated probabilities are used to estimate confusion matrices for individual predictions and aggregate them to assess model performance over data chunks. The paragraph concludes by showcasing how despite the absence of significant covariate shift, a model's accuracy can still experience notable declines, underscoring the critical role of CBPE in monitoring and adjusting for performance shifts in classification models.
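A minimal sketch of the estimation step for a binary classifier, assuming the probabilities passed in are already calibrated (the threshold and the aggregation below are illustrative assumptions):

```python
import numpy as np

def estimate_accuracy(calibrated_proba: np.ndarray, threshold: float = 0.5) -> float:
    """Estimate accuracy for one chunk from calibrated P(y=1) scores, without labels."""
    pred_positive = calibrated_proba >= threshold

    # Expected confusion-matrix entries, summed over the chunk.
    tp = calibrated_proba[pred_positive].sum()        # P(actual=1) for positive predictions
    fp = (1 - calibrated_proba[pred_positive]).sum()
    fn = calibrated_proba[~pred_positive].sum()       # P(actual=1) for negative predictions
    tn = (1 - calibrated_proba[~pred_positive]).sum()

    return (tp + tn) / (tp + fp + fn + tn)

# Usage on a hypothetical chunk of production scores:
# chunk_scores = np.array([...])            # calibrated probabilities for one chunk
# print(estimate_accuracy(chunk_scores))    # expected accuracy before targets arrive
```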
Keywords
💡Model Failure
💡Covariate Shift
💡Concept Drift
💡Direct Loss Estimation (DLE)
💡Performance Estimation
💡Sampling Mechanism
💡Model Confidence
💡Regression Models
💡Classification Models
💡Business Impact
Highlights
Introduction to the theoretical and practical aspects of machine learning model performance, focusing on algorithms.
Discussion of two main causes of potential machine learning model failure: covariate shift and concept drift.
Detailed examination of covariate shift, its potential impacts on model performance, and how these impacts can be both positive and negative.
Introduction of Direct Loss Estimation (DLE) for quantifying the impact of covariate shift on regression model performance.
Introduction of Confidence-Based Performance Estimation (CBPE) for quantifying the impact of covariate shift on classification models.
Overview of the essential components of machine learning models, including the sampling mechanism and the true pattern in reality they aim to capture.
Explanation of the difference between covariate shift and concept drift, and their respective impacts on model performance.
The significance of capturing model failure or performance deterioration before there is a business impact.
The challenge of quantifying concept drift without access to labels and its implications for model performance.
Using model confidence as a proxy for potential model performance and the relationship between covariate shift and model certainty.
The process of direct loss estimation (DLE) for regression models, including its assumptions and inputs required.
The calibration process for predictive probabilities in classification models to ensure they match the frequency definition of probability.
How confidence-based performance estimation works to predict the impact of covariate shift on classification models without labels.
The importance of model calibration in the accuracy of performance estimation algorithms.
Demonstration that covariate shift is not always indicative of model performance deterioration.
Transcripts
…model performance. So yeah, as I mentioned, we're going to do both a theoretical and a bit of a practical deep dive here, but really mostly focusing on the algorithms. On the agenda we have four items. The first one is that we're going to talk about the two main causes of potential model failure, and here by failure I mean a significant drop in machine learning model performance. Then, in the second part, we're going to focus on one of those causes, which is covariate shift, and we're going to talk about how it can potentially impact model performance. We'll see that the story there is not that simple, and the impact can be both positive and negative, depending exactly on the type and place where the covariate shift actually happens. Then, once we have that and we've built some intuition of how covariate shift impacts model performance, and what the role of uncertainty is in all that, we're going to talk about our two main algorithms that actually help us quantify the impact of covariate shift on model performance. For regression we have an algorithm called DLE, which stands for Direct Loss Estimation, that helps us quantify the impact of covariate shift on model performance — you're going to hear that a lot here — for regression models. And for classification models we have Confidence-Based Performance Estimation, which does the exact same thing but for classification models, and it works for both multiclass and binary classification. We're going to focus on the binary example just for the sake of simplicity, but everything here will generalize to multiclass problems.
So now let's get started with the first item, which is the two main causes of machine learning model failure. Just before we do that, we'll do a quick refresher on the actual moving parts of our machine learning model: what are the things that we ingest, and what are we trying to achieve with supervised machine learning. The first thing is that there exists some kind of true pattern in reality that we're trying to capture. In our example here we have just one feature, feature X, and we're trying to predict a binary outcome. We see that as feature X increases we have this kind of sigmoid pattern where the probability of the positive class increases, and then if X is very high the probability actually decreases and the probability of the negative class is almost 100%. This is the pattern that our model will try to capture based on the data we have. And the data we have comes from the sampling mechanism — the way we sample the data from our population. We never have access to the full population; we have a certain sample of our data, for example our current customers or all the past events that we managed to collect, and that forms the data we then use to train our model. Then imagine that we did train the model and managed to correctly capture the pattern you see at the top of the slide. What can then happen?
The first thing that can happen is so-called covariate shift. Covariate shift happens when our sampling mechanism changes. It's not something that we necessarily control ourselves: it might be that we trained the model in one country but then need to deploy it in another country, or on a different segment of the population, or the segment is the same but the distribution of, let's say, age changed slightly, or we deployed our model on one machine and now it needs to operate on a slightly different machine. Then there is a difference in something, and that will influence how our data looks — and not only the data but also the actual targets or labels that we have; the class balance might also change — and that may or may not impact the actual performance of the model. I'm going to delve deeper into that in the second part of the presentation, but for now let's focus on defining it fully. Covariate shift is basically a change in the joint model input distribution, the probability distribution over all the X's. We see one example of that in one dimension, where just the mean has shifted, but it might also be any other structural change in the joint model input distribution. It might even be that on every specific dimension the distribution does not change, but, for example, the correlation between different features changes.
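As a minimal illustration of that last point — two features whose marginals stay the same while only their correlation changes — here is a small synthetic sketch (the numbers are assumptions, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
cov_reference  = [[1.0, 0.8], [0.8, 1.0]]   # strongly correlated features at training time
cov_production = [[1.0, 0.0], [0.0, 1.0]]   # same marginals, correlation gone

X_ref  = rng.multivariate_normal([0, 0], cov_reference,  size=10_000)
X_prod = rng.multivariate_normal([0, 0], cov_production, size=10_000)

# Each marginal looks unchanged...
print(X_ref.mean(axis=0), X_ref.std(axis=0))
print(X_prod.mean(axis=0), X_prod.std(axis=0))
# ...but the joint input distribution P(X) has shifted: this is still covariate shift.
print(np.corrcoef(X_ref.T)[0, 1], np.corrcoef(X_prod.T)[0, 1])
```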
Then the second reason why machine learning models may fail is so-called concept drift. Here, what changes is the actual true pattern that exists. Let's say that people used to default on loans when they made less money, but for some reason this no longer holds. If that happens, the pattern that our model has learned is no longer fully relevant for the real world the model is operating in, so it's very likely that the model will actually fail. That's one of the reasons why we're not focusing on it here. Another reason is that to really quantify concept drift we need to have access to labels. Why is that? Because concept drift is simply defined as a change in the relationship between the target and the model inputs — the conditional probability of the target given the feature vector — and if that happens we really cannot quantify it before we have access to the targets. And we do need to know the model performance, and quantify the change in model performance, before we have access to the targets; I'll explain that in the next slide. I hope everything is clear — again, if you have any questions, do put them in the Q&A and I will answer them at the end of the presentation.
Now let's just visualize concept drift, to give a bit more of an intuition. Here we see that the actual true class boundary that exists in reality was this kind of vertical thing, and then, due to some change in how the world operates — maybe something like a pandemic or some other huge event, or maybe something slower that changes the actual concept gradually — instead of having that vertical class boundary we now have basically a horizontal class boundary. And of course, a model that learned the first boundary is going to operate really badly on the second boundary. So this is concept drift; it's not something I'm going to focus on now. We're going to be focusing on the first reason why machine learning models can fail, which is covariate shift.
Before we go into really discussing the intuition of how covariate shift can impact model performance, we need to talk about one really important thing, which is that we generally want to catch model failure — catch a change or deterioration in model performance — before there is business impact. What you see here is basically a timeline. The first thing we get is the model inputs; these model inputs are then fed to the model; the model makes predictions; and as the predictions are made, they get processed in some way. Maybe it's a credit scoring model, the predictions are credit scores, and these predictions are then used to either deny or grant loans. For predictive maintenance use cases you get model outputs saying whether a machine needs maintenance or not, and based on that the maintenance is performed or not. So there's going to be some kind of business impact that the model creates. And only once this business impact happens do we possibly get the targets. In the case of loans we're going to have to wait quite a long time — let's say two or three years, whatever the length of the loan is — to see whether the person actually defaulted on it or not, and in the meantime we keep making more and more predictions. What we want to do, ideally, is catch model failure before there is business impact — so before the loan is granted or not, ideally, but it is also quite okay to catch it shortly after the loan is granted, so that we can stop the model from operating for, let's say, the remaining two years and granting more and more loans. We want to catch model failure as soon as possible to minimize or ideally eliminate the business impact. That also means that we never have the targets before it's too late, because if we have the targets, the model has already acted and we got some feedback from the real world. So we have to focus on trying to estimate the impact of covariate shift on model performance without access to targets, because basically by definition, if we have access to the targets, there is already business impact and the damage has been done.
I hope that's clear. Now let's talk about the intuition of how we can quantify the impact of covariate shift on model performance without access to the target data. The first thing we're going to do is look at model predictions. We have this beautiful dragon-fruit picture here, which shows what the confidence of the model is depending on where in the model input space a given data point is located. What we did here is train a model — I think it was a nonlinear SVM, because it gives very nice smooth shapes — on a simple XOR problem. What we see is that where there is a concentration of data points from one class, the model tends to be very confident in its predictions: it's very likely that the predictions there are going to be positive, or negative. So the predicted probabilities are concentrated around being very close to either one or zero, and the confidence is basically just a simple measure of how far from 0.5 the prediction is. If the model confidence is zero, the model's predicted probability is 0.5; if the predicted probability is close to either zero or one, then the confidence measure you see here is going to be very close to one — basically just like an absolute value, but centered at 0.5. And what we see is that we can actually look at this model confidence as a kind of proxy for potential model performance. To build a bit more intuition, imagine that we take a point that's very close to the class boundary. There's not much information to go on to try to predict whether that data point is going to be positive or negative; if it lies exactly on the class boundary, it's basically a coin toss, it's 50%. So for data points that are close to the class boundary we expect performance to be very low, because there's no information to really use to try to predict the target, and we see that the model confidence is also very low there — we see this intense pink. And then, if we consider potential points that would appear, let's say, around the (-3, -3) point in the feature space, we would also expect the model to really not know what to do, and we would expect to see very bad performance, because the model was not trained on such data. So we can use that model confidence as a proxy for the expected model performance. But what is the role of covariate shift in all that?
The main point is that covariate shift is a change in the distribution of data in the model input space. What you see here is that on the test data — where we tested the model and decided that the model performance is satisfactory — we had a certain input distribution that's kind of wide and covers most of the green regions where we expect to see high model performance. But as we deploy our model to production, for one reason or another, we see that actually most of the data points are concentrated very close to the (0, 0) region, which is exactly in the middle of the red zone where it's very hard to predict whether a given data point is going to be positive or negative. So if we see that kind of pattern in our covariate shift, in our drift, we expect to see a drop in performance. Now, on the opposite side of the spectrum: if we see that there is significant covariate shift but the data happens to drift to regions where the model is very confident in its predictions, we actually expect to see either no significant change in model performance or, in very extreme cases like the one you see here, an increase in performance. Which means that covariate shift is not always bad news: if the data drifts away from the class boundary and towards a region where the predictions are very easy to make — because they are, for example, all ones — then we might actually see an increase in performance, and potentially an increase in business impact. So covariate shift is not always a bad thing. And then the last kind of typical case we might see is drift to regions that were not seen, or not properly sampled, during training, so there was not enough data to really capture the correct pattern in that region. If that happens, because the model didn't learn the pattern there, the model will probably predict badly. As we see here, if the data drifts to this unseen region we expect to see a drop in performance — the model is basically predicting a random value there — and we would expect model failure, actually.
So now, as we've covered the intuition of how we can think about the relationship between covariate shift, model certainty or model confidence, and expected performance, let's try to formalize this thinking with our first algorithm, for regression models, called Direct Loss Estimation. With this algorithm we will be able to take the model predictions and somehow turn them into the expected performance of the model given the current feature distribution — so given the covariate shift we're experiencing. First, what can we estimate with DLE? We can estimate basically any metric we use, as long as it can be quantified at the instance level. We can look at root mean squared error, which can be calculated at the instance level as just the squared error; we aggregate it over a certain period of data, or a certain number of data points, and take the root of the mean of those squared errors. The same goes for absolute error — we can take the mean absolute error — and the same thing follows for logarithmic errors. So basically any metric that's really used in practice for evaluating regression models.
What do we need to use DLE? The first thing is so-called reference data. Reference data is data for which we do have the targets and we are happy with the performance, so we have a performance level we can treat as a benchmark. A typical choice for the reference data is the test set, because this is by definition a data set for which we have the targets — we were able to actually test the model on it and evaluate the performance — and after we evaluated the performance we were happy with it. The second thing we need is the actual data in question, for which we are trying to estimate our performance. This is just the monitored data, the production data that streams in as we make predictions, either in batches or in streaming — it really doesn't matter here. Importantly, we don't need the targets, because we are trying to quantify the impact of covariate shift on performance before there is business impact, so before the targets arrive as a kind of feedback from the real world. And the last thing — we don't strictly need it, but we do need to specify it — is how we want to aggregate our data. Imagine that instead of aggregating the data in chunks, we wanted to estimate the performance for every single data point. This is something we could potentially do, but unfortunately, in practice, the estimation error of that performance is going to be very high, just because of the stochastic nature of the data. So instead, what is more practical is to look at a collection of data points — let's say a day of data, an hour of data, or the last 1,000 data points. We aggregate those and assume that the performance for that specific chunk is reasonably constant, or approximately constant, which is generally a very correct assumption for almost all use cases. Then, instead of looking at very uncertain estimates for every single data point, we're able to very significantly reduce the uncertainty of the performance estimation, and we look at chunks of data. An interesting thing is that in NannyML we also output confidence bands, so you know that if your chunks are very small you will get very wide confidence bands, which means you shouldn't trust your estimates too much and should potentially aggregate the data at a higher level to get more points per chunk.
All right, now as for the assumptions, there's really only one assumption we need to be aware of, which is that there is no concept shift. As I already mentioned, concept shift can also impact performance, so when we want to get a good performance estimate, the one thing we need to do is assume that there is no concept shift; then we can really use DLE to get a good performance estimate. The other thing is that concept shift itself — the change in the conditional probability of the target given the features — will also impact the uncertainty distribution, the uncertainty landscape, so the very thing we want to leverage would have changed, and there is no way to quantify how exactly that change happens before we have the targets. So we're going to assume that there is no concept shift. This is actually not a very strong assumption for most use cases: for any use case that deals with the physical world, the physics doesn't change, so concept shift is very, very rare. For customer analytics use cases you might see concept shift, but it's something that will generally be very well known and obvious from a business perspective. Let's say you train a model in Germany and then try to deploy it in China — of course the consumers there react differently, think differently, behave differently, so you would obviously expect to see some concept shift. But barring these kinds of extreme events or extreme changes, concept shift rarely happens in practice in a way that's very influential on short time scales. We also have a way to measure concept shift, but that requires targets. So this is something we rely on not happening too often, and as long as that is the case, we can wait for the targets and actually quantify the impact of concept shift as well. But enough about concept shift — let's assume it does not exist for the time being: there is no concept shift, and the actual pattern that the model has captured is still correct.
captured is still
correct what do we take as inputs first
thing is we need to take the model
predictions whatever our model predicts
uh it's a continuous real valued uh
number we take that we also will take
all the features that the model consumes
and we also take the Target also known
as the grun roof for the fitting part so
only for the reference data set because
on the reference data set we're going to
fit the dle it's also a machine learning
algorithm and we're going to learn how
the model uncertain what is the model
uncertainty and how it maps to expected
model
Then, how is performance estimation possible? We already talked a bit about the intuition for the classification variant, but now let's talk about how it works exactly for a regression use case. Imagine that we have this very simple regression problem where there is just one model input, the value X, and we're trying to predict Y. What we see is basically a perfect linear relationship: we capture all the information that we have in our data set to create the best estimator possible, which happens to be just linear regression, because I created that data to follow a linear pattern. But we also see another very interesting thing: the data points seem to be more dispersed the closer they are to x = 0, and if you go higher, where X is close to or equal to 10, then the regression is basically perfect and there's almost no noise in the data. That means that if we pick a point where the value of x is close to zero, we expect to see a bigger error, so lower model performance, whereas if a lot of data points come from the region between, let's say, 9 and 10, we expect very good predictions — so also very good model performance and very low errors, however we define our error, whether it's absolute error or squared error. Just to validate that this is the case, what I did here is calculate a rolling mean over X, for each 100 data points, of the actual realized error — the absolute error of the predictions — and we see that indeed for lower values of x we observe higher errors on average, and as we increase the value of x the errors get lower.
And this is really the key intuition: the dispersion of points corresponds to uncertainty, and it's something we can actually measure. Based on where a data point is in the model input space, we can find the expected error for that data point, based on how much information there is about the actual target in that region of our model inputs. To visualize it in two dimensions, imagine that now we have two features — of course this generalizes to hundreds or thousands of features, no matter how big our model input space is — and we can then quantify this expected error; we'll see how in a second. Then imagine that on the test data we have data points distributed as the blue points, and as we deploy our model to production we see that unfortunately a lot of data points have drifted to the red areas. That means we expect to see a drop in model performance for a given chunk of data. So we're going to aggregate all those predictions and get, let's say, a mean squared error or mean absolute error, and that's going to be the metric we estimate. And it actually turns out that, as long as there's no concept drift, this is the true expected mean absolute error or mean squared error that we expect to see once the targets arrive.
Now, the algorithm itself. We have very detailed documentation and blogs about how the algorithm works — everything is open, we're not hiding anything, so you can actually see how it works — and DLE is fully available in our open source and in the cloud product. How does it work? The first thing we do is calculate the loss metric on the reference data. Again, for the reference data — for example our test data — we have the targets, so we can simply calculate the actual loss metric we're using, such as the absolute error. We do it for every point separately, so we get all the absolute errors. Then we train a model to estimate this loss metric — the absolute errors — using the features that the actual monitored model consumes, plus the monitored model's predictions as a feature as well, because generally these predictions can also be informative about the error itself, and we see in practice that they are. Then we take that regressor — let's say it's a gradient boosting model; in our open source and the cloud product we use gradient boosting models — and we use that model, the DLE, to directly estimate the loss on the monitored data. So for every data point we estimate what the expected absolute error is, and then, once we have all those absolute errors for a given chunk of data — let's say a day of data — we aggregate that data per chunk, and that is the output we have: the expected mean absolute error per chunk.
How does it look? Here I wanted to show one thing that might not be obvious to some of you, which is that just because there is covariate shift, it does not mean that performance has dropped. What you see on the left is an overall measure of change in the joint model input distribution, the data structure — so just a way to measure multivariate data drift. It's called PCA reconstruction error, and it's one of the algorithms we have for multivariate drift detection, for multivariate covariate shift detection. The higher it is, the stronger the change between a given chunk and the reference data in terms of the data structure: the data might be distributed differently, look different — we know the internal data structure has changed if that value goes up. I'm not going to go into the details of how it works, but of course we can share resources for that later. What I wanted to convey here is that as the data drifts, and we see there is actually very strong covariate shift, it does not mean there is any issue with model performance. Even more interestingly, you see that there are two spikes in terms of MAE here, but they don't really correspond to anything specific in terms of data drift. You really need to quantify the impact of data drift — of covariate shift — on performance; you cannot just rely on covariate shift measures as a proxy for performance, because there's basically no relationship whatsoever, and we need to be able to find the expected performance as fast as possible. There is a place for drift detection, mostly as a kind of root cause analysis tool, where we can try to figure out which features exactly have drifted, how they are drifting, and what the actual main reason is that our models are failing within the umbrella of covariate shift. But for trying to estimate whether the performance is good, we need to rely on algorithms such as DLE to really know what the impact of covariate shift on performance is. So that's DLE — that's how we estimate the impact of covariate shift on performance without access to labels, in a way that actually gives you expected performance in the performance metric itself, and that is a much more accurate measure of the current level of your predictive performance than just doing simple covariate shift detection or data drift detection.
So now, the last part: how we can quantify the impact of covariate shift on performance for classification models. Here we have Confidence-Based Performance Estimation — we're very creative with names, as you can see — and what it does is basically what it says it does. Let's follow the same format as before. First, what can we calculate? We can calculate, first of all, the confusion matrix, for both binary and multiclass variants — I'm going to focus on binary just for the sake of simplicity here. We can also calculate precision, recall, accuracy — basically any metric that you can get as long as you have the confusion matrix, because we actually get the confusion matrix first and then compute any metric we want. You can also compute metrics that need a range of confusion matrices: you change the thresholds, you get different confusion matrices, and then you construct a curve such as the ROC curve, so we can also get those metrics — ROC AUC, average precision, etc. And the last thing we can get, similarly to DLE actually, is the business impact. For some use cases — let's say credit scoring — you know what the expected monetary loss is if somebody defaults on a loan, and you also know how much money you're making on average per loan that's granted. If you know that, you can estimate not only the machine learning metrics but also that business metric, which is how much money the model is making per chunk of data, per day of operation. This can be very useful for communicating with business stakeholders, because then you also increase the visibility of your model and you can say that the model you deployed to production made, let's say, 50,000 today — which is obviously a very useful thing for you, also just from a personal career progression perspective, because you can really show the impact that your work has.
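As a toy sketch of that business-metric idea (the per-cell money values below are made-up assumptions; in a real use case they come from the business):

```python
# Hypothetical value of each confusion-matrix cell for a credit-scoring model,
# in currency units per loan decision. These numbers are assumptions, not from the webinar.
VALUE = {"tp": -50, "fp": -1000, "fn": -200, "tn": 300}

def expected_business_value(expected_confusion_matrix: dict) -> float:
    """Translate an (estimated) confusion matrix for a chunk into money."""
    return sum(expected_confusion_matrix[cell] * VALUE[cell] for cell in VALUE)

# e.g. expected_business_value({"tp": 120.4, "fp": 13.6, "fn": 22.1, "tn": 843.9})
```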
The assumption is exactly the same: there's no concept shift. The inputs are very similar to DLE, but I wanted to mention one specific thing here, which is that we not only need the model predictions — zero or one for a binary prediction use case — but we also need the model scores, the model's predicted probabilities, because instead of building our own model we're going to leverage those to estimate the uncertainty of each prediction. And the last thing is that we need the targets for the reference data set, just as before. Now, I'm going to skip the intuition part, because we already covered the intuition at length, both in section two about how covariate shift can impact performance and in DLE, and the intuition is exactly the same: if the model is more uncertain, then we expect to see a drop in model performance.
What is the recipe, the actual algorithm we're using? It has five points. The first one is that we just take the model predictions and the predicted probabilities. Then the predicted probabilities need to be calibrated. Calibration means that we take the predicted probability that your model outputs — let's say 0.7 — and we ensure that this 0.7 actually corresponds to the frequency definition of probability, which is that for a large number of samples — let's say we get 10 million data points that all score 0.7 — we expect 70% of them to turn out to be positive. This is what it means to have a 0.7 probability according to the frequentist definition, and we're going to ensure that the model's predicted probabilities actually follow that definition in aggregate, on average (and, I think, also in expectation). Then, once we have those calibrated probabilities, we find the expected confusion matrix for every point. This is really the magic step, where we turn those predicted probabilities and model predictions into a confusion matrix for every data point separately. Once we have that, we can aggregate the confusion matrices for a chunk, so instead of having, let's say, a thousand different instance-level confusion matrices, we get one big confusion matrix per chunk. And once we have that — the expected confusion matrix for a given chunk — we can just use it to calculate any metric we care about, including the business impact, as long as we know what the impact of each true positive, false positive, etc. is.
So now we're going to go over each of these points step by step. It requires a bit more nuance than DLE, where we just train a model to predict the errors; here we have a few more points. I'm going to skip point number two at first: we'll assume the calibration is already there, and then we'll talk about all the other parts. Now, for how we actually do it: we're going to start with the confusion matrix and try to fill it even though we don't have the targets. Normally we need the actuals column here — we need to know what the actuals are — but for the time being, for this algorithm, we're going to assume we have no targets, so we need to do something else. And it turns out we can actually do it, which is really the magic behind it. The first point is that we take a look at the model prediction. Let's say the model makes a positive prediction; that already gives us some information about the confusion matrix, because if the model predicts positive, it means it's definitely not a negative prediction. So we already know it's not a false negative, and we know it's not a true negative, so in both of those cells we put zero. Then we also know that the model predicts, let's say, 0.7, and for the time being I'm going to assume that these probabilities are perfectly calibrated, so the 0.7 actually does correspond to a 70% chance that this prediction will turn out to be positive. Now, remember that at the end we're going to put those predictions together, so we can work with an average. So what we're going to say is that, on average, in expectation, there is 0.7 of a true positive here and 0.3 of a false positive, and as we aggregate the predictions in the chunk we get to an expected confusion matrix following exactly that method. Then for the second prediction we do the same thing: it's 0.4, a negative prediction, so we put 0.4 in false negative and 0.6 in true negative, because that is the remainder — 1 minus 0.4 is 0.6. For the third prediction, the same as the first, and so on. We should really do this for a few hundred data points more, but let's say that our chunk is only three data points. Then, once we have that expected confusion matrix, we can just take a metric — let's say accuracy — and estimate it using this expected confusion matrix. Accuracy is just true positives plus true negatives divided by the total number of data points, and we already have all the information we need from our confusion matrix, so we calculate the accuracy and we have the expected accuracy. That's it — that's really the algorithm, as long as the probabilities are calibrated.
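Written out as a quick check, the three-prediction toy chunk above gives (assuming, as in the example, perfectly calibrated scores):

```python
# (prediction, calibrated P(y=1)) for a chunk of three predictions, as in the example
chunk = [("positive", 0.7), ("negative", 0.4), ("positive", 0.7)]

tp = fp = fn = tn = 0.0
for label, p in chunk:
    if label == "positive":
        tp += p          # chance the positive prediction is actually positive
        fp += 1 - p
    else:
        fn += p          # chance the negative prediction is actually positive
        tn += 1 - p

expected_accuracy = (tp + tn) / (tp + fp + fn + tn)
print(tp, fp, fn, tn, expected_accuracy)   # 1.4, 0.6, 0.4, 0.6 -> about 0.667
```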
And what are these numbers here? These are the two numbers that we get from the probability, so for these numbers to be correct, the probability also needs to be correct — but it's actually not. So how do we know whether the probabilities that our model outputs are actually correct? The way to really inspect that, to find out, is the so-called calibration curve, which is exactly what you see here. What does it display? On the x-axis you have the mean predicted probability of a certain bucket of points. How do we create these buckets of points? We take a data set that the model has not seen before, such as the test set — so our reference data set — and we make predictions on the entire data set. Then we sort these predictions by the predicted probability that the model outputs, and we aggregate these predicted probabilities into buckets — let's say the first bucket goes from 0 to 0.1, the next from 0.1 to 0.2, and so on. Then we calculate the mean predicted probability for each of the buckets, and once we have that, for each bucket we count the fraction of actual positives that we observe for that bucket in our test set. If the probabilities are correctly calibrated, what we expect to see is that if the mean predicted probability is, let's say, 0.3, then we expect to see 30% positives for that bucket. But this is unfortunately not the case for the vast majority of machine learning algorithms — I believe for everything that's not logistic regression. So for gradient boosting models, any kind of ensemble or bagging models, and deep neural networks — for deep neural networks it's really bad, actually — we see either this S-shaped curve or the inverse S-shaped curve. The inverse S-shaped curve really only happens for SVMs as far as I know, but don't quote me on that, I'm not sure. The point is that we want a straight line, and we don't have a straight line. So what can we do? We can actually just make it into a straight line with another regression model that we're going to train on the test set, on the reference data.
How does it work? Basically, to turn the predicted probabilities into actual probabilities we only need two steps. First, we take just one feature, which is the predicted probability, and we use it to predict the probability of Y — our target — being equal to one, given that predicted probability. So we just do a one-dimensional regression; we can use any regression algorithm here — gradient boosting, simple linear models, whatever you want. And then we can straighten the line that you see here, so instead of having an S-shaped curve you have a straight curve. Once we have that, we can say that — in aggregate, on average, and also only for the test set, which is one of the issues with CBPE that we solve with our new, better algorithm called M-CBPE, where M stands for multi-calibrated, available in the cloud — but let's assume for now that these probabilities are properly calibrated even under covariate shift. That means we can follow the reasoning I outlined on the slides before to get the expected confusion matrix for every point, then aggregate it per chunk, and then compute the metrics. That's basically how the algorithm works.
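A minimal sketch of that calibration step, using isotonic regression as one reasonable choice of one-dimensional regressor (the talk only requires some 1-D fit on the reference set; variable names here are assumptions):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_calibrator(reference_proba: np.ndarray, reference_y: np.ndarray) -> IsotonicRegression:
    """One-dimensional regression: raw predicted probability -> P(y=1 | score)."""
    calibrator = IsotonicRegression(out_of_bounds="clip", y_min=0.0, y_max=1.0)
    calibrator.fit(reference_proba, reference_y)
    return calibrator

# Fit on the reference (test) set, then apply to production scores before
# building the expected confusion matrices:
# calibrator = fit_calibrator(ref_scores, ref_labels)
# calibrated_production_scores = calibrator.predict(prod_scores)
```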
Now for the results, I wanted to show kind of the opposite case of what you saw for the regression use case. Here, what you see is that the PCA reconstruction error — our measure of multivariate data drift — stays roughly constant and stays within the threshold: there is no strong covariate shift. But even so, we see a significant drop in accuracy, and we see that the estimated and realized accuracy follow each other very, very closely — so we're good at estimating accuracy here — and there is a big dip of roughly 20% compared to the test set, so it's definitely business-significant, even though there is no covariate shift. So again, if you just monitor covariate shift, the model might actually be experiencing failure — there might be significant performance degradation — but it's not something you will see just based on covariate shift measures. That is really it. Thanks for listening. We have our documentation that outlines the algorithms in detail, both for the cloud and the open source, and as for the cloud, it's something…