Quantifying the Impact of Data Drift on Machine Learning Model Performance | Webinar

NannyML
28 Mar 2024 · 40:16

Summary

TL;DR: This deep dive presentation focuses on understanding and mitigating machine learning model failures due to covariate shift and concept drift. It outlines the two main causes of model performance deterioration, with special emphasis on covariate shift, exploring how it can both positively and negatively affect model outcomes depending on the nature and direction of the shift. The presentation introduces two algorithms, Direct Loss Estimation (DLE) for regression and Confidence-Based Performance Estimation (CBPE) for classification, designed to quantify the impact of covariate shift on model performance. Through practical examples and detailed explanations, it shows how these algorithms can anticipate model failures and performance changes, enabling proactive adjustments before significant business impact occurs.

Takeaways

  • 🙂 The presentation covers both theoretical and practical aspects of machine learning model performance, focusing on algorithms.
  • 👇 Two main causes of machine learning model failure are discussed: covariate shift and concept drift, highlighting how they can lead to significant performance drops.
  • 💁🏻 Covariate shift is defined as changes in the joint model input distribution, which can have both positive and negative impacts on model performance depending on where and how the shift occurs.
  • 👨‍💻 Concept drift refers to changes in the underlying real-world pattern that the model tries to predict, necessitating updates to the model to maintain accuracy.
  • 📖 Direct Loss Estimation (DLE) and Confidence-Based Performance Estimation (CBPE) are introduced as the two main algorithms for quantifying the impact of covariate shift on model performance, for regression and classification models respectively.
  • 📈 The presentation emphasizes the importance of catching model failure early, ideally before significant business impact, by estimating model performance without needing access to target data.
  • 📚 A deep dive into DLE shows how it uses model predictions, features, and known targets from a reference dataset to estimate expected model performance under covariate shift.
  • 🔧 Confidence-Based Performance Estimation uses model predictions and scores to estimate the expected confusion matrix for classification models, allowing for detailed performance metrics estimation.
  • 👍 Emphasizes that covariate shift is not always detrimental; under certain conditions, it can actually improve model performance if the data shifts towards areas where the model is more confident.
  • 🛠 Highlights the need for models to be recalibrated for accuracy, especially in the face of covariate shifts, using techniques like calibration curves to adjust predicted probabilities.

Q & A

  • What are the two main causes of potential model failure in machine learning?

    -The two main causes of potential model failure in machine learning are covariate shift and concept drift.

  • How does covariate shift impact model performance?

    -Covariate shift impacts model performance by changing the joint model input distribution. This can potentially lead to a significant drop or, in some cases, an improvement in model performance, depending on the type and location of the shift.

  • What is the difference between covariate shift and concept drift?

    -Covariate shift refers to changes in the distribution of the model inputs, while concept drift involves changes in the relationship between the inputs and the target variable, affecting the underlying pattern the model has learned.
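The distinction can be made concrete with a toy one-feature setup (my own illustration, not an example from the webinar): covariate shift changes the input distribution P(X) while the labeling rule P(Y|X) stays fixed, whereas concept drift changes the rule itself.

```python
import numpy as np

rng = np.random.default_rng(42)

# A fixed "true pattern": the label is 1 whenever the feature is positive.
def concept(x):
    return (x > 0).astype(int)

# Reference data: inputs drawn from N(0, 1).
x_ref = rng.normal(0.0, 1.0, 10_000)
y_ref = concept(x_ref)

# Covariate shift: P(X) moves to N(2, 1), but the same rule still labels it.
x_shifted = rng.normal(2.0, 1.0, 10_000)
y_shifted = concept(x_shifted)

# Concept drift: same inputs as reference, but the rule itself has flipped.
def drifted_concept(x):
    return (x < 0).astype(int)

y_drifted = drifted_concept(x_ref)

print(x_ref.mean(), x_shifted.mean())   # the input distribution moved
print((y_drifted != y_ref).mean())      # under drift, the labels changed
```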

  • What are DLE and CBPE, and how do they relate to machine learning model performance?

    -DLE (Direct Loss Estimation) and CBPE (Confidence-Based Performance Estimation) are algorithms used to quantify the impact of covariate shift on model performance. DLE is used for regression models, while CBPE is used for classification models; both help assess how shifts might affect model accuracy without needing new target data.

  • Why is it important to detect model failure or deterioration in performance before there is business impact?

    -It's important to detect model failure early to minimize or ideally eliminate business impact. Early detection allows for corrective actions to be taken before the model's inaccuracies can lead to significant losses or inefficiencies.

  • How can covariate shift result in both positive and negative impacts on model performance?

    -Covariate shift can lead to positive impacts if the data drifts to regions where the model is very confident and accurate in its predictions. Conversely, it can negatively impact performance if the shift leads to regions where the model is less certain or has not seen enough data during training.
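Both directions show up in a small simulation (a sketch of my own using scikit-learn, not code from the talk): a classifier evaluated on data that has drifted away from its class boundary scores higher than on data concentrated near the boundary, where even the best possible model is close to a coin toss.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(mean, std, n):
    """Draw features and noisy labels from a fixed sigmoid concept."""
    x = rng.normal(mean, std, n)
    p = 1.0 / (1.0 + np.exp(-3.0 * x))   # true P(y=1 | x)
    y = (rng.random(n) < p).astype(int)
    return x.reshape(-1, 1), y

# Train on a broad reference distribution centered on the class boundary.
X_train, y_train = sample(0.0, 2.0, 20_000)
model = LogisticRegression().fit(X_train, y_train)

# Negative case: data drifts toward the boundary, where labels are noisy.
X_near, y_near = sample(0.0, 0.3, 5_000)
# Positive case: data drifts far from the boundary, where the model is confident.
X_far, y_far = sample(3.0, 0.3, 5_000)

acc_near = model.score(X_near, y_near)
acc_far = model.score(X_far, y_far)
print(acc_near, acc_far)   # accuracy drops near the boundary, rises far from it
```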

  • What is model calibration, and why is it important in the context of CBPE?

    -Model calibration ensures that the predicted probabilities accurately reflect the true likelihood of an event's occurrence. In CBPE, calibration is crucial because the algorithm uses model scores or predicted probabilities to estimate the uncertainty of each prediction and, consequently, the model's performance.
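Concretely, "calibrated" means that among all predictions scored around 0.7, roughly 70% should actually turn out positive. A synthetic check (my own sketch, not NannyML's implementation) with scikit-learn's calibration_curve:

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)

# Simulate a perfectly calibrated model: the score IS the true probability.
scores = rng.uniform(0.0, 1.0, 50_000)
y = (rng.random(scores.size) < scores).astype(int)

# In each score bin, compare the observed positive rate to the mean score.
frac_positive, mean_score = calibration_curve(y, scores, n_bins=10)

# For a calibrated model the two track each other (the diagonal of a
# calibration plot); large gaps would call for recalibration.
gap = np.abs(frac_positive - mean_score).max()
print(gap)
```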

  • Can covariate shift detection alone reliably indicate model performance?

    -No, covariate shift detection alone cannot reliably indicate model performance. While it shows changes in the input data distribution, it does not directly reflect how these changes affect the accuracy or effectiveness of the model.

  • How does the Direct Loss Estimation (DLE) algorithm work?

    -The DLE algorithm works by calculating loss metrics (like mean squared error or mean absolute error) on reference data where targets are known. It then trains a model to estimate these losses using the features and predictions of the monitored model, allowing for performance estimation on new data without targets.
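The steps above can be sketched in a few lines (a toy reconstruction of the idea, using scikit-learn's GradientBoostingRegressor as the "nanny" model; NannyML's actual implementation differs in its details): on reference data with known targets, compute each prediction's loss, train a second model to predict that loss from the features and the monitored model's prediction, then average its output over unlabeled data to estimate the MAE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

def make_data(n):
    """Target with heteroscedastic noise: errors grow with |x|."""
    x = rng.uniform(-2.0, 2.0, n)
    y = 2.0 * x + rng.normal(0.0, 0.1 + 0.5 * np.abs(x))
    return x.reshape(-1, 1), y

# Monitored model, trained and evaluated on reference data with targets.
X_ref, y_ref = make_data(10_000)
monitored = LinearRegression().fit(X_ref, y_ref)
abs_err_ref = np.abs(y_ref - monitored.predict(X_ref))

# "Nanny" model: learn to predict the loss from features + prediction.
nanny_inputs = np.column_stack([X_ref, monitored.predict(X_ref)])
nanny = GradientBoostingRegressor(random_state=0).fit(nanny_inputs, abs_err_ref)

# New data arrives WITHOUT targets: estimate the MAE from the nanny alone.
X_new, y_new = make_data(5_000)               # y_new used only to verify below
new_inputs = np.column_stack([X_new, monitored.predict(X_new)])
estimated_mae = nanny.predict(new_inputs).mean()
actual_mae = np.abs(y_new - monitored.predict(X_new)).mean()
print(estimated_mae, actual_mae)   # the two should be close
```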

  • What role does concept drift play in model failure, and how is it different from the role of covariate shift?

    -Concept drift plays a crucial role in model failure by altering the relationship between input features and the target variable, making the learned pattern obsolete. Unlike covariate shift, which changes the distribution of inputs, concept drift changes the underlying pattern itself, often requiring model retraining or updating.

Outlines

00:00

🔍 Introduction to Model Performance and Failure Causes

This segment introduces the focus of the presentation, which revolves around exploring both theoretical and practical aspects of machine learning algorithms, with a special emphasis on understanding model failures. The agenda includes a discussion of the two primary causes of potential machine learning model failure, covariate shift and concept drift, and their implications for model performance. The speaker plans to delve into how covariate shift can impact model performance in both positive and negative ways, depending on its nature and occurrence. Additionally, the introduction mentions two key algorithms designed to quantify the impact of covariate shift on model performance: Direct Loss Estimation (DLE) for regression models and Confidence-Based Performance Estimation (CBPE) for classification models.

05:01

🎯 Detailed Look at Concept Drift

This paragraph elaborates on the concept of concept drift, defining it as a change in the relationship between the target variable and the model inputs. It highlights the challenge of quantifying concept drift without access to actual targets and emphasizes the importance of detecting changes in model performance early to avoid business impact. The speaker further illustrates concept drift with an example of a shift in class boundaries, demonstrating how such a shift can significantly degrade a model's predictive accuracy. The paragraph concludes with an emphasis on the necessity of addressing concept drift promptly to prevent or minimize negative business outcomes.

10:03

🔑 Understanding Covariate Shift and Its Implications

This section delves into the intricacies of covariate shift, explaining it as a change in the model input distribution and its potential effects on model performance. The discussion covers various scenarios of covariate shift, including changes in the sampling mechanism and alterations in the underlying distribution of data features. The narrative underscores that covariate shift does not uniformly result in performance degradation; in some cases, it may even improve performance if the shift moves data points to regions where the model has high confidence. The explanation also touches upon how unseen regions or insufficiently sampled areas in the feature space can lead to performance drops, establishing a nuanced view of covariate shift's impact on machine learning models.

15:04

📊 Direct Loss Estimation (DLE) for Regression Models

The paragraph introduces the Direct Loss Estimation (DLE) algorithm, a technique for quantifying the impact of covariate shift on the performance of regression models. It outlines the prerequisites for applying DLE, including the need for reference data where model performance is known and satisfactory. The discussion elaborates on the assumptions underlying DLE, specifically the absence of concept drift, and explains the inputs required for the algorithm to function, such as model predictions, features, and, for the reference dataset, actual targets. The segment aims to provide a foundational understanding of how DLE operates and its role in assessing and mitigating the effects of covariate shift on regression models.

20:04

🔄 Performance Estimation in Regression: How DLE Works

This part offers a deeper dive into how Direct Loss Estimation (DLE) operates, using a simple regression problem as an illustrative example. It shows how the algorithm utilizes the model's predictions and the dispersion of data points to estimate performance metrics like mean absolute error or root mean squared error. The narrative explains that the expected error varies depending on the data point's location in the input space, with regions of higher data dispersion indicating higher uncertainty and expected error. This approach allows DLE to estimate the performance of regression models under covariate shift by analyzing the distribution and characteristics of the input data relative to the model's predictions.

25:05

📉 Estimating Model Performance Decline Due to Covariate Shift

This section emphasizes that covariate shift does not always lead to a decline in model performance. It presents a case where despite significant covariate shift, as measured by PCA reconstruction error, model performance may not necessarily worsen and may sometimes improve. The speaker introduces the concept of multivariate data drift detection and discusses its limitations in accurately predicting performance outcomes. The passage argues for the use of algorithms like DLE to directly assess the impact of covariate shift on performance, highlighting the importance of distinguishing between mere data structure changes and actual performance degradation.
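The multivariate detector mentioned here can be sketched as follows (a minimal version of the PCA reconstruction error idea, not NannyML's exact implementation): fit PCA on reference data, then measure how poorly new data is reconstructed from the retained components. Structural changes, such as a broken correlation, raise the error even when each marginal distribution looks unchanged.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Reference: two strongly correlated features.
x1 = rng.normal(0.0, 1.0, 10_000)
X_ref = np.column_stack([x1, x1 + rng.normal(0.0, 0.2, x1.size)])

# Drifted: each marginal is still roughly N(0, 1), but the correlation is gone.
X_drift = rng.normal(0.0, 1.0, (10_000, 2))

pca = PCA(n_components=1).fit(X_ref)

def reconstruction_error(X):
    """Mean squared residual after compressing and decompressing X."""
    X_rec = pca.inverse_transform(pca.transform(X))
    return np.mean((X - X_rec) ** 2)

err_ref = reconstruction_error(X_ref)
err_drift = reconstruction_error(X_drift)
print(err_ref, err_drift)   # drift shows up as a higher reconstruction error
```

Note that this detector only flags that the data's structure changed; as the section argues, it cannot by itself say whether performance got worse.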

30:07

🌟 Confidence-Based Performance Estimation for Classification Models

This paragraph shifts focus to classification models, introducing Confidence-Based Performance Estimation (CBPE) as a method for assessing the impact of covariate shift on these types of models. The speaker outlines what can be calculated using CBPE, including confusion matrices, precision, recall, and even business impact metrics. The narrative explains the assumption of no concept drift and details the inputs needed for CBPE. Special emphasis is placed on the need for model scores or predicted probabilities, highlighting the importance of model confidence in estimating performance and addressing covariate shift in classification contexts.

35:07

📚 Calibration and Performance Estimation in Classification Models

The final segment delves into the details of performing Confidence-Based Performance Estimation (CBPE) for classification models. It explains the process of calibrating model predicted probabilities to align with the frequency definition of probability, ensuring that they accurately represent the likelihood of positive outcomes. The speaker discusses how calibrated probabilities are used to estimate confusion matrices for individual predictions and aggregate them to assess model performance over data chunks. The paragraph concludes by showcasing how despite the absence of significant covariate shift, a model's accuracy can still experience notable declines, underscoring the critical role of CBPE in monitoring and adjusting for performance shifts in classification models.
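The aggregation step described above can be written out directly (a simplified binary version, my own illustration rather than NannyML's code): given calibrated probabilities, each prediction contributes its expected share to every confusion-matrix cell, and summing those shares over a chunk yields an estimated accuracy that needs no labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Calibrated scores for one chunk: score == true P(y=1) by construction.
p = rng.uniform(0.0, 1.0, 50_000)
pred = (p >= 0.5).astype(int)          # hard predictions at a 0.5 threshold

# Expected confusion matrix: a prediction of 1 is a true positive with
# probability p and a false positive with probability 1 - p, and so on.
tp = np.sum(p[pred == 1])
fp = np.sum(1.0 - p[pred == 1])
tn = np.sum(1.0 - p[pred == 0])
fn = np.sum(p[pred == 0])

estimated_accuracy = (tp + tn) / p.size

# Verify against labels we normally would NOT have yet.
y = (rng.random(p.size) < p).astype(int)
realized_accuracy = np.mean(pred == y)
print(estimated_accuracy, realized_accuracy)
```

The same expected cells also give estimated precision, recall, or any metric derived from the confusion matrix, which is why calibration is the load-bearing assumption here.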

Keywords

💡Model Failure

Model failure refers to a significant drop in machine learning model performance. It is a central theme in the video, which aims to explore the causes and implications of this phenomenon. The video specifically focuses on two main causes of model failure: covariate shift and concept drift. Model failure can lead to incorrect predictions or decisions in practical applications, hence understanding and addressing it is crucial for maintaining the effectiveness of machine learning systems.

💡Covariate Shift

Covariate shift is a change in the distribution of input data for a model, which occurs when the model is applied to a new environment or demographic different from the one it was trained on. The video delves into how covariate shift can affect model performance, noting that the impact can be both positive and negative. An example given is the change in data distribution due to deploying a model in a different country, leading to potential differences in model predictions and performance.

💡Concept Drift

Concept drift refers to a change in the underlying relationship between input features and the target variable over time. In the video, it is described as a significant factor that can lead to machine learning model failure, especially when the patterns the model learned no longer apply to current data. The script mentions scenarios like changes in economic behavior or demographics that can cause such shifts, highlighting the need for models to adapt to these changes to maintain accuracy.

💡Direct Loss Estimation (DLE)

Direct Loss Estimation (DLE) is an algorithm mentioned in the video used for quantifying the impact of covariate shift on regression model performance. DLE helps to estimate how changes in data distribution affect the accuracy of model predictions. By evaluating the potential loss directly from the model’s output, it provides a way to gauge the performance of regression models under covariate shift, as illustrated with examples in the script.

💡Performance Estimation

Performance estimation involves assessing how well a machine learning model is likely to perform in terms of accuracy, error rates, or other relevant metrics. In the video, this concept is crucial for understanding the impact of covariate shift and concept drift on models. The discussion includes methods for estimating model performance before actual target data becomes available, which is critical for proactive model management.

💡Sampling Mechanism

The sampling mechanism refers to the process by which data samples are selected from a population for training a machine learning model. The video explains that changes in the sampling mechanism can lead to covariate shift, impacting model performance. Examples include changes in customer demographics or market conditions that alter the distribution of the data on which the model was trained.

💡Model Confidence

Model confidence, as discussed in the video, relates to the degree of certainty a model has in its predictions. It plays a key role in evaluating the impact of covariate shift on model performance. High confidence in certain areas of the input space indicates reliable predictions, while low confidence can signal potential issues under covariate shift, affecting the overall performance of the model.
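In the talk, this confidence measure is described as the distance of the predicted probability from 0.5, rescaled so it runs from 0 (no idea) to 1 (certain); a one-liner captures it.

```python
def confidence(p: float) -> float:
    """Distance of a predicted probability from 0.5, scaled to [0, 1]."""
    return 2.0 * abs(p - 0.5)

print(confidence(0.5), confidence(0.9), confidence(0.02))
```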

💡Regression Models

Regression models are a type of predictive model used to estimate the relationship between variables and predict continuous outcomes. In the video, regression models are specifically addressed in the context of covariate shift and how the DLE algorithm can be used to quantify its impact on these models’ performance.

💡Classification Models

Classification models are discussed in the video as part of the broader conversation on machine learning performance. These models categorize input data into classes. The script mentions confidence-based performance estimation for classification models, which assesses how covariate shift affects the ability of these models to accurately classify data.

💡Business Impact

Business impact in the context of the video refers to the real-world consequences of model performance, particularly when it fails or degrades. The script emphasizes the importance of predicting and mitigating model failure to prevent adverse business outcomes, underscoring the need for timely and accurate performance estimation to safeguard against financial losses or operational inefficiencies.

Highlights

Introduction to the theoretical and practical aspects of machine learning model performance, focusing on algorithms.

Discussion of two main causes of potential machine learning model failure: covariate shift and concept drift.

Detailed examination of covariate shift, its potential impacts on model performance, and how these impacts can be both positive and negative.

Introduction of Direct Loss Estimation (DLE) for quantifying the impact of covariate shift on regression model performance.

Introduction of Confidence-Based Performance Estimation (CBPE) for quantifying the impact of covariate shift on classification models.

Overview of the essential components of machine learning models, including the sampling mechanism and the true pattern in reality they aim to capture.

Explanation of the difference between covariate shift and concept drift, and their respective impacts on model performance.

The significance of capturing model failure or performance deterioration before there is a business impact.

The challenge of quantifying concept drift without access to labels and its implications for model performance.

Using model confidence as a proxy for potential model performance and the relationship between covariate shift and model certainty.

The process of Direct Loss Estimation (DLE) for regression models, including its assumptions and required inputs.

The calibration process for predicted probabilities in classification models to ensure they match the frequency definition of probability.

How confidence-based performance estimation works to predict the impact of covariate shift on classification models without labels.

The importance of model calibration in the accuracy of performance estimation algorithms.

Demonstration that covariate shift is not always indicative of model performance deterioration.

Transcripts

play00:00

model

play00:01

performance so yeah as I mentioned we're

play00:04

going to doing both the theoretical and

play00:06

a bit of practical Deep dive here but

play00:08

really mostly focusing on the algorithms

play00:10

on the agenda we have four items the

play00:13

first one is we're going to talk about

play00:15

two main causes of potential model

play00:17

failure and here by failure I mean

play00:19

significant drop in machine learning

play00:21

model performance then in the second

play00:23

part we're going to focus on one of

play00:25

those causes which is the coar shift and

play00:27

we're going to talk how it can

play00:28

potentially impact model performance and

play00:30

we see that the story there is not going

play00:32

to be that simple and the impact can be

play00:34

both positive and negative depending

play00:36

exactly on the type and place where the

play00:39

coar shift actually happens and then

play00:41

once we have that and we bu kind of an

play00:43

intuition of how coar shift impacts mod

play00:46

performance and what is also the role of

play00:48

H uncertainty in all that we're going to

play00:51

talk about our two main algorithms that

play00:53

actually help us quantify or actually

play00:55

quantify the impact of karat shift on

play00:58

model performance and for regression we

play01:00

have an algorithm called dle which

play01:02

stands for direct loss estimation that

play01:04

helps us quantify the impact of coar

play01:07

shift on model performance you're going

play01:09

to hear that centeres a lot here for

play01:11

regression models and for classification

play01:13

models we have confidence-based

play01:15

performance intimation that does the

play01:17

exact same things but for classification

play01:19

models and here uh it works both for

play01:22

multiclass and binary classification

play01:24

models we're going to be focusing on the

play01:25

binary example just for sake of

play01:28

Simplicity but everything here Will

play01:29

gener analyze also to multiclass

play01:33

problem so now let's get started with

play01:35

the first thing which is the two main

play01:36

causes of machine learning model

play01:40

failure just before we do that we'll do

play01:43

a quick refresh of what are the actual

play01:45

moving Parts uh for our machine learning

play01:47

model what are the things that we ingest

play01:49

what are what are the things we're

play01:50

trying to achieve with supervised

play01:52

machine learning so the first thing is

play01:54

that there exist some kind of true

play01:56

pattern in reality that we're trying to

play01:58

capture in our example here we have just

play02:00

one Feature Feature X and we're trying

play02:02

to predict a binary outcome and we see

play02:05

that as feature X increases we have this

play02:07

kind of sigmo pattern that the

play02:08

probability of a positive class

play02:12

increases um and then if x is very high

play02:15

also it's almost uh no sorry the

play02:17

probability actually decreases and the

play02:19

uh the probability of negative class is

play02:21

almost 100% if x is very high um and

play02:24

this is the pattern that our model will

play02:26

try to capture based on the data we have

play02:29

and the data we have is the sampling

play02:30

mechanism so it's the way we sample the

play02:33

data from our population we never have

play02:35

access to the full population but we

play02:36

have centered sample of our data for

play02:39

example our current customers or all the

play02:43

past uh events that we managed to

play02:45

collect and that just forms our uh data

play02:49

that we then use to train our

play02:51

model and then imagine that we did train

play02:53

the model we manag to capture correctly

play02:55

the pattern that we see in the top of

play02:58

the slide what can then happen

play03:00

the first thing that can happen is

play03:01

so-called coar shift coar shift happens

play03:04

when our sampling mechanism is changed

play03:06

it's not something that we necessarily

play03:07

control ourselves but it might be that

play03:09

we deployed our model in one country but

play03:12

we uh sorry we trained the model in one

play03:14

country but then we need to deploy it in

play03:16

another country or just a different

play03:18

segment of population or even the

play03:20

segment is the same but slight

play03:22

distribution of let's say age changed or

play03:25

we deployed our model on one machine and

play03:27

now it needs to operate on slightly

play03:29

different machine

play03:30

then we see that there is a difference

play03:32

in something and that will also

play03:33

influence how our data looks lies and

play03:36

not only the data but also the actual

play03:38

targets or the labels that we have and

play03:40

the class balance might also change and

play03:43

that may or may not impact the actual

play03:45

performance of the model and I'm going

play03:47

to delve deeper into that in the second

play03:50

part of the presentation but now let's

play03:52

focus on defining it fully so the coar

play03:55

Shi shift is basically the change in the

play03:58

joint model input distribution so the

play04:00

probability of all x's and we see one

play04:04

example of that in one dimension that

play04:06

just the mean mate shift but it also

play04:08

might be any other structural change in

play04:11

the joint model input distribution and

play04:15

it might be that even on every specific

play04:18

Dimension you will see that the

play04:19

distribution does not change but for

play04:21

example the correlation between

play04:23

different features

play04:25

changes then the second reason why

play04:28

machine learning mods May Fail

play04:30

is so-called concept drift and here uh

play04:32

what changes is the actual true pattern

play04:34

that exists let's say that people used

play04:37

to default on loans uh when they made

play04:39

less money but this for some reason no

play04:41

longer holds if that happens the pattern

play04:44

that our model has learned is no longer

play04:47

fully relevant for the real world that

play04:49

the model is operating it so it's very

play04:51

likely that uh the model will actually

play04:54

fail and that's one of the reasons why

play04:56

we're not focusing on it another reason

play04:58

is that to really quantify concept drift

play05:00

we do need to have access to labels why

play05:03

is that because concept drift is simply

play05:05

defined as the change in the

play05:07

relationship between the Target and the

play05:09

model inputs so it's the conditional

play05:11

probability of the target given the

play05:14

feature vector and if that happens we

play05:17

really cannot quantify it before we have

play05:19

access to the targets and we do need to

play05:22

know the model performance and quantify

play05:24

the change of model performance before

play05:26

we have access to the targets and I'll

play05:28

explain that in the next slide hope

play05:31

anything is clear again if you have any

play05:34

questions uh do put it in the Q&A and I

play05:37

will answer the questions at the end of

play05:39

the

play05:41

presentation uh now let just visualize

play05:43

the concept drift just give a bit more

play05:46

of an intuition uh so here we see that

play05:49

the actual true class boundary that

play05:51

exists in reality was this kind of

play05:53

vertical thing that goes like that

play05:55

vertical vertical and then due to some

play05:58

change in how the world operates maybe

play06:01

something like a pandemic or some other

play06:04

huge event or maybe something slower

play06:06

that actually change the actual concept

play06:08

gradually we see that instead of having

play06:11

this class boundary that goes like that

play06:13

we have now basically a horizontal class

play06:16

boundary and of course the model Lear

play06:18

the first uh boundary it's going to

play06:20

operate really badly on the second

play06:23

boundary so this is conent drift this is

play06:25

not something I'm going to focus on now

play06:27

we're going to be focusing on the first

play06:29

reason why machine learning models can

play06:31

fail which is the covarage

play06:34

shift and before we go into really

play06:37

discussing the intuition of how cage

play06:40

shift can impact model performance we

play06:41

need to talk one about one really

play06:44

important thing which is we generally

play06:47

want to catch model failure or catch

play06:49

change or deterioration in model

play06:51

performance before there is business

play06:54

impact and what you see is basically on

play06:56

a timeline first what we the first thing

play06:58

we get is we get model inputs these

play07:00

model inputs are then fed to the model

play07:03

the model will make predictions and as

play07:05

the predictions are made then these

play07:07

predictions will be processed in some

play07:09

way maybe it's going to be a credit

play07:11

scoring model the predictions are credit

play07:13

scores these predictions will then be

play07:15

used to either deny or Grant loans for

play07:18

predictive maintenance use cases you're

play07:20

going to uh get model outputs whether

play07:22

the model needs maintenance or not and

play07:24

based on that the maintenance will be uh

play07:26

performed or not so there's going to be

play07:28

some kind of business impact that the

play07:30

model um will basically create and what

play07:34

we want to do and only once this model

play07:36

business impact happens we then possibly

play07:39

get the targets so in case of loans

play07:41

we're going to have to wait quite long

play07:44

time let's say two three years whatever

play07:45

is the length of the loan to see whether

play07:47

the person actually defaulted on it or

play07:49

not H and then we're going to keep on

play07:51

making more and more predictions in the

play07:53

time and what we want to do ideally is

play07:56

catch model failure before there is

play07:59

business impact so before the loan is

play08:01

granted or not as ideal but also quite

play08:04

okay shortly after the loan is granted

play08:07

so then we can stop the mold from

play08:09

operating let's see for this remaining

play08:11

two years and granting more and more

play08:13

loans so we want to catch uh malary as

play08:16

soon as possible to minimize or ideally

play08:19

eliminate uh the business impact and

play08:22

that also means that we never have the

play08:24

targets before it's too late because if

play08:25

we have the targets, the model has already acted, and we got feedback from the real world. So we have to focus on estimating the impact of covariate shift on model performance without access to targets, because by definition, if we have access to targets, there is already business impact and the damage has been done.

I hope that's clear. Now let's talk about the intuition of how we can quantify the impact of covariate shift on model performance without access to the target data. The first thing we're going to do is look at model predictions. We have this beautiful dragon fruit picture here, which shows the confidence of the model depending on where in the model input space a given data point is located. What we did here is train a model, I think a nonlinear SVM because it gives very nice smooth shapes, on a simple XOR problem. What we see is that where there is a concentration of data points from one class, the model tends to be very confident that predictions there are going to be positive or negative. The predicted probabilities are concentrated very close to either one or zero, and the confidence is basically just a simple measure of how far from 0.5 the prediction is. If the model confidence is zero, the model's predicted probability is 0.5; if the predicted probability is close to either zero or one, then the confidence measure we see here is going to be very close to one. It's basically just an absolute value, but centered at 0.5.
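The confidence measure just described can be sketched in a few lines (an illustrative helper, not NannyML's exact implementation):

```python
import numpy as np

def confidence(p):
    """Distance of a predicted probability from 0.5, rescaled to [0, 1].

    0 -> the model is maximally unsure (p == 0.5)
    1 -> the model is maximally sure (p == 0.0 or p == 1.0)
    """
    p = np.asarray(p, dtype=float)
    return np.abs(p - 0.5) / 0.5

probs = np.array([0.5, 0.7, 0.95, 0.02])
conf = confidence(probs)  # array([0.  , 0.4 , 0.9 , 0.96])
```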

play10:20

We can actually use this model confidence as a kind of proxy for potential model performance. To build a bit more intuition, imagine we take a point very close to the class boundary. There is not much information to go on when trying to predict whether that data point is going to be positive or negative; if it lies exactly on the class boundary, it's basically a coin toss, it's 50%. So for data points close to the class boundary we expect performance to be very low, because there is no information to really use to predict the target, and we see that the model confidence is also very low: that's the intense pink there. Also, if we consider potential points that would appear around, say, the (-3, -3) point in the feature space, we would expect the model to really not know what to do, and we would expect very bad performance, because the model was not trained on such data. So we can use model confidence as a proxy for expected model performance. But what is the role of covariate shift in all of this?

play11:31

The main point is that covariate shift is a change in the distribution of data in the model input space. On the test data, where we tested the model and decided that its performance was satisfactory, we had an input distribution that was fairly wide and covered most of the green regions where we expect high model performance. But as we deploy our model to production, for one reason or another, most of the data points become concentrated very close to the (0, 0) region, which is exactly in the middle of the red zone where it's very hard to predict whether a given data point is going to be positive or negative. So if we see that kind of pattern in our covariate shift, in our drift, we expect to see a drop in performance.

On the opposite side of the spectrum, if there is significant covariate shift but the data happens to drift to regions where the model is very confident in its predictions, we actually expect either no significant change in model performance or, in very extreme cases like the one you see here, an increase in performance. This means covariate shift is not always bad news: if the data drifts away from the class boundary into regions where the predictions are very easy to make, because for example they are all ones, then we might actually see an increase in performance and potentially an increase in business impact.

The last typical case we might see is drift to regions that were not seen, or not properly seen, during training, so there was not enough data to capture the correct pattern in that region. If that happens, because the model didn't learn the pattern there, it will probably predict badly. As we see here, if the data drifts into this unseen region, we expect a drop in performance; the model is basically predicting random values there, and we would actually expect model failure.

play13:48

Now that we've covered the intuition of how to think about the relationship between covariate shift, model certainty or confidence, and expected performance, let's formalize this thinking with our first algorithm, for regression models, called Direct Loss Estimation (DLE). With this algorithm we will be able to take the model predictions and turn them into the expected performance of the model given the current feature distribution, so given the covariate shift we are experiencing.

First, what can we estimate with DLE? Basically any metric, as long as it can be quantified on an instance level. We can look at root mean squared error, which on the instance level is just the squared error; we aggregate it over a certain period of data or a certain number of data points and take the root of the mean of those squared errors. The same goes for absolute error, where we take the mean absolute error, and again for logarithmic errors the same thing follows. So basically any metric that's really used in practice for evaluating regression models.
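As a sketch of what "quantified on an instance level" means, here are instance-level errors and their usual aggregations (illustrative helper names, not NannyML's API):

```python
import numpy as np

def instance_losses(y_true, y_pred):
    """Per-row losses that a DLE-style estimator can work with."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return {
        "squared_error": (y_true - y_pred) ** 2,
        "absolute_error": np.abs(y_true - y_pred),
    }

def aggregate(losses):
    """Turn instance-level losses into the usual regression metrics."""
    return {
        "rmse": float(np.sqrt(losses["squared_error"].mean())),
        "mae": float(losses["absolute_error"].mean()),
    }

metrics = aggregate(instance_losses([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))
```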

play15:00

What do we need in order to use DLE? The first thing is so-called reference data: data for which we do have the targets and whose performance we are happy with, so performance we can treat as a benchmark. The typical choice for the reference data is the test set, because this is by definition a data set for which we have the targets; we were able to test the model on it, evaluate the performance, and we were happy with it. The second thing we need is the actual data in question, for which we are trying to estimate the performance. This is just the monitored data, the production data that streams in as we make predictions, whether in batches or streaming, it really doesn't matter. Importantly, we don't need the targets, because we are trying to quantify the impact of covariate shift on performance before there is business impact, so before targets arrive as feedback from the real world. The last thing, which we don't strictly need but do have to specify, is how we want to aggregate our data. Imagine that instead of aggregating the data into chunks, we wanted to estimate the performance for every single data point. This is something we could potentially do, but unfortunately in practice the estimation error would be very high, just because of the stochastic nature of the data. What is more practical instead: we look at a collection of data points, say a day of data, an hour of data, or the last 1,000 data points, we aggregate those, and we assume that the performance within that specific chunk is approximately constant, which is generally a very reasonable assumption for almost all use cases. Then, instead of looking at very uncertain estimates for every single data point, we can significantly reduce the uncertainty of the performance estimation, and we look at chunks of

play17:06

data. An interesting thing: in NannyML we also output confidence bands, so if your chunks are very small you will get very wide confidence bands, which means you shouldn't trust the estimates too much and should potentially aggregate the data at a higher level to get more data points per chunk.
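The chunking idea can be sketched as a fixed-size chunker (a simplification; NannyML's actual chunkers also support time-based and count-based splits):

```python
import numpy as np

def chunked_mae(abs_errors, chunk_size=1000):
    """Aggregate instance-level absolute errors into per-chunk MAE.

    Estimating performance per single data point is too noisy, so we
    assume performance is roughly constant within a chunk and report
    one value per chunk of `chunk_size` consecutive points.
    """
    abs_errors = np.asarray(abs_errors, dtype=float)
    n_chunks = len(abs_errors) // chunk_size
    trimmed = abs_errors[: n_chunks * chunk_size]
    return trimmed.reshape(n_chunks, chunk_size).mean(axis=1)

per_chunk = chunked_mae(np.ones(2500), chunk_size=1000)  # two full chunks
```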

play17:32

All right, now for the assumptions. There is really only one assumption we need to be aware of, which is that there is no concept shift. As I already mentioned, concept shift can also impact performance, so to get a good performance estimate we basically have to assume there is no concept shift; then we can really use DLE to get a good performance estimation. The other issue is that concept shift itself, the change in the conditional probability of Y given X, will also alter the uncertainty landscape. So the very thing we want to leverage would have changed, and there is no way to quantify exactly how that change happens before we have the targets. So we assume there is no concept shift.

This is actually not a very strong assumption for most use cases. For any use case that deals with the physical world, the physics doesn't change, so concept shift is very rare. For customer analytics use cases you might see concept shift, but it's generally something that will be well known and obvious from a business perspective. Say you train a model in Germany and then try to deploy it in China: of course the consumers there react differently, think differently, behave differently, so you would obviously expect some concept shift. But barring these kinds of extreme changes, concept shift rarely happens in practice in a way that is very influential on short time scales. We also have a way to measure concept shift, but it requires targets, so we rely on it not happening too often; as long as that's the case, we can wait for the targets and quantify the impact of concept shift as well. But enough about concept shift; let's assume for the time being that it does not exist and the actual pattern the model has captured is still correct.

What do we take as inputs? First, the model predictions: whatever our model predicts, a continuous real-valued number. We also take all the features the model consumes, and we take the target, also known as the ground truth, for the fitting part only, so only for the reference data set, because on the reference data set we're going to fit the DLE. It's also a machine learning algorithm, and it is going to learn what the model uncertainty is and how it maps to expected model

play20:06

performance. Then, how is performance estimation possible? We already talked about the intuition for the classification variant; now let's talk about how exactly it works for a regression use case. Imagine we have a very simple regression problem with just one model input, the value x, and we're trying to predict y. What we see is basically a perfect linear relationship. We capture all the information in our data set to create the best estimator possible, which happens to be plain linear regression, because I created the data to follow a linear pattern. But we also see another very interesting thing: the data points seem to be more dispersed the closer they are to x = 0, and if you go higher, where x is close to or equal to 10, the regression is basically perfect and there is almost no noise in the data. That means if we pick a point where the value of x is close to zero, we expect a bigger error, so lower model performance, whereas if a lot of data points come from the region between, say, 9 and 10, we expect very good predictions, so very good model performance and very low errors, however we define our error, absolute or squared, it doesn't matter. Just to validate that this is the case, I calculated the rolling mean over x, for each 100 data points, of the actual realized absolute error of the predictions, and we see that indeed for lower values of x we observe higher errors on average, and as we increase the value of x the errors get lower.
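The rolling-error check described above can be reproduced on synthetic data; the data-generating process here is an assumption chosen to mimic the example, with noise that shrinks as x grows:

```python
import numpy as np

# Near x = 0 the target is noisy, near x = 10 it is almost deterministic.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 2000))
y = 2 * x + rng.normal(0, (10 - x) / 5)   # heteroscedastic noise

y_pred = 2 * x                            # the "perfect" linear model
abs_err = np.abs(y - y_pred)

# Rolling mean absolute error over windows of 100 points sorted by x.
window = 100
rolling_mae = np.convolve(abs_err, np.ones(window) / window, mode="valid")

# Errors near x = 0 should be larger on average than near x = 10.
low_x_mae = rolling_mae[:100].mean()
high_x_mae = rolling_mae[-100:].mean()
```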

play22:08

This is really the key intuition: the dispersion of points corresponds to uncertainty, and it is something we can actually measure. Based on where a data point lies in the model input space, we can find its expected error, based on how much information there is about the target in that region of the model inputs. To visualize it in two dimensions, imagine we now have two features; of course this generalizes to hundreds or thousands of features, no matter how big the model input space is. We can quantify this expected error, and we'll see how in a second. Imagine that on the test data we have data points distributed as the blue points show, and as we deploy our model to production, unfortunately a lot of data points drift into the red areas. That means we expect a drop in model performance for a given chunk of data. We're going to aggregate all those predictions to get, say, the mean squared error or mean absolute error, and that's the metric we're going to be estimating. It actually turns out that, as long as there is no concept shift, this is the true expected mean absolute error or mean squared error that we expect to see once the targets

play23:35

arrive.

Now, the algorithm itself. We have very detailed documentation and blog posts about how the algorithm works; we do open source, we're not hiding anything, so you can see exactly how it works, and DLE is fully available in both our open-source library and the cloud product. How it works: first, we calculate the loss metric on the reference data. For the reference data, for example our test data, we have the targets, so we can simply calculate the actual loss metric we're using, such as MAE. We do it for every point separately, so we get all the absolute errors. Then we train a model to estimate this loss metric, the absolute errors, using the features that the monitored model consumes, plus the monitored model's predictions as a feature, because these predictions can also be informative about the error itself, and in practice we see that they are. Then we take that regressor, say a gradient boosting model, which is actually what we use in both our open-source library and the cloud product, and we use it, the DLE model, to directly estimate the loss on the monitored data. For every data point we estimate the expected absolute error, and once we have all those absolute errors for a given chunk of data, say a day of data, we aggregate them per chunk. The output is the expected mean absolute error per chunk.
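The steps above can be sketched roughly as follows, assuming scikit-learn's GradientBoostingRegressor as the "nanny" model; this is a simplified illustration, not NannyML's exact implementation:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_dle(X_ref, y_pred_ref, y_true_ref):
    """Fit a 'nanny' model on the reference set to predict the
    monitored model's instance-level absolute error."""
    abs_err = np.abs(y_true_ref - y_pred_ref)         # step 1: per-point loss
    features = np.column_stack([X_ref, y_pred_ref])   # step 2: features + prediction
    return GradientBoostingRegressor().fit(features, abs_err)

def estimate_mae(nanny, X_prod, y_pred_prod):
    """Estimate MAE on production data without targets (steps 3 and 4)."""
    features = np.column_stack([X_prod, y_pred_prod])
    return float(nanny.predict(features).mean())      # aggregate per chunk
```

A chunk that drifts into a noisy region of the input space should then get a higher estimated MAE than a chunk in a low-noise region.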

play25:18

What does this look like? Here I wanted to show one thing that might not be obvious to some of you: just because there is covariate shift does not mean that performance has dropped. What you see on the left is an overall measure of change in the joint model input distribution, the data structure, so a way to measure multivariate data drift. It's called PCA reconstruction error, one of the algorithms we have for multivariate drift detection, for multivariate covariate shift detection. The higher it is, the stronger the change between a given chunk and the reference data in terms of data structure: the data might be distributed differently, look different, and we know the internal data structure has changed if that value goes up. I'm not going to go into the details of how it works, but we can share resources on that later.
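A minimal sketch of the PCA reconstruction error idea, assuming a scaler plus PCA fitted on reference data (the details of NannyML's implementation differ):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def fit_reconstructor(X_ref, n_components=2):
    """Fit a scaler + PCA on reference data; the kept components compress
    the data, so reconstruction quality reflects its internal structure."""
    scaler = StandardScaler().fit(X_ref)
    pca = PCA(n_components=n_components).fit(scaler.transform(X_ref))
    return scaler, pca

def reconstruction_error(scaler, pca, X):
    """Mean Euclidean distance between points and their PCA reconstruction.
    Higher values mean the chunk's structure differs from the reference."""
    Z = scaler.transform(X)
    Z_hat = pca.inverse_transform(pca.transform(Z))
    return float(np.linalg.norm(Z - Z_hat, axis=1).mean())
```

Data whose inter-feature relationships break relative to the reference set reconstructs poorly, so the error rises even if each feature's marginal distribution looks unchanged.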

play26:12

But what I wanted to convey here is that even when the data drifts and we see very strong covariate shift, it does not mean there is any issue with model performance. Even more interestingly, you see two spikes in MAE here, but they don't really correspond to anything specific in terms of data drift. You really need to quantify the impact of data drift, of covariate shift, on performance; you cannot just rely on covariate shift measures as a proxy for performance, because there is basically no relationship whatsoever, and we need to be able to find the expected performance as fast as possible. There is a place for drift detection, mostly as a root cause analysis tool, where we can figure out exactly which features drifted, how they are drifting, and what the main reason is that our models are failing within the umbrella of covariate shift. But for estimating whether the performance is good, we need to rely on algorithms such as DLE to really know the impact of covariate shift on performance.

So that's DLE: that's how we estimate the impact of covariate shift on performance without access to labels, in a way that actually gives you the expected performance in the performance metric itself. That is a much more accurate measure of the current level of your predictive performance than just doing simple covariate shift detection or data drift

play27:48

detection.

Now the last part: how we can quantify the impact of covariate shift on performance for classification models. Here we have Confidence-Based Performance Estimation; we're very creative with names, as you can see, and it does basically what it says it does. Let's follow the same format as before. First, what can we calculate? We can calculate, first of all, the confusion matrix, in both binary and multiclass variants; I'm going to focus on binary for the sake of simplicity here. We can also calculate precision, recall, accuracy, basically any metric you can get as long as you have the confusion matrix, because we'll actually get the confusion matrix first and then compute any metric we want. You can also compute metrics that need a range of confusion matrices: you change the thresholds, you get different confusion matrices, and then you construct a curve such as the ROC curve, so we can also get metrics like ROC AUC, average precision, etc. And the last thing we can get, similarly to DLE actually, is the business impact. For some use cases, say credit scoring, you know the expected monetary loss if somebody defaults on a loan, and you also know how much money you make on average per loan that is granted. If you know that, you can estimate not only the machine learning metrics but also the business metric: how much money the model is making per chunk of data, per day of operation. This can be very useful for communicating with business stakeholders, because it increases the visibility of your model: you can say that the model you deployed to production made, say, 50,000 today, which is obviously also useful from a personal career progression perspective, because you can really show the impact of your work.
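The business-metric idea can be sketched as a weighted sum over confusion-matrix cells; the credit-scoring values below are hypothetical:

```python
def business_value(confusion, value_per_cell):
    """Weight each confusion-matrix cell by its monetary value.

    `confusion` and `value_per_cell` are dicts keyed 'tp', 'fp', 'fn',
    'tn'; the counts may be fractional when they come from an
    estimated (expected) confusion matrix.
    """
    return sum(confusion[k] * value_per_cell[k] for k in ("tp", "fp", "fn", "tn"))

# Hypothetical economics: a correctly approved loan earns 300, approving
# a future defaulter loses 2,000, a missed good customer costs 300 in
# forgone profit, a correctly rejected application is neutral.
values = {"tp": 300.0, "fp": -2000.0, "fn": -300.0, "tn": 0.0}
chunk_confusion = {"tp": 800.0, "fp": 40.0, "fn": 60.0, "tn": 100.0}
profit = business_value(chunk_confusion, values)  # 142000.0
```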

play29:53

The assumption is exactly the same: there is no concept shift. The inputs are very similar to DLE, but I wanted to mention one specific thing: we not only need the model predictions, zero or one for the binary case, but also the model scores, the predicted probabilities, because instead of building our own model we're going to leverage those to estimate the uncertainty of each prediction. And the last thing is that we need the targets for the reference data set, just as before. Now I'm going to skip the

play30:29

intuition part, because we already covered it at length, both in section two about how covariate shift can impact performance and in DLE, and the intuition is exactly the same: if the model is more uncertain, we expect a drop in model performance.

What is the recipe, the actual algorithm we're using? It has five points. First, we take the model predictions and the predicted probabilities. Then the predicted probabilities need to be calibrated. Calibration means that we take the predicted probability your model outputs, say 0.7, and ensure that this 0.7 actually corresponds to the frequentist definition of probability: for a large number of samples, say 10 million data points that all score 0.7, we expect 70% of them to turn out positive. That is what it means to have a 0.7 probability according to the frequentist definition, and we're going to ensure that the model's predicted probabilities actually follow that definition on average, in expectation. Then, once we have those calibrated probabilities, we find the expected confusion matrix for every point. This is really the magic step, where we turn those predicted probabilities and model predictions into a confusion matrix for every data point separately. Once we have that, we aggregate the confusion matrices over a chunk, so instead of, say, a thousand different instance-level confusion matrices, we get one big confusion matrix per chunk. And once we have the expected confusion matrix for a given chunk, we can use it to calculate any metric we care about, including the business impact, as long as we know the value of each true positive, false positive, etc. Now we're going to go over each

play32:28

of these points step by step. It requires a bit more nuance than DLE, where we just train a model to predict the errors; here we have a few more steps. I'm going to skip point number two at first: we'll assume the calibration is already there, and we'll come back to it after the other parts.

Now, for how we actually do it: we start with the confusion matrix and try to fill it in, even though we don't have the targets. Normally we would have the actuals column here, so we would need to know what the actuals are, but for this algorithm we assume we have no targets, so we need to do something else, and it turns out we actually can, which is really the magic behind it. The first point is that we look at the model prediction. Say the model makes a positive prediction. That already gives us some information about the confusion matrix, because if the model predicts positive, it's definitely not a negative prediction, so we already know it's not a false negative and not a true negative: in both of those cells we put zero. We also know the model predicts, say, 0.7, and for the time being I'll assume these probabilities are perfectly calibrated, so the 0.7 actually corresponds to a 70% chance that this prediction will turn out to be positive. Now remember that at the end we're going to put those predictions together, so we can work with an average. So we say that, on average, in expectation, there is 0.7 of a true positive here and 0.3 of a false positive, and as we aggregate the predictions in the chunk, we get to an expected confusion matrix following exactly that method. For the second prediction we do the same thing: it's 0.4, and it's a negative prediction, so we put 0.4 in false negative and 0.6 in true negative, because that's the remainder, 1 minus 0.4 is 0.6. For the third prediction, the same as the first, etc. We should do this for a few hundred more data points, but let's say our chunk is only three data points. Once we have that expected confusion matrix, we can take a metric, say accuracy, and estimate it using it: accuracy is just true positives plus true negatives divided by the total number of data points, and we already have all the information we need in our confusion matrix, so we calculate the accuracy and we have the expected accuracy.
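The three-data-point walkthrough above can be written out directly (an illustrative sketch of the expected confusion matrix, not the full CBPE implementation):

```python
def expected_confusion_matrix(preds, probas):
    """Fill an expected confusion matrix from calibrated probabilities.

    preds  : hard predictions (0 or 1)
    probas : calibrated probability of the positive class for each row
    A positive prediction contributes p to TP and 1 - p to FP;
    a negative prediction contributes p to FN and 1 - p to TN.
    """
    tp = fp = fn = tn = 0.0
    for pred, p in zip(preds, probas):
        if pred == 1:
            tp += p
            fp += 1 - p
        else:
            fn += p
            tn += 1 - p
    return tp, fp, fn, tn

def expected_accuracy(preds, probas):
    tp, fp, fn, tn = expected_confusion_matrix(preds, probas)
    return (tp + tn) / (tp + fp + fn + tn)

# The chunk from the walkthrough: two positive predictions at 0.7 and one
# negative prediction with probability 0.4 of being positive.
acc = expected_accuracy([1, 0, 1], [0.7, 0.4, 0.7])
```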

play35:15

That's it, that's really the whole algorithm, as long as the probabilities are calibrated. And what are these numbers here? These are the two numbers we get from the probability, so for them to be correct, the probability itself also needs to be correct. But in practice it is actually not. So how do we know whether the

play35:40

So how do we know whether the probabilities that our model outputs are actually correct? The way to inspect that is the so-called calibration curve, which is exactly what you see here. What does it show? On the x-axis you have the mean predicted probability of a certain bucket of points. How do we create these buckets? We take a data set that the model has not seen before, such as the test set, our reference data set, and we make predictions on the entire data set. Then we sort these predictions by the predicted probability that the model outputs and aggregate them into buckets: say the first bucket goes from 0 to 0.1, the second from 0.1 to 0.3, and so on. We then calculate the mean predicted probability for each bucket, and for each bucket we count the fraction of actual positives that we observe in that bucket of our test set. If the probabilities are correctly calibrated, we expect that if the mean predicted probability of a bucket is, say, 0.3, then we see 30% positives in that bucket.
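The bucketing procedure can be sketched as follows (an illustrative helper with hypothetical bucket edges, not the plotting code from the talk):

```python
def calibration_points(y_true, y_prob, edges):
    """For each bucket [lo, hi), return (mean predicted probability,
    fraction of actual positives); empty buckets are skipped.
    For a well-calibrated model the points lie on the diagonal."""
    points = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        bucket = [(p, t) for p, t in zip(y_prob, y_true) if lo <= p < hi]
        if bucket:
            mean_pred = sum(p for p, _ in bucket) / len(bucket)
            frac_pos = sum(t for _, t in bucket) / len(bucket)
            points.append((mean_pred, frac_pos))
    return points

# Toy reference set, using the bucket edges mentioned above (0-0.1, 0.1-0.3, ...):
probs = [0.05, 0.08, 0.20, 0.25, 0.70, 0.75, 0.90]
labels = [0, 0, 0, 1, 1, 1, 1]
curve = calibration_points(labels, probs, [0.0, 0.1, 0.3, 0.6, 1.0])
```

Plotting `curve` (mean predicted probability against fraction of positives) gives the calibration curve; deviation from the diagonal is miscalibration.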

play37:04

But this is unfortunately not the case for the vast majority of machine learning algorithms, I believe for everything that's not logistic regression: gradient boosting models, any kind of ensemble or bagging models, and deep neural networks (for deep neural networks it's really bad, actually). We see either this S-shaped curve or the inverse S-shaped curve; the inverse S-shaped curve really only happens for Bayesian models as far as I know, but don't quote me on that, I'm not sure. The point is that we want a straight line, and we don't have a straight line. So what can we do? We can actually just make it into a straight line with another regression model that we train on the test set, on the reference data.

play37:47

How does it work? To turn the predicted probabilities into actual probabilities we only need two steps. First, we take just one feature, which is the predicted probability. Then we use it to predict the probability of y, our target, being equal to one, given that predicted probability. So we just do one-dimensional regression, and we can use any regression algorithm here: gradient boosting, simple linear models, whatever you want. With that we can straighten the line you see here, so instead of an S-shaped curve you get a straight curve.
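The talk leaves the choice of one-dimensional regressor open (gradient boosting, a linear model, and so on). A minimal stand-in, assuming simple histogram binning rather than whatever regressor NannyML actually uses, maps each predicted probability to the observed positive rate of its bucket on the reference set:

```python
class BinnedCalibrator:
    """Histogram calibration: learn the observed positive rate per
    predicted-probability bucket on reference data, then map new
    scores through that table. Any 1-D regressor could replace this."""

    def __init__(self, n_bins=10):
        self.n_bins = n_bins

    def fit(self, y_prob, y_true):
        sums = [0.0] * self.n_bins
        counts = [0] * self.n_bins
        for p, t in zip(y_prob, y_true):
            b = min(int(p * self.n_bins), self.n_bins - 1)
            sums[b] += t
            counts[b] += 1
        # Observed positive rate per bucket; empty buckets fall back to the midpoint.
        self.table = [
            sums[b] / counts[b] if counts[b] else (b + 0.5) / self.n_bins
            for b in range(self.n_bins)
        ]
        return self

    def transform(self, y_prob):
        return [self.table[min(int(p * self.n_bins), self.n_bins - 1)]
                for p in y_prob]

# Scores near 0.05 were positive half the time on reference data, so they map to 0.5:
cal = BinnedCalibrator().fit([0.05, 0.05, 0.95, 0.95], [0, 1, 1, 1])
```

Fitting on held-out reference data (never on the training set) is the important part: calibration learned on training data would inherit the model's overconfidence.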

play38:27

Once we have that, we can say that the probabilities are properly calibrated, but only on the aggregate, on average, and also only for the test set, which is one of the issues with CBPE that we solve with our new, better algorithm called M-CBPE, where the M stands for multi-calibrated; that one is available in the cloud. But let's assume now that these probabilities are properly calibrated even if there is change. That means we can follow the reasoning I outlined on the slides before to get the expected confusion matrix for every point, then aggregate it per chunk, and then compute the metric. That's basically how the algorithm works.
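Putting the pieces together, the per-chunk estimation can be sketched like this (a toy composition of the steps above, not the library's code): calibrated probabilities go in, and one estimated accuracy per chunk comes out.

```python
def estimated_accuracy_per_chunk(calibrated_proba, chunk_size, threshold=0.5):
    """Build the expected confusion matrix per chunk of predictions
    and read off the estimated accuracy for each chunk."""
    estimates = []
    for start in range(0, len(calibrated_proba), chunk_size):
        chunk = calibrated_proba[start:start + chunk_size]
        tp = sum(p for p in chunk if p >= threshold)     # expected true positives
        tn = sum(1 - p for p in chunk if p < threshold)  # expected true negatives
        estimates.append((tp + tn) / len(chunk))
    return estimates

# Two chunks of three predictions each, no ground-truth labels needed:
per_chunk = estimated_accuracy_per_chunk([0.9, 0.4, 0.2, 0.8, 0.6, 0.7],
                                         chunk_size=3)
```

Each entry of `per_chunk` can then be compared against a threshold derived from reference-set performance to flag degradation before labels arrive.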

play39:06

Now for the results, I wanted to show kind of the opposite case of what you saw for the regression use case. Here what you see is that the PCA reconstruction error, our measure of multivariate data drift, stays roughly constant and stays within the threshold: there is no strong covariate shift. Even so, we see a significant drop in the accuracy, and the estimated and realized accuracy follow each other very closely, so we are good at estimating accuracy here. We see that there is a big dip of roughly 20% compared to the test set, so it's definitely business significant, even though there is no covariate shift. So again, if you just monitor covariate shift, the model might actually be experiencing failure, there might be significant performance degradation, but it's not something you will see just based on the covariate shift measures.

That is really it, so thanks for listening. We have our documentation that outlines the algorithms in detail, both for the cloud and the open source, and as for the cloud it's something
