AWS re:Invent 2020: Detect machine learning (ML) model drift in production

AWS Events
5 Feb 2021 · 29:50

Summary

TL;DR: This AWS re:Invent session, led by Principal Solutions Architect Sireesha Muppala, delves into detecting machine learning model drift in production using Amazon SageMaker. The session covers the importance of model monitoring, introduces SageMaker's Model Monitoring capability, and outlines the end-to-end deployment and monitoring process. It discusses strategies for addressing model drift, including retraining, and provides a functional notebook demonstrating the code and APIs behind SageMaker's monitoring steps, ensuring models remain accurate and reliable over time.

Takeaways

  • 🌟 Amazon SageMaker is a fully managed service that streamlines the machine learning process, including data collection, model training, deployment, and monitoring.
  • 🔍 Model Monitoring in SageMaker is crucial for detecting model drift in production, ensuring models remain accurate and reliable over time.
  • 🛠️ SageMaker's Model Monitor capability automates the monitoring of machine learning models in production, detecting errors and triggering corrective actions.
  • 📈 Model drift, both in data and performance, can significantly impact prediction quality, making continuous monitoring essential for maintaining model accuracy.
  • 🔄 SageMaker allows for the setting up of alarms in Amazon CloudWatch based on model monitoring metrics, enabling proactive management of model performance.
  • 📊 Data drift and accuracy drift metrics are persisted in S3 buckets and can be visualized in SageMaker Studio, providing insights into model behavior.
  • 🚀 The end-to-end flow for deploying and monitoring models in production includes deploying the trained model, capturing inference requests, baselining, and reacting to drift detection.
  • 📚 The demo in the session showcased how to use SageMaker to host trained models, capture inference data, generate baseline statistics, and monitor for data quality drift.
  • 🔧 SageMaker processing jobs can be used to automate the analysis of captured inference data against baseline constraints, identifying data drift violations.
  • 💡 CloudWatch alerts can be configured based on threshold values for drift metrics, triggering actions such as retraining when model performance degrades.
  • 👨‍🏫 The session emphasized the importance of understanding and managing model drift to ensure business outcomes are not negatively impacted by outdated or degraded models.

Q & A

  • What is the main focus of the re:Invent session presented by Sireesha Muppala?

    -The session focuses on detecting machine learning model drift in production using Amazon SageMaker, discussing the importance of monitoring models, introducing model monitoring capabilities, and demonstrating an end-to-end model deployment and monitoring flow.

  • What is Amazon SageMaker and what does it offer?

    -Amazon SageMaker is a fully managed service that simplifies each step of the machine learning process by providing decoupled modules for data collection, model training and tuning, model deployment, and model monitoring in production.

  • Why is it important to monitor machine learning models in production?

    -Monitoring is crucial because real-world data may differ from training data, leading to model drift or data drift over time. This can degrade model performance and impact business outcomes, making continuous monitoring essential for identifying when to retrain models.

  • What is model drift and how can it affect model performance?

    -Model drift refers to the gradual misalignment of a model with real-world data as it ages, due to changes in data distributions. This can significantly impact prediction quality and model accuracy, necessitating proactive monitoring and corrective actions.

  • How does Amazon SageMaker's Model Monitoring capability help in detecting model drift?

    -Amazon SageMaker's Model Monitoring capability uses Model Monitor to continuously monitor machine learning models, detect errors, and trigger alerts for remedial actions. It analyzes data based on built-in or customer-provided rules to determine rule violations and emits metrics into Amazon CloudWatch for further action.

  • What are the steps involved in the end-to-end model deployment and monitoring flow?

    -The flow starts with deploying the trained model, enabling data capture, capturing real-time inference requests and responses, generating baseline statistics and constraints, creating a data drift monitoring job, and taking corrective actions once drift is detected.

  • How does SageMaker handle data capture for model monitoring?

    -When deploying a SageMaker endpoint, data capture can be enabled to capture request and response data in a specified S3 location. This captured data is used for comparing against baseline data to identify data drift.
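
As a sketch of what enabling data capture looks like at the API level (endpoint config name, bucket path, and instance sizing below are placeholders, not values from the session; the boto3 import is deferred inside the deploy function so the config helper runs without the AWS SDK installed):

```python
from typing import Any, Dict

def data_capture_config(dest_s3_uri: str, sampling_pct: int = 100) -> Dict[str, Any]:
    """Build the DataCaptureConfig block passed to create_endpoint_config."""
    return {
        "EnableCapture": True,
        "InitialSamplingPercentage": sampling_pct,
        "DestinationS3Uri": dest_s3_uri,
        "CaptureOptions": [
            {"CaptureMode": "Input"},   # capture inference requests
            {"CaptureMode": "Output"},  # capture model responses
        ],
    }

def deploy_with_capture(config_name: str, model_name: str, dest_s3_uri: str):
    """Create an endpoint config with data capture enabled (placeholder sizing)."""
    import boto3  # deferred so data_capture_config() works without boto3
    sm = boto3.client("sagemaker")
    return sm.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
        }],
        DataCaptureConfig=data_capture_config(dest_s3_uri),
    )
```

The captured request/response files land as JSON lines under the destination prefix, which is what the monitoring jobs later read.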

  • What is the purpose of generating baseline statistics and constraints in model monitoring?

    -Baseline statistics and constraints are used to establish a reference point for monitoring. They include metadata analysis and thresholds for monitoring purposes, helping to detect deviations in real-world data compared to the training data.
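
A minimal sketch of the baselining step using the SageMaker Python SDK's `DefaultModelMonitor` (role, S3 paths, and instance settings are assumptions; the SDK import is deferred so the path helper stays usable without it):

```python
def baseline_files(baseline_out_s3: str) -> dict:
    """The two files the baselining job writes under the output prefix."""
    prefix = baseline_out_s3.rstrip("/")
    return {
        "statistics": prefix + "/statistics.json",
        "constraints": prefix + "/constraints.json",
    }

def run_baseline_job(role: str, train_csv_s3: str, baseline_out_s3: str):
    """Analyze the training data and write statistics/constraints to S3."""
    from sagemaker.model_monitor import DefaultModelMonitor  # deferred import
    from sagemaker.model_monitor.dataset_format import DatasetFormat

    monitor = DefaultModelMonitor(
        role=role,
        instance_count=1,
        instance_type="ml.m5.xlarge",  # placeholder sizing
        max_runtime_in_seconds=3600,
    )
    monitor.suggest_baseline(
        baseline_dataset=train_csv_s3,
        dataset_format=DatasetFormat.csv(header=True),
        output_s3_uri=baseline_out_s3,
        wait=True,
    )
    return monitor
```

The generated `constraints.json` can then be reviewed and edited by hand before monitoring starts, as the session recommends.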

  • How can businesses react to model drift detection?

    -Once model drift is detected, businesses can take corrective actions such as retraining the model, updating training data, or updating the model itself. They can also set up CloudWatch alarms to trigger these actions when certain thresholds are violated.
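
One way to wire such an alarm with boto3; the `aws/sagemaker/Endpoints/data-metrics` namespace and `feature_baseline_drift_<feature>` metric naming follow the pattern Model Monitor uses for data-quality metrics, but treat them, and the threshold, as assumptions to verify against the metrics your monitoring job actually emits:

```python
def drift_alarm_kwargs(endpoint, schedule, feature, threshold, sns_topic_arn):
    """Arguments for cloudwatch.put_metric_alarm on a per-feature drift metric."""
    return dict(
        AlarmName=f"{endpoint}-{feature}-drift",
        Namespace="aws/sagemaker/Endpoints/data-metrics",
        MetricName=f"feature_baseline_drift_{feature}",
        Dimensions=[
            {"Name": "Endpoint", "Value": endpoint},
            {"Name": "MonitoringSchedule", "Value": schedule},
        ],
        Statistic="Average",
        Period=3600,            # one evaluation per hourly monitoring run
        EvaluationPeriods=1,
        Threshold=threshold,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[sns_topic_arn],  # e.g. an SNS topic that triggers retraining
    )

def create_drift_alarm(**kwargs):
    import boto3  # deferred so the kwargs builder runs without boto3
    boto3.client("cloudwatch").put_metric_alarm(**drift_alarm_kwargs(**kwargs))
```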

  • What is the role of SageMaker Studio in visualizing model monitoring results?

    -SageMaker Studio can visualize data drift and accuracy drift metrics that are persisted in S3 buckets. It allows users to chart metrics against baselines for better analysis and understanding of model performance over time.

  • Can you provide an example of how to use SageMaker Model Monitor in a real-world scenario?

    -In the demo, a Jupyter notebook is used to demonstrate hosting trained machine learning models on Amazon SageMaker, capturing inference requests and results, analyzing a training dataset to generate baseline statistics and constraints, and monitoring a live endpoint for violations against these baseline constraints.

Outlines

00:00

🤖 Introduction to Machine Learning Model Drift

Sireesha Muppala, a Principal Solutions Architect at AWS, introduces the session on detecting machine learning model drift in production. She explains the importance of monitoring ML models and provides an overview of Amazon SageMaker, a fully managed service that streamlines the machine learning process. The session will cover model deployment, monitoring, and corrective actions upon detecting model drift. SageMaker's Model Monitoring capability is highlighted for its ability to detect errors and trigger remedial actions without the need for custom tooling.

05:05

🔍 End-to-End Model Deployment and Monitoring

This paragraph outlines the architecture and steps involved in deploying and monitoring machine learning models in production using Amazon SageMaker. It starts with deploying the trained model and enabling data capture, followed by capturing real-time inference requests and responses. Baseline data is established using a baselining job that generates statistics and constraints from training data. A data drift monitoring job is then executed periodically to compare inference requests against the baseline, generating reports and metrics that can trigger alarms and corrective actions.
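
The scheduling step described above can be sketched with the SageMaker Python SDK, assuming a `DefaultModelMonitor` whose baselining job has already run; the schedule name and output path are placeholders:

```python
def hourly_cron() -> str:
    """Same expression CronExpressionGenerator.hourly() produces."""
    return "cron(0 * ? * * *)"

def schedule_data_drift_monitor(monitor, endpoint_name: str, report_s3: str):
    """Attach an hourly data-drift schedule to an already-baselined monitor."""
    monitor.create_monitoring_schedule(
        monitor_schedule_name=endpoint_name + "-data-drift",  # placeholder name
        endpoint_input=endpoint_name,
        output_s3_uri=report_s3,
        statistics=monitor.baseline_statistics(),
        constraints=monitor.suggested_constraints(),
        schedule_cron_expression=hourly_cron(),
        enable_cloudwatch_metrics=True,
    )
```

Each scheduled run then compares the captured requests against the baseline statistics and constraints and writes its violation report to the output prefix.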

10:05

📊 Detecting Data and Model Quality Drift

The paragraph discusses the process of detecting data quality drift and model accuracy drift. It explains how to use SageMaker Model Monitor to detect accuracy drift by comparing predictions with ground truth data. The process involves capturing predictions, providing ground truth inference, and executing a merge job to combine these datasets. Model quality monitoring jobs generate statistics, violations, and CloudWatch metrics, which can be visualized in SageMaker Studio. The paragraph also covers how to take actions based on these metrics, such as retraining the model.
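
A hedged sketch of the model quality side: the per-record ground-truth format the merge job consumes, plus a `ModelQualityMonitor` schedule. The problem type, inference attribute, and paths are assumptions for the movie-rating use case, and the SDK import is deferred so the record helper runs standalone:

```python
def ground_truth_record(event_id: str, label) -> dict:
    """One JSON line in the ground-truth files the merge job consumes;
    eventId must match the inference ID captured at the endpoint."""
    return {
        "groundTruthData": {"data": str(label), "encoding": "CSV"},
        "eventMetadata": {"eventId": event_id},
        "eventVersion": "0",
    }

def schedule_model_quality_monitor(role, endpoint_name, ground_truth_s3, report_s3):
    from sagemaker.model_monitor import (  # deferred import
        CronExpressionGenerator, EndpointInput, ModelQualityMonitor,
    )
    monitor = ModelQualityMonitor(role=role, instance_count=1,
                                  instance_type="ml.m5.xlarge")
    monitor.create_monitoring_schedule(
        monitor_schedule_name=endpoint_name + "-model-quality",
        endpoint_input=EndpointInput(
            endpoint_name=endpoint_name,
            destination="/opt/ml/processing/input",
            inference_attribute="0",  # which response field holds the prediction
        ),
        ground_truth_input=ground_truth_s3,  # labels from the consuming app
        problem_type="Regression",           # rating prediction in this use case
        output_s3_uri=report_s3,
        schedule_cron_expression=CronExpressionGenerator.hourly(),
    )
    return monitor
```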

15:06

🎬 Demonstrating Model Monitoring in Action

In this paragraph, a demo is presented using a Jupyter notebook to demonstrate hosting trained machine learning models on Amazon SageMaker, capturing inference requests and results, and using SageMaker Model Monitor to analyze training data sets. The demo covers generating baseline statistics and constraints, monitoring a live endpoint for violations against baseline constraints, and identifying data quality drift. The use case involves an XGBoost-based movie recommendation model, and the process is shown through various API calls and execution results.

20:11

📚 Analyzing Captured Data and Setting Up Monitoring

This paragraph details the steps taken in the demo to analyze captured data and set up continuous monitoring. It includes examining the captured data for baseline violations, automating the analysis process using SageMaker processing jobs, and generating statistics and constraints files. The constraints file suggests threshold values for detecting data drift. The paragraph also covers configuring continuous monitoring and analyzing the results to detect data drift.
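
The violations report a monitoring run writes is plain JSON, so inspecting it programmatically can be as simple as the sketch below (the sample document is illustrative, not output captured from the demo):

```python
import json

def drift_violations(violations_json: str):
    """Return (feature, check) pairs from a constraint_violations.json body."""
    doc = json.loads(violations_json)
    return [(v["feature_name"], v["constraint_check_type"])
            for v in doc.get("violations", [])]

# Illustrative report shape (not taken from the session's demo output).
sample = """{"violations": [
  {"feature_name": "age",
   "constraint_check_type": "baseline_drift_check",
   "description": "Baseline drift distance exceeds the suggested threshold"}]}"""
```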

25:12

🚀 Reacting to Drift Detection and Model Retraining

The final paragraph of the script covers the steps taken in the demo to react to drift detection and trigger model retraining. It includes creating an SNS topic and a CloudWatch alarm to monitor for drift violations. If a violation is detected, the alarm fires and a message is published to the SNS notification topic, which in turn triggers a Lambda function to retrain the model. The demo concludes with instructions to delete resources to avoid unnecessary costs and a reminder to complete the session survey.
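
A minimal sketch of that reaction path: a Lambda handler that parses the SNS-delivered CloudWatch alarm and, when the state is ALARM, kicks off retraining. The pipeline name is a placeholder; calling `create_training_job` with a stored configuration would work equally well, and boto3 is imported only when a reaction is actually needed:

```python
import json

def alarm_from_sns(event: dict):
    """Extract (alarm name, new state) from an SNS-delivered CloudWatch alarm."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    return message["AlarmName"], message["NewStateValue"]

def handler(event, context):
    alarm, state = alarm_from_sns(event)
    if state == "ALARM":
        import boto3  # deferred: only needed when we actually react
        # One possible reaction; "movie-rec-retrain" is a placeholder name.
        boto3.client("sagemaker").start_pipeline_execution(
            PipelineName="movie-rec-retrain")
    return {"alarm": alarm, "state": state}
```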

Keywords

💡Model Drift

Model drift refers to the gradual decline in the performance of a machine learning model over time due to changes in the data distribution it encounters in the real world. It is a critical issue in the video's context as it can significantly impact the prediction quality, leading to suboptimal business outcomes. The script discusses detecting and addressing model drift through monitoring and taking corrective actions when necessary.

💡Amazon SageMaker

Amazon SageMaker is a fully managed service that simplifies the process of building, training, and deploying machine learning models. It is central to the video's theme as it provides the infrastructure and capabilities needed for monitoring machine learning models in production. The script mentions SageMaker's role in deploying models, setting up endpoints, and monitoring for model drift.

💡Model Monitoring

Model monitoring is the process of continuously tracking the performance of a deployed machine learning model to detect any deviations from expected behavior. In the video, model monitoring is a key capability of Amazon SageMaker that helps identify model drift and performance degradation, enabling proactive measures to maintain model accuracy.

💡Data Capture

Data capture involves recording the input and output data of a machine learning model as it processes real-world requests. This feature in Amazon SageMaker is highlighted in the script as a prerequisite for monitoring models, as it allows for the analysis of inference data against the baseline to detect data drift.

💡Data Distribution

Data distribution refers to the statistical properties of a dataset, including the range, center, and shape of the data points. Changes in data distribution over time can cause model drift, as the script explains. Monitoring the data distribution is essential to ensure the model continues to perform well with new data.

💡Machine Learning Model

A machine learning model is an algorithm that learns patterns from data and makes predictions or decisions based on that learning. The video discusses the importance of monitoring these models in production to ensure they remain accurate and relevant as they encounter new data.

💡Retraining

Retraining is the process of training a machine learning model again, typically with new or updated data, to improve its performance. The script mentions retraining as a corrective action to be taken when model drift is detected, ensuring the model stays aligned with current data trends.

💡CloudWatch Metrics

CloudWatch Metrics are the data points collected and monitored by Amazon CloudWatch, which can be used to trigger alarms and visualize operational data. In the context of the video, these metrics are emitted by SageMaker model monitoring and can be used to set up alerts for model drift.

💡Baseline

A baseline in model monitoring is a set of statistics and constraints derived from the training data that serves as a reference point for comparison against real-world inference data. The script describes using a baseline to detect data drift and to establish what is considered normal behavior for the model.

💡Drift Detection

Drift detection is the process of identifying significant changes in the data or model's performance that deviate from the established baseline. The script details the steps and tools used in Amazon SageMaker to detect both data drift and model quality drift, which are crucial for maintaining model reliability.

💡SageMaker Studio

SageMaker Studio is an integrated development environment (IDE) for machine learning provided by AWS. It is mentioned in the script as a tool for visualizing the metrics and results from model monitoring jobs, helping users to understand and react to model drift.

💡Jupyter Notebook

A Jupyter Notebook is an open-source web application that allows for creating and sharing documents containing live code, equations, visualizations, and narrative text. The script refers to a Jupyter Notebook demonstration that showcases how to use SageMaker's capabilities to monitor and manage model drift.

Highlights

Amazon SageMaker is a fully managed service that simplifies the machine learning process.

Monitoring machine learning models in production is crucial due to potential model drift.

Amazon SageMaker's Model Monitoring capability helps detect model drift without building custom tooling.

Model drift can be caused by changes in real-world data distributions over time.

Continuous monitoring allows for timely retraining of ML models to maintain prediction quality.

Amazon SageMaker provides real-time inference endpoints for deployed models.

Data capture is enabled in SageMaker endpoints to collect inference requests and responses.

Baseline data is essential for comparing against real-time inference data to detect data drift.

SageMaker generates statistics and constraints from training data for monitoring purposes.

Data drift monitoring jobs compare inference requests against baseline stats and constraints.

Violations of monitoring rules trigger metrics emission into Amazon CloudWatch for alerting.

Model quality can be monitored by comparing predictions with ground truth data.

SageMaker Model Monitor supports out-of-the-box metrics for classification and regression.

CloudWatch alerts can be set up based on threshold values for detected drift metrics.

Detected model drift can trigger actions such as retraining the model.

The end-to-end flow for deploying and monitoring ML models in production includes reacting to model drift detection.

A demo in the session shows how to host trained models on SageMaker, capture inference requests, and monitor for violations.

The movie recommendation model demo uses the MovieLens dataset and focuses on the 'age' feature for data drift analysis.

SageMaker processing jobs automate the analysis of data capture files for detecting baseline violations.

Continuous monitoring can be configured using SageMaker's monitoring schedules.

CloudWatch alarms and SNS topics can be used to trigger actions like model retraining when drift is detected.

The demo concludes with a walkthrough of the process for data quality drift detection and retraining.

Transcripts

play00:02

Thank you for viewing the re:Invent session on detecting

play00:06

machine learning model drift in production.

play00:08

My name is Sireesha Muppala

play00:10

and I'm a Principal Solutions Architect at AWS.

play00:14

I work with multiple customers across various business

play00:18

verticals on their AI/ML workloads.

play00:22

Today we'll start our session with a quick introduction

play00:26

of Amazon SageMaker and a shorter discussion

play00:29

on why it is important to monitor

play00:31

machine learning models in production.

play00:33

Then I'll briefly introduce Model Monitoring capability

play00:38

of Amazon SageMaker, then we'll look into

play00:41

the end-to-end Model Deployment and Monitoring Flow.

play00:45

We'll follow that up with a few options

play00:49

that you as a customer can take once a model drift has been detected.

play00:54

Finally, we'll wrap it up with a functional notebook

play00:58

that will take a look at the code and the APIs

play01:02

behind these various steps.

play01:06

Amazon SageMaker is a fully managed service that removes

play01:10

the heavy lifting from each step of machine learning process

play01:14

through decoupled modules for collecting and preparing data,

play01:18

training and tuning a model, deploying the trained model,

play01:22

and finally, monitoring the models deployed into production.

play01:28

With the deployment module, SageMaker provides the ability to host

play01:34

real-time inference endpoints.

play01:37

In this session we'll focus on monitoring

play01:40

the production model endpoints.

play01:43

Machine learning models are typically trained

play01:46

and evaluated using historical data.

play01:50

But the real-world data may not look like the training data,

play01:53

especially as the models age over time

play01:57

and the data distributions change.

play02:00

For example, the input data units may change from Fahrenheit to Celsius

play02:06

or maybe, all of a sudden, your application is sending

play02:09

null values to your model,

play02:11

which impacts the model quality quite a bit.

play02:14

Or maybe, in a real-time retail world consumer scenario,

play02:19

the consumer purchase preferences change over time.

play02:24

This gradual misalignment of the model in the real world

play02:27

is known as model drift or data drift and it can have

play02:32

a big impact on prediction quality.

play02:35

Similarly, the model performance may degrade over time as well.

play02:39

Degraded model accuracy over time impacts business outcomes.

play02:45

To proactively address this problem, it is crucial to continuously monitor

play02:51

the model performance.

play02:53

This continuous monitoring allows you to identify the right time

play02:59

and the frequency to retrain your ML model.

play03:03

While retraining too frequently can be too expensive,

play03:07

not training often enough could result

play03:10

in less-than-optimal predictions from your machine learning model.

play03:17

Amazon SageMaker model monitoring capability

play03:21

addresses this exact need.

play03:24

Using Model Monitor, machine learning models are monitored

play03:28

and errors are detected so that you as a customer

play03:32

can take remedial actions.

play03:35

The model monitoring capability eliminates the need to build

play03:39

any kind of tooling to monitor models in production and detect

play03:44

when corrective actions need to be taken.

play03:48

The model monitoring capability analyses the data collected

play03:51

based on built-in rules or customer provided rules

play03:56

at regular frequency to determine if there are any rule violations.

play04:01

The built-in statistical rules can be used to analyze tabular data

play04:06

and detect common issues such as outliers in prediction data.

play04:11

Drift and data distributions can also be detected and so can the changes

play04:17

in prediction accuracy based on observations

play04:20

from the real world.

play04:22

With model monitoring, when these rules are violated,

play04:27

metrics are emitted into Amazon CloudWatch

play04:30

so that you can set up alarms to audit and retrain models.

play04:36

Data drift and accuracy drift metrics are also persisted

play04:40

into S3 buckets and can be visualized in SageMaker Studio.

play04:47

Using all of these capabilities together, an end-to-end flow

play04:52

for deploying and monitoring models in production looks like this.

play04:56

It starts with deploying the trained model

play04:59

and ends with taking a corrective action once drift is detected.

play05:05

Now, to go along with that end-to-end flow,

play05:08

here's the end-to-end architecture.

play05:11

There are quite a few different components here,

play05:13

so let's get to building this out step-by-step.

play05:19

The very first step here is to deploy the trained model.

play05:25

We start with ground truth training data

play05:28

and run a training job on SageMaker, which generates a model artifact.

play05:36

A deployed SageMaker endpoint makes the trained model available

play05:41

for model consumers.

play05:43

Now, when you create that endpoint, make sure you enable data capture.

play05:49

Now, with this endpoint deployed, a consuming application

play05:53

can now start sending requests and get back predictions

play05:57

from your model.

play05:59

Since data capture was enabled in our previous step, the request

play06:03

and the responses are captured in the S3 location of your choice.

play06:10

Now that the real-time inference request and the responses

play06:13

are being captured, to identify if there's

play06:16

any kind of data drift, we need to have some baseline data.

play06:23

In the next step, we execute that baselining job that generates

play06:28

the statistics and constraints about the training data.

play06:33

The statistics generated include metadata analysis

play06:37

of the training data.

play06:38

That means metrics such as mean, maximum value, and minimum value

play06:43

for numerical features, and metrics such as distinct counts

play06:47

for string features.

play06:49

On the other hand, the constraints generated captures

play06:53

a threshold for these stats for monitoring purposes.

play06:58

So, the constraints can also include conditions along the lines of:

play07:02

a particular feature should always be considered as a string,

play07:06

not as an integer or, a particular specific field

play07:11

should be a not null field.

play07:14

You can review these constraints generated

play07:17

and choose to modify or even override them based on

play07:21

your business domain knowledge.

play07:24

So, now that we have both the baseline details

play07:27

and we have captured the inference request,

play07:30

so we can compare the two to identify any kind of drift.

play07:36

In this step you'll create a data drift monitoring job

play07:41

that SageMaker will periodically run on your behalf at the schedule

play07:46

that you select.

play07:48

The job compares the inference requests

play07:51

against the baseline's stats and constraints.

play07:54

For each execution of the monitoring job,

play07:57

the generated results include a violation report,

play08:01

that is persisted once again in Amazon S3,

play08:04

a statistics report of the data that is collected during the run

play08:08

and also summary metrics

play08:11

and stats that are emitted to Amazon CloudWatch.

play08:16

Out of the box, here are the few violations

play08:20

that are generated.

play08:22

So, we have data type check as the first one.

play08:25

So, this violation is generated if the data types

play08:29

of a particular feature in inference request doesn't match

play08:34

the baseline constraint.

play08:36

Similarly, we have violations for completeness check,

play08:41

missing column check, extra columns check,

play08:43

and categorical values check as well.

play08:47

Ok, so, at this point we're able to detect data quality drift.

play08:53

But what happens if the quality of the model itself changes?

play08:59

Say for example, the accuracy of the model decreases.

play09:02

Now, let's see how to use the Model Monitor capability

play09:06

to detect the accuracy drift.

play09:11

At the core, the process for detecting accuracy drift

play09:14

will look very similar to the one that we just went through

play09:18

for data drift.

play09:19

We collect the predictions made and the ground truth

play09:23

of the prediction and compare the two.

play09:26

First, by merging.

play09:28

Now, we already have the predictions captured

play09:31

because we enabled data capture for our endpoint.

play09:35

You next need to provide ground truth inference

play09:39

that the model consuming application should be providing.

play09:43

So, what does this mean?

play09:44

What does a prediction ground truth mean?

play09:47

That would actually depend on what your model is predicting

play09:51

and what the business use case is.

play09:55

Let's say for example you are monitoring

play09:57

a movie recommendation model.

play10:00

A possible ground truth inference in this case is whether the user

play10:05

has actually watched the recommended movie or not.

play10:08

Or maybe they just clicked on the video but they didn't actually

play10:12

complete watching it.

play10:14

So, there should be some application logic

play10:17

that needs to provide this ground truth inference.

play10:21

With both the predictions captured and ground truth provided

play10:25

by your model consuming application, SageMaker executes a merge job

play10:31

to merge these two data sets together.

play10:34

The merge job once again is a periodic job that is executed

play10:39

on your behalf.

play10:41

Once you have the data merged, it's time to monitor the accuracy.

play10:48

In this step, you create a model monitoring quality job.

play10:54

Excuse me, it should actually be model quality monitoring job.

play10:58

A job that is executed periodically at a schedule that you provide

play11:05

on your behalf by SageMaker.

play11:08

Once again, the model quality job also generates statistics, violations

play11:14

and CloudWatch metrics.

play11:16

The metrics that are generated by the two monitoring jobs

play11:21

can actually be visualized in SageMaker Studio as well.

play11:25

For the model quality monitoring job,

play11:28

here are some of the metrics that are generated.

play11:30

These include accuracy, F1 values, precision,

play11:34

and recall.

play11:35

SageMaker Model Monitor supports classification and regression metrics

play11:40

out of the box, but you can bring your own metrics as well.

play11:45

You can also choose to chart a particular metric

play11:50

against the baseline for visualization purposes.

play11:55

Ok, so now, at this point, we're able to detect

play11:58

both data quality drift and model quality drift.

play12:02

Now it's time to take actions on that.

play12:08

Both the data drift and the model quality monitoring jobs

play12:13

emit CloudWatch metrics, as I mentioned before.

play12:17

The data drift monitoring job emits CloudWatch metrics

play12:21

such as maximum-minimum average values for numerical features

play12:26

along with completeness and drift metrics for both numerical

play12:31

and string features.

play12:32

You can create CloudWatch alerts for these metrics

play12:36

based on threshold values and, if those thresholds are violated,

play12:40

CloudWatch alerts will be raised.

play12:43

Once that an alert is generated, you can decide on what actions

play12:48

you want to take on these alerts.

play12:50

Maybe one of the possible actions would be to retrigger training.

play12:56

Similarly, model quality monitoring job also

play12:59

generates CloudWatch metrics.

play13:01

So, here you can see accuracy, F1, recall and you'll see stats

play13:07

of maximum values, minimum values, count and average

play13:10

for these various metrics.

play13:13

So, once we have those metrics, you can take actions like updating

play13:17

the model, updating your training data, and retraining

play13:23

and updating the model itself.

play13:25

Now, if you choose to retrain the model, now you're completing

play13:30

that loop, so you go back all the way to the ground truth training data

play13:34

and start training your model one more time.

play13:38

This is the end-to-end flow for deploying and monitoring ML models

play13:45

in production.

play13:47

We started with deploying the trained model

play13:49

and ended with reacting to the model drift detection.

play13:54

While this flow showcased both data drift and accuracy drift,

play13:59

you can actually choose to do one or the other.

play14:02

For example, after deploying the model in production,

play14:06

you can choose to monitor the accuracy drift

play14:11

and completely act on it.

play14:13

Similarly, after deploying the model in production,

play14:16

you can monitor and detect, and act on data drift completely

play14:23

bypassing the accuracy drift.

play14:25

In fact, this is exactly what we're going to see in our demo next,

play14:30

with the help of a particular use case.

play14:34

So, let's jump in to the demo.

play14:38

In this demo, I'll walk through a Jupyter notebook that demonstrates

play14:42

how to host trained machine learning models

play14:46

on Amazon SageMaker and capture inference requests and results,

play14:51

how to use SageMaker Model Monitor to analyze a training data set

play14:55

to generate baseline statistics and constraints

play14:58

about the training data.

play15:00

And finally, how to use SageMaker model monitoring capability

play15:05

to monitor a live endpoint for violations

play15:08

against the baseline constraints, to identify the data quality drift

play15:14

and react to it.

play15:16

In the interest of time, the notebook has already been executed.

play15:20

We'll examine the various API calls used and the results of the execution.

play15:25

The use case used in this demo is an XGBoost based

play15:31

movie recommendation model.

play15:33

In Section 1, we deal with a few steps

play15:37

to set everything up.

play15:39

We'll import all the necessary libraries,

play15:42

specify the AWS related region and role variables,

play15:46

as well as define several other variables

play15:49

that we're gonna use throughout the notebook here.

play15:59

And once we have the setup activities out of the way,

play16:02

let's start looking at the training data itself.

play16:07

The movie recommendation model was trained using the MovieLens data

play16:12

that is available at this link.

play16:15

In this data set, the target variable is the rating of the movie

play16:21

provided by the user and the features include user ID,

play16:26

item ID, movie genre, age, zip code of the user, user gender,

play16:32

and finally, a one-hot encoded representation

play16:35

of the user occupation.

play16:39

Throughout the notebook, we'll use the feature age,

play16:42

which is a numerical field, to discuss baseline violations

play16:48

and data drift.

play16:49

The same concept will apply to other features of the data set as well.

In Section 2, we upload the pre-trained model to the S3 bucket and then create a SageMaker model entity using SageMaker's create_model API. At this point, we have a model entity in hand and we need to host it on an endpoint.
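To make the create_model step concrete, here is a minimal sketch of the request shape boto3 expects. The model name, container image URI, artifact path, and role ARN below are illustrative placeholders, not the values used in the demo.

```python
# Sketch of the CreateModel request the notebook issues (boto3 shape).
# All names, URIs, and ARNs here are hypothetical placeholders.
create_model_request = {
    "ModelName": "movie-rec-xgboost",
    "ExecutionRoleArn": "arn:aws:iam::123456789012:role/SageMakerRole",
    "PrimaryContainer": {
        # XGBoost inference image URI for your region (placeholder)
        "Image": "<account>.dkr.ecr.<region>.amazonaws.com/xgboost:latest",
        # The pre-trained model artifact uploaded to S3 in the previous step
        "ModelDataUrl": "s3://my-bucket/model/model.tar.gz",
    },
}

# With credentials configured, the actual call would be:
# import boto3
# boto3.client("sagemaker").create_model(**create_model_request)
```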

To do that, we first specify the data capture configuration, where you set EnableCapture to True, specify whether you want to capture the input data, the output data, or both, and specify the S3 location where the captured data should be stored. You set that with the destination S3 URI path, and you have complete control over that S3 location, which means you can version this data and secure it with IAM policies and encryption according to your needs.

Once you have the data capture configuration set up, you pass it into SageMaker's endpoint configuration API along with the compute instance type and the compute instance count needed to host the model. At the end of this step, you have an endpoint configuration ready, and using that endpoint configuration in the next step, you create an endpoint with the create_endpoint API. This hosts the model on the compute instances you specified, with data capture enabled. As you can see, this takes a few minutes to complete; in my experimentation it took about 7 minutes. At the end of those 7 minutes, what you have is a real-time inference endpoint that is ready to take inference requests.
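The hosting step above can be sketched as the following boto3 request shapes; the endpoint, config, and bucket names are assumed placeholders, and the capture/instance settings are one reasonable choice rather than the demo's exact values.

```python
# Sketch of hosting the model with data capture enabled (boto3 request
# shapes); all names and the S3 destination are hypothetical placeholders.
data_capture_config = {
    "EnableCapture": True,
    "InitialSamplingPercentage": 100,       # capture every request
    "DestinationS3Uri": "s3://my-bucket/datacapture",
    "CaptureOptions": [                     # capture both directions
        {"CaptureMode": "Input"},
        {"CaptureMode": "Output"},
    ],
}

endpoint_config_request = {
    "EndpointConfigName": "movie-rec-config",
    "ProductionVariants": [{
        "VariantName": "AllTraffic",
        "ModelName": "movie-rec-xgboost",
        "InstanceType": "ml.m5.xlarge",     # compute instance type
        "InitialInstanceCount": 1,          # compute instance count
    }],
    "DataCaptureConfig": data_capture_config,
}

# sm = boto3.client("sagemaker")
# sm.create_endpoint_config(**endpoint_config_request)
# sm.create_endpoint(EndpointName="movie-rec-endpoint",
#                    EndpointConfigName="movie-rec-config")
```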

So, in Section 3, we're going to hit that endpoint with data and start capturing the input data as well as the results. To invoke the endpoint, as you might expect, we use SageMaker's invoke_endpoint API, to which you pass the endpoint's name. Using that approach, here we have recommendations for a couple of users that we got back from the deployed endpoint.
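An invocation along those lines might look like the sketch below; the CSV feature row is a made-up example (user ID, item ID, a few indicator features, age), not a record from the actual MovieLens traffic.

```python
# Sketch of invoking the endpoint via the SageMaker runtime client.
# The feature row is a hypothetical example, not the demo's real data.
payload = ",".join(str(v) for v in [196, 242, 0, 1, 49, 0, 1])

invoke_request = {
    "EndpointName": "movie-rec-endpoint",   # placeholder endpoint name
    "ContentType": "text/csv",
    "Body": payload,
}

# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(**invoke_request)
# prediction = response["Body"].read().decode("utf-8")
```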

Since the endpoint already has data capture enabled and we've started sending requests to it, we should start seeing data capture files in the S3 location. When you list the objects in the S3 location you previously specified, you'll see these .json files that contain the captured data. Once we have the list, let's look at a single captured file; that's what we're doing in this cell, where we simply print out the S3 object body. Here you can see that all the captured data is in a JSON Lines formatted file, and for each inference request made against the real-time endpoint, a single line is captured. If you dive into the content of a single JSON line, you'll see that we're capturing the endpoint input, the endpoint output, as well as event metadata.
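To illustrate, here is a hand-written record in the general shape of a data capture line, and how you would pull out the input, output, and metadata with the standard json module. The field values (and the exact schema details) are illustrative assumptions, not copied from the demo's capture files.

```python
import json

# One JSON Lines record in roughly the shape Model Monitor's data
# capture files use; values here are made up for illustration.
capture_line = json.dumps({
    "captureData": {
        "endpointInput": {
            "observedContentType": "text/csv",
            "mode": "INPUT",
            "data": "196,242,0,1,37.0",   # note: age arrives as a float
            "encoding": "CSV",
        },
        "endpointOutput": {
            "observedContentType": "text/csv",
            "mode": "OUTPUT",
            "data": "3.97",               # predicted rating
            "encoding": "CSV",
        },
    },
    "eventMetadata": {"eventId": "example-event-id",
                      "inferenceTime": "2020-12-01T00:00:00Z"},
})

record = json.loads(capture_line)         # one line == one request
model_input = record["captureData"]["endpointInput"]["data"]
model_output = record["captureData"]["endpointOutput"]["data"]
```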

Let's dive a little into the data input captured in this file. This is the data arriving at your endpoint, and if you observe one feature in particular, age, you'll see that the inference traffic has age as a float value, 37.0 to be exact. But when we initially examined the training data, this feature was an integer. While the deployed model still provided predictions even with this kind of inconsistency, it is good to understand how your inference traffic is deviating from the training baseline. But it is tedious and error prone to perform this kind of analysis manually on each line of the data capture file.

So, in Section 4, we automate this process using a SageMaker processing job. Before we kick off the processing job, we specify exactly where our baseline data is and where the results of the processing job's analysis should go. We specify those values and then kick off the processing job in this cell. As you can see, this is another job that takes a few minutes to complete; in my particular case, it took about 6 minutes, at the end of which I have two different files: constraints.json and statistics.json.
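To make the statistics concrete, here is a small pure-Python illustration of the kind of per-feature numbers the baselining job computes for a numerical column like 'age'. The sample values are made up; the real job computes these over the full training set.

```python
import math

# Made-up sample of the numerical 'age' column from training data.
ages = [24, 53, 23, 24, 33, 42, 57, 30, 29, 53]

mean_age = sum(ages) / len(ages)

baseline_age_stats = {
    "sum": float(sum(ages)),
    "mean": mean_age,
    "min": float(min(ages)),
    "max": float(max(ages)),
    # population standard deviation, as a plain illustration
    "std_dev": math.sqrt(sum((a - mean_age) ** 2 for a in ages) / len(ages)),
    # count of missing entries in the column
    "num_missing": sum(1 for a in ages if a is None),
}
```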

Let's explore what's in the generated statistics file. In this file, for numerical features like rating, item ID, and user ID, the processing job calculates statistics like sum, mean, standard deviation, minimum and maximum values, along with identifying any missing values. For string features like zip code, we see similar information but, in addition, the distinct count for that particular feature. Alright, that covers the stats.

In addition to the stats, we also have a constraints file in our results folder, so let's look at what's in constraints.json. Here you can see that, for [INDISCERNIBLE] features like user ID, item ID, and movie genre, the feature must be non-negative, and you'll also see that features like zip code need to be string values, not numerical values. If you print out the contents of the constraints file, as I'm doing in this cell, towards the end of the file you'll find the monitoring config section that defines the threshold values for the various constraints. If these threshold values are violated, that means we're observing data drift; that's what the threshold values are used for, to detect drift. Now, this generated constraints file is just a suggestion from the SageMaker processing job; you can choose to override it, either at the field level or the file level, based on your business domain knowledge.
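A simplified sketch of the data-type check the monitoring job performs against such a constraint: the baseline says 'age' is integral, so float values in the captured traffic count as type mismatches. The constraint and record shapes here are simplified illustrations, not the exact constraints.json schema.

```python
# Simplified illustration of a baseline data-type check for 'age'.
age_constraint = {"name": "age", "inferred_type": "Integral"}

captured_ages = ["37.0", "25.0", "41.5"]   # raw values seen at inference

def is_integral(raw: str) -> bool:
    """True if the raw string parses as a whole number like '42'."""
    try:
        int(raw)
        return True
    except ValueError:
        return False

# Every float-formatted value violates the Integral constraint.
mismatches = [v for v in captured_ages if not is_integral(v)]
violation_pct = 100.0 * len(mismatches) / len(captured_ages)
```

With this made-up traffic, all three captured values fail the check, which mirrors the ~100% data-type violation reported later in the demo.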

In Section 5, we take this a bit further: we configure continuous monitoring and analyze the results to actually detect the data drift. We start by creating the monitoring schedule right here. Once that is scheduled, we start generating some inference traffic in an infinite loop, continuously hitting the endpoint. We then use APIs like describe_monitoring_schedule and list_monitoring_executions to look at the status of the various executions of this monitoring schedule. The monitoring schedule is executed periodically, based on the period you provided when you configured it.
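For orientation, here is a skeleton of the CreateMonitoringSchedule request shape; the names, cron expression, S3 paths, analyzer image URI, and role ARN are all assumed placeholders to adapt to your account.

```python
# Skeleton of a CreateMonitoringSchedule request (boto3 shape).
# Every name, URI, and ARN below is a hypothetical placeholder.
monitoring_schedule_request = {
    "MonitoringScheduleName": "movie-rec-data-quality",
    "MonitoringScheduleConfig": {
        "ScheduleConfig": {
            # run the monitoring job at the top of every hour
            "ScheduleExpression": "cron(0 * ? * * *)",
        },
        "MonitoringJobDefinition": {
            "BaselineConfig": {
                "ConstraintsResource": {
                    "S3Uri": "s3://my-bucket/baseline/constraints.json"},
                "StatisticsResource": {
                    "S3Uri": "s3://my-bucket/baseline/statistics.json"},
            },
            "MonitoringInputs": [{
                "EndpointInput": {
                    "EndpointName": "movie-rec-endpoint",
                    "LocalPath": "/opt/ml/processing/input",
                },
            }],
            "MonitoringOutputConfig": {"MonitoringOutputs": [{
                "S3Output": {
                    "S3Uri": "s3://my-bucket/monitoring-reports",
                    "LocalPath": "/opt/ml/processing/output",
                },
            }]},
            "MonitoringResources": {"ClusterConfig": {
                "InstanceType": "ml.m5.xlarge",
                "InstanceCount": 1,
                "VolumeSizeInGB": 20,
            }},
            "MonitoringAppSpecification": {
                # region-specific model-monitor analyzer image (placeholder)
                "ImageUri": "<model-monitor-analyzer-image-uri>",
            },
            "RoleArn": "arn:aws:iam::123456789012:role/SageMakerRole",
        },
    },
}
# boto3.client("sagemaker").create_monitoring_schedule(
#     **monitoring_schedule_request)
```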

OK, so let's look at the very first execution of the job here. When you look at the status of the very first execution, you can see that the job has completed, but it completed with violations. We can use the response from the list_monitoring_executions API call to find the exact location of that violations report. Here you can see the report URI right here, which means the violations report is also stored in your S3 bucket.

So, here's another view of constraint_violations.json. If you look at what is actually in the violations file, you'll see several different violations, and I'm going to focus on what's happening with the feature 'age'. For the 'age' feature, I got a data type check violation, which, once again, means that the 'age' feature was expecting a numerical, integral value but got a float value, so we got a violation at about 100% here. Now we're able to detect any kind of violation against our baseline data.

In Section 6, we take it one step further and use that to detect drift and trigger retraining. We first create an SNS topic and a CloudWatch alarm. For our CloudWatch alarm, we use the SageMaker-specific namespace and SageMaker-specific dimensions, and you'll see that we also set a threshold for our drift value, which means that if a violation or drift is noticed for the 'age' feature, the alarm is triggered, indicating that there's drift.
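A sketch of what that alarm request can look like with CloudWatch's put_metric_alarm. The namespace and dimension names follow Model Monitor's published per-endpoint metrics, but treat the metric name, threshold, and resource names here as assumptions to adapt to what your monitoring job actually emits.

```python
# Sketch of the CloudWatch alarm request (put_metric_alarm shape).
# Metric name, threshold, and all resource names/ARNs are assumptions.
alarm_request = {
    "AlarmName": "movie-rec-age-drift",
    # Model Monitor publishes per-feature metrics under this namespace
    "Namespace": "aws/sagemaker/Endpoints/data-metrics",
    "MetricName": "feature_baseline_drift_age",
    "Dimensions": [
        {"Name": "Endpoint", "Value": "movie-rec-endpoint"},
        {"Name": "MonitoringSchedule", "Value": "movie-rec-data-quality"},
    ],
    "Statistic": "Average",
    "Period": 3600,                      # one monitoring interval
    "EvaluationPeriods": 1,
    "Threshold": 0.1,                    # drift threshold (assumed)
    "ComparisonOperator": "GreaterThanThreshold",
    # SNS topic that fans out to the retraining Lambda (placeholder ARN)
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:drift-topic"],
}
# boto3.client("cloudwatch").put_metric_alarm(**alarm_request)
```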

When you execute this code, you should see a CloudWatch alarm created in your console right here. Initially it will be in the 'insufficient data' state and, within a few minutes, it will change into the alarm state. When that alarm is triggered, a message gets published to the SNS notification topic, which triggers a Lambda function that kicks off retraining of the model. So, in the SageMaker console, you should see a new training job kicked off, either in progress or completed, depending on when you check the status of the job.

That brings us to the end of the data quality drift detection demo. If you're experimenting with this notebook in your own AWS account, we recommend that you delete all the resources using the SageMaker APIs mentioned in the optional section right here, so that you avoid any unnecessary cost.
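A cleanup sketch along those lines, deleting the monitoring schedule before the endpoint it watches; the resource names match the placeholders used in the earlier sketches, and the calls are left commented so nothing is deleted by accident.

```python
# Teardown order: remove the monitoring schedule first, then the
# endpoint, its configuration, and the model. Names are the
# hypothetical placeholders from the earlier sketches.
cleanup_order = [
    ("delete_monitoring_schedule",
     {"MonitoringScheduleName": "movie-rec-data-quality"}),
    ("delete_endpoint", {"EndpointName": "movie-rec-endpoint"}),
    ("delete_endpoint_config", {"EndpointConfigName": "movie-rec-config"}),
    ("delete_model", {"ModelName": "movie-rec-xgboost"}),
]

# sm = boto3.client("sagemaker")
# for api, kwargs in cleanup_order:
#     getattr(sm, api)(**kwargs)
```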

The code for the demo you saw is available on GitHub at this particular location. While the notebook only showed you how to monitor for data drift, you can easily extend it to include accuracy monitoring as well, using the appropriate SageMaker APIs.

That brings us to the end of this session. Thank you for taking the time to watch, and please remember to complete the session survey and leave us your feedback. Thank you.
