AWS DevDays 2020 - Deep Dive on Amazon SageMaker Debugger & Amazon SageMaker Model Monitor

Julien Simon
26 Mar 2020 · 44:35

Summary

TLDR: This session delves into Amazon SageMaker's advanced capabilities, focusing on SageMaker Debugger for monitoring model training and Model Monitor for detecting data and prediction quality issues post-deployment. The speaker demonstrates using these tools with a classification model, highlighting features like data capture, real-time analytics, and automated rule-based checks to ensure model reliability and efficiency. Additionally, cost optimization strategies for SageMaker are discussed, including spot instance training and model tuning for efficient resource utilization.

Takeaways

  • 📘 The session focuses on Amazon SageMaker's capabilities, specifically the Debugger and Model Monitor features, which assist in inspecting model training and identifying data quality issues post-deployment.
  • 🛠️ Amazon SageMaker Debugger allows users to save model states, such as tensors representing the model's parameters, gradients, and weights, periodically during training to S3 for later inspection.
  • 🔍 Debugger's rules can be configured to monitor for undesirable conditions during training, such as class imbalance or vanishing gradients, providing real-time feedback and potentially stopping the training job if issues arise.
  • 📊 SageMaker Debugger can visualize feature importance, helping to understand which dataset features contribute most to the model's predictions, enhancing model explainability.
  • 🔄 The session demonstrates using spot instances for training models on SageMaker to save costs, with significant savings shown in the example.
  • 📈 Model Monitor captures data sent to a model in production, including requests and predictions, storing this information in S3 for analysis and monitoring data quality over time.
  • 📉 Model Monitor can detect data quality issues such as missing features, mistyped features, or drifting features, which may degrade prediction quality if not addressed.
  • 📝 The speaker emphasizes the importance of cost optimization in using SageMaker, covering topics like using managed services, spot training, and elastic inference to reduce expenses.
  • 📚 The session mentions various resources for learning more about SageMaker, including documentation, AWS blog posts, YouTube videos, and podcasts, encouraging attendees to explore these for further insights.
  • 🚀 The speaker highlights the productivity improvements offered by SageMaker Debugger and Model Monitor, suggesting they can save users significant time and frustration in model development and monitoring.
  • 🗓️ The session ends with an invitation for attendees to ask questions, indicating the speaker's availability for further discussion and assistance on Twitter.

Q & A

  • What is the main focus of the session on Amazon SageMaker?

    -The session focuses on Amazon SageMaker Debugger and Model Monitor, explaining how they help in inspecting model training and finding data quality and prediction quality issues once models are deployed.

  • Where can the slides and recording of the session be found?

    -The slides can be found in the handout tab on the control panel, and the recording will be sent in a follow-up email after the event.

  • What is the dataset used in the notebook for training the model?

    -The dataset used is a direct marketing dataset, which is a supervised learning problem classifying customers into two classes based on whether they accept an offer or not.

  • What is the purpose of using spot instances in SageMaker?

    -Spot instances are used to save money on training costs. Users specify a maximum training time and a maximum total time (training plus time spent waiting for spot capacity), which controls how long they are willing to wait for spot instances to become available.
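
    For reference, here is a minimal sketch of managed spot training with the SageMaker Python SDK, using current (v2) parameter names; the session predates SDK v2, so the recorded code likely used the older train_* names. The bucket, role ARN, and XGBoost version are placeholders.

    ```python
    import sagemaker
    from sagemaker.estimator import Estimator
    from sagemaker.image_uris import retrieve

    session = sagemaker.Session()
    role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder role ARN

    # Built-in XGBoost container for the current region (version is an assumption)
    xgb_image = retrieve("xgboost", session.boto_region_name, version="1.0-1")

    estimator = Estimator(
        image_uri=xgb_image,
        role=role,
        instance_count=1,
        instance_type="ml.m4.2xlarge",
        output_path="s3://my-bucket/xgb-dm/output",  # placeholder bucket
        use_spot_instances=True,  # train on spare EC2 capacity
        max_run=3600,             # max training time, in seconds
        max_wait=7200,            # training time + time spent waiting for spot
    )
    ```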

  • How does SageMaker Debugger save model information during training?

    -SageMaker Debugger saves tensors, which are high-dimensional arrays representing the state of the model, periodically during the training job. This model state is saved to S3.
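
    Saving tensors is configured through a DebuggerHookConfig attached to the estimator. A minimal sketch, assuming the same metrics and feature_importance collections used in the demo (the S3 path is a placeholder):

    ```python
    from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

    hook_config = DebuggerHookConfig(
        s3_output_path="s3://my-bucket/xgb-dm/debugger",  # placeholder output path
        collection_configs=[
            # save_interval="1" saves every step; larger intervals reduce S3 volume
            CollectionConfig(name="metrics", parameters={"save_interval": "1"}),
            CollectionConfig(name="feature_importance", parameters={"save_interval": "1"}),
        ],
    )
    # Passed to the Estimator as debugger_hook_config=hook_config
    ```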

  • What is the purpose of defining rules in SageMaker Debugger?

    -Rules in SageMaker Debugger are used to check for unwanted conditions during the training job. They can be built-in or custom Python code to inspect tensors and ensure the training job is not suffering from issues like class imbalance, vanishing gradients, or exploding tensors.
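
    A sketch of attaching the built-in class_imbalance rule discussed in the session; each rule spawns its own job that inspects the saved tensors in parallel with training. Other built-ins mentioned in the talk are shown commented out.

    ```python
    from sagemaker.debugger import Rule, rule_configs

    rules = [
        Rule.sagemaker(rule_configs.class_imbalance()),
        # Other built-in rules mentioned in the session:
        # Rule.sagemaker(rule_configs.loss_not_decreasing()),
        # Rule.sagemaker(rule_configs.vanishing_gradient()),
        # Rule.sagemaker(rule_configs.exploding_tensor()),
    ]
    # Passed to the Estimator as rules=rules; if a rule fires, its job reports
    # IssuesFound and the training job can be stopped to avoid wasting money.
    ```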

  • How can feature importance be visualized using SageMaker Debugger?

    -Feature importance can be visualized by accessing the specific tensor by name, getting all the steps, and then plotting the values for each step using a tool like matplotlib.
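
    A sketch of reading tensors back with the open-source smdebug library; the tensor name used below is illustrative, and the names actually saved can be listed with trial.tensor_names().

    ```python
    import matplotlib.pyplot as plt
    from smdebug.trials import create_trial

    # estimator is the trained SageMaker Estimator from the earlier sketches
    trial = create_trial(estimator.latest_job_debugger_artifacts_path())

    def tensor_history(trial, tensor_name):
        """Return the saved steps and the tensor value at each step."""
        tensor = trial.tensor(tensor_name)
        steps = tensor.steps()
        return steps, [tensor.value(step) for step in steps]

    # List what was actually saved, then plot one tensor's history over the steps
    print(trial.tensor_names(collection="feature_importance"))
    steps, values = tensor_history(trial, "train-auc")  # illustrative tensor name
    plt.plot(steps, values)
    plt.show()
    ```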

  • What is the role of Amazon SageMaker Model Monitor in the session?

    -Amazon SageMaker Model Monitor helps in capturing data sent to the model in production, saving incoming and outgoing data (request and response) to S3, and running analytics to detect data quality and prediction quality issues.
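
    Data capture is enabled by passing a DataCaptureConfig at deployment time. A minimal sketch (the endpoint name, instance type, and S3 path are placeholders):

    ```python
    from sagemaker.model_monitor import DataCaptureConfig

    capture_config = DataCaptureConfig(
        enable_capture=True,
        sampling_percentage=100,  # capture every request and response
        destination_s3_uri="s3://my-bucket/xgb-dm/data-capture",  # placeholder
    )

    predictor = estimator.deploy(
        initial_instance_count=1,
        instance_type="ml.m5.xlarge",     # placeholder instance type
        endpoint_name="xgb-dm-endpoint",  # placeholder endpoint name
        data_capture_config=capture_config,
    )
    # Requests and predictions land in S3 as JSON Lines files under the
    # destination prefix; an existing endpoint can be updated the same way
    # by applying a new endpoint configuration that includes data capture.
    ```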

  • How can baseline statistics be generated for Model Monitor?

    -Baseline statistics are generated by uploading the training set to S3 and creating a baseline using SageMaker Processing. This process computes statistics on the training set, such as feature types, ranges, distributions, and constraints.
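
    Baselining is driven by a DefaultModelMonitor, which runs a SageMaker Processing job under the hood. A sketch, assuming a headerless CSV training set; paths, instance size, and the role ARN are placeholders.

    ```python
    from sagemaker.model_monitor import DefaultModelMonitor
    from sagemaker.model_monitor.dataset_format import DatasetFormat

    role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder role ARN

    monitor = DefaultModelMonitor(
        role=role,
        instance_count=1,
        instance_type="ml.m5.xlarge",  # placeholder instance type
        volume_size_in_gb=20,
        max_runtime_in_seconds=1800,
    )

    monitor.suggest_baseline(
        baseline_dataset="s3://my-bucket/xgb-dm/train/train.csv",  # placeholder
        dataset_format=DatasetFormat.csv(header=False),
        output_s3_uri="s3://my-bucket/xgb-dm/baseline",
        wait=True,
    )
    # The job writes statistics.json (feature types, ranges, distributions) and
    # constraints.json (what "clean" data should look like) to the output prefix.
    ```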

  • What is the significance of creating a monitoring schedule in Model Monitor?

    -A monitoring schedule in Model Monitor is used to periodically analyze the captured data and compare it with the baseline statistics. It helps in detecting discrepancies and alerting to problems like missing features, mistyped features, or drifting features.
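
    A sketch of an hourly monitoring schedule that compares captured traffic against the baseline; the schedule name and report path are placeholders, and monitor and predictor are the objects from the previous sketches.

    ```python
    from sagemaker.model_monitor import CronExpressionGenerator

    monitor.create_monitoring_schedule(
        monitor_schedule_name="xgb-dm-monitoring",               # placeholder name
        endpoint_input=predictor.endpoint_name,
        output_s3_uri="s3://my-bucket/xgb-dm/monitor-reports",   # placeholder
        statistics=monitor.baseline_statistics(),
        constraints=monitor.suggested_constraints(),
        schedule_cron_expression=CronExpressionGenerator.hourly(),
    )

    # Each execution writes a constraint-violations report listing missing,
    # mistyped, or drifted features. Note the schedule must be deleted before
    # the endpoint itself can be deleted:
    # monitor.delete_monitoring_schedule(); predictor.delete_endpoint()
    ```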

  • What are some cost optimization strategies mentioned in the session?

    -Cost optimization strategies include using managed services like EMR or Glue for data preparation, stopping and right-sizing notebook instances, using local mode, managed spot training, optimizing models with automatic model tuning and Autopilot, using Elastic Inference, and leveraging AWS Inferentia custom chips for high-throughput prediction.

Outlines

00:00

📘 Introduction to Amazon SageMaker Debugger and Model Monitor

The speaker introduces the session on Amazon SageMaker, focusing on the Debugger and Model Monitor features. They explain that the Debugger helps to inspect model training processes by saving tensors and model states to S3, which can later be analyzed for issues or metrics plotting. Model Monitor is highlighted for its ability to detect data and prediction quality issues post-deployment. The session promises an in-depth look at these tools using a notebook available on GitHub, with a practical example of building a classification model on a direct marketing dataset. The speaker also mentions the availability of slides and a recording for future reference.

05:02

🔍 Setting Up SageMaker Debugger for Model Training

The speaker details the process of setting up SageMaker Debugger for a training job. They explain how Debugger saves model information such as tensors, parameters, gradients, and weights during training to S3. This feature allows for the inspection of the training job for any undesirable conditions or analysis of metrics. The speaker demonstrates how to configure Debugger by defining tensor collections and setting a save interval. They also discuss the use of built-in rules to monitor for specific conditions like class imbalance during training, using a real-world example of a highly imbalanced dataset.

10:04

🚀 Launching the Training Job with Debugger and Model Monitor

The speaker proceeds to launch a training job with the configured Debugger settings. They describe the process of setting hyperparameters and initiating the training job, while emphasizing the cost-saving benefits of using spot instances for training. The speaker explains how the Debugger runs in parallel with the training job, checking for predefined rules and saving model states to S3. They also mention the ability to stop the training job if a rule is triggered, to prevent wasting resources on a failing job.

15:05

📊 Analyzing Model States and Tensors with SageMaker Debugger

After the training job is completed, the speaker explains how to explore the saved tensors in S3 using the SageMaker Debugger SDK. They demonstrate how to access and plot the history of specific tensors over the training steps, using the area under the curve (AUC) as an example metric. The speaker also shows how to analyze feature importance to understand which features contribute most to the model's predictions, using a plot to visualize the importance of different features.

20:06

🌟 Advanced Use Cases of SageMaker Debugger

The speaker provides an overview of advanced use cases for SageMaker Debugger, such as model pruning, where unnecessary parts of a deep learning model are removed during training to reduce model size without significantly impacting accuracy. They highlight the extensive examples available in the SageMaker GitHub repository, which showcase how to utilize Debugger for deep insights into model behavior and performance.

25:08

🔎 Implementing Data Capture with SageMaker Model Monitor

The speaker shifts focus to SageMaker Model Monitor, starting with its data capture functionality. They explain how to enable data capture when deploying a model, which involves saving incoming and outgoing data to S3 for analysis. The speaker demonstrates capturing data by sending test samples to a real-time endpoint and shows how the captured data can be accessed and analyzed.

30:10

📈 Establishing a Baseline for Model Monitor

To utilize Model Monitor effectively, the speaker describes the process of establishing a baseline using the training dataset. They explain how to upload the training data to S3 and use SageMaker Processing to compute statistics and constraints that define what 'clean' data looks like. This baseline is crucial for later comparison with real-world data to detect discrepancies.

35:13

🚨 Detecting Data Drifts with Model Monitor

The speaker discusses how Model Monitor can detect data drifts by comparing incoming data against the established baseline. They demonstrate setting up a monitoring schedule to periodically analyze the captured data for deviations from the training set. The speaker also simulates data corruption to illustrate how Model Monitor can alert users to such issues, pointing out the importance of this feature in maintaining model accuracy over time.

40:14

🛡️ Cost Optimization Strategies in SageMaker

Towards the end of the session, the speaker touches on cost optimization strategies in SageMaker. They mention various techniques such as using managed services for data preparation, stopping and sizing notebook instances appropriately, leveraging spot training, and considering elastic inference for deployment. The speaker encourages attendees to read a detailed blog post on cost optimization for more insights and to share any additional techniques they might have.

📚 Additional Resources and Closing Remarks

In the final part of the session, the speaker provides a list of resources for further learning, including the SageMaker documentation, AWS blog posts, a Medium blog, a YouTube channel, and a podcast. They invite attendees to follow them on Twitter for more updates and assistance, emphasizing their openness to answering questions and providing guidance on SageMaker and related topics.

Keywords

💡Amazon SageMaker

Amazon SageMaker is a fully managed service that provides developers and data scientists the ability to build, train, and deploy machine learning models quickly. It is central to the video's theme, showcasing its capabilities in model training and monitoring. The script discusses using SageMaker for training models, debugging, and monitoring, highlighting its efficiency and integration with other AWS services.

💡Debugger

In the context of the video, 'Debugger' refers to SageMaker Debugger, a feature that enables developers to inspect and understand the model training process by saving model states and analyzing tensors. It is a key concept as the script delves into how Debugger can help identify issues such as class imbalance or vanishing gradients during training, using rules to monitor for undesirable conditions.

💡Model Monitoring

Model Monitoring is another core concept in the video, referring to the capabilities of SageMaker Model Monitor. It helps in detecting data quality and prediction quality issues post-deployment by capturing and analyzing data sent to the model in production. The script explains how it can be used to ensure the model's predictions remain accurate over time by monitoring for data drift or other anomalies.

💡XGBoost

XGBoost, short for eXtreme Gradient Boosting, is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. In the script, XGBoost is used as an example algorithm for building a classification model on the direct marketing dataset, demonstrating SageMaker's support for popular machine learning frameworks.

💡Direct Marketing Dataset

The Direct Marketing Dataset is a real-world dataset used in the script to illustrate the process of training a classification model. It classifies customers into two categories based on whether they accepted an offer or not. The dataset is used to demonstrate the practical application of SageMaker's capabilities in a supervised learning context.

💡Spot Instances

Spot Instances in the video refer to Amazon EC2's spare computing capacity offered at a discounted price. The script mentions using Spot Instances for training machine learning models in SageMaker to save on costs, highlighting the cost-effectiveness of using such instances when the model training time is flexible.

💡Feature Importance

Feature Importance is a concept discussed in the script that helps in understanding which features in the dataset contribute the most to the predicted outcome. It is illustrated through the use of SageMaker Debugger, showing how it can be used to analyze the impact of different features on the model's predictions, enhancing model explainability.

💡Data Capture

Data Capture is a functionality of SageMaker Model Monitor that is explained in the script. It involves saving the incoming and outgoing data (requests and predictions) sent to a deployed model's endpoint to S3 for later analysis. This feature is crucial for auditing, debugging, and understanding the performance of the model in production.

💡Baseline

In the context of model monitoring, a 'Baseline' is a set of statistical properties derived from the training data that serves as a reference point. The script explains how a baseline is created using SageMaker Processing and then used to compare against real-world data to detect discrepancies, which is essential for identifying data drift or other issues affecting model performance.

💡Elastic Inference

Elastic Inference is a service mentioned in the script that attaches just the right amount of GPU-powered acceleration to an endpoint instance, rather than provisioning a full GPU instance. It is part of the broader discussion on cost optimization in machine learning workflows. The script suggests that Elastic Inference can be a more cost-efficient option for deploying models compared to using full GPU instances.

Highlights

Introduction to Amazon SageMaker Debugger and Model Monitor for inspecting model training and deployed model performance.

Amazon SageMaker Debugger helps in saving model information and inspecting tensors during training.

Demonstration of using the XGBoost algorithm to build a classification model on a direct marketing dataset.

Explanation of basic data preprocessing and uploading datasets to Amazon S3 for training, validation, and testing.

Utilization of an estimator for model training with Amazon SageMaker, including the use of spot instances to save costs.

Configuration of SageMaker Debugger to enable model state saving and rule checking during training.

Use of built-in rules in SageMaker Debugger to monitor for issues like class imbalance during training.

How to access and plot model metrics and feature importance using the SageMaker Debugger SDK.

Example of using SageMaker Debugger for model pruning in deep learning to optimize model size and performance.

Introduction to SageMaker Model Monitor for capturing and analyzing data sent to models in production.

Setting up data capture on a SageMaker endpoint to record incoming and outgoing data for analysis.

Creating a baseline for data using training set statistics to compare against real-world data for discrepancies.

Using a monitoring schedule to periodically check for data quality issues in production model data capture.

Detecting and reporting data violations that differ from the training set baseline with Model Monitor.

Cost optimization techniques for SageMaker, including managed services, spot training, and elastic inference.

Resources for further learning on SageMaker, including documentation, blog posts, YouTube channel, and Twitter.

Transcripts

play00:00

hi everyone welcome to this new session

play00:02

on Amazon sage maker if you have any

play00:05

questions please submit them the

play00:07

questions pane on the control panel and

play00:10

I will answer them at the end a copy of

play00:13

the slides can be found in the handout

play00:15

tab on the control panel and you will

play00:18

get a copy of the recording in a

play00:20

follow-up email after the event in this

play00:24

session I'm going to dive even deeper on

play00:27

sage maker and this is actually 0 slides

play00:32

or almost and we're going to talk about

play00:35

Amazon sage maker debugger and

play00:38

how it helps inspecting what's going on

play00:41

during model training and then we'll

play00:44

look at Amazon sage maker model monitor

play00:47

which helps you find data quality and

play00:53

prediction quality issues once your

play00:56

models have been deployed okay so let's

play01:00

jump straight into the notebook this

play01:03

notebook is available on github and here

play01:07

I'm going to use the XGBoost algo to

play01:11

build a classification model on the

play01:13

direct marketing data set so I'm not

play01:16

going to dive too deep on the data set

play01:19

and and the problem we're trying to

play01:21

solve I'm actually going deeper into

play01:23

this in the next session which is more

play01:28

obsessed about performance and accuracy

play01:31

here we want to inspect and and and

play01:35

monitor so it's not so much about

play01:37

getting great accuracy so in a nutshell

play01:40

this is a direct marketing dataset it's

play01:43

a supervised learning problem

play01:44

classifying customers into two classes

play01:48

customers who accept an offer customers

play01:50

who don't okay so kind of a yes or no

play01:54

problem so the first step is to download

play01:58

the data set extract it and we can take

play02:02

a look with pandas right so it's a CSV

play02:06

file a bunch of features and a y

play02:09

column saying yes or no

play02:11

did the customer accept the offer okay

play02:15

then I'm doing some some basic

play02:17

pre-processing but again I'm not going

play02:20

to to look at this now because I'm not

play02:23

concerned with this at the moment so

play02:27

basic processing then complete the data

play02:30

set and and upload everything to s3

play02:35

okay so I have a training set I have a

play02:38

validation set and I have a test set

play02:41

okay and I have the three locations for

play02:45

those three data sets okay so now we

play02:50

want to train a model okay so if you

play02:52

work with sage maker before you know

play02:54

this means using an estimator okay

play02:59

I'm using a built in algo here so I'm

play03:01

using this estimator object which is the

play03:04

the generic algo for for built in

play03:08

sorry the generic so I'm using the

play03:10

estimator object which is the generic

play03:13

object for built-in algos first

play03:16

I grab the name of the container image

play03:20

for XGBoost in the region I'm running

play03:22

in and I am configuring the estimator

play03:26

and yes it is quite a mouthful because I

play03:31

try to fit as much as I could in there

play03:33

so we're going to take it step by step

play03:35

let's look at the bits that we probably

play03:38

already know okay so the bits we all

play03:41

probably already know are these right we

play03:45

need to pass the name of the container

play03:48

so basically selecting the algo we want

play03:51

to we want to use we pass an IM role to

play03:55

give stage maker permissions to access

play03:57

s3 pull docker containers etc etc the

play04:02

session which is a technical object then

play04:04

we use file mode to say want to copy the

play04:08

data set to the instance before training

play04:10

we define the location the output

play04:14

location for the model and we define how

play04:18

much infrastructure we want okay so here

play04:20

we're training on one ml.m4.2xlarge

play04:23

instance okay

play04:25

so if you work with sage maker before

play04:27

and even if you haven't I guess you know

play04:30

these are very simple very reasonable

play04:32

okay

play04:33

the next bit is about using spot

play04:36

instances

play04:37

okay so spot instances are a great way

play04:39

to save money they've been available on

play04:41

ec2 for a long time they are now

play04:44

available on Sage maker so just say hey

play04:46

I want to use spot instances for

play04:49

training my max training time is this

play04:52

and my total time so training time plus

play04:55

waiting for spot is this okay

play04:59

so that's how you control how much time

play05:01

you're ready to wait for spot instances

play05:03

if they're in really high demand okay

play05:07

fine so let's look at the next bits okay

play05:10

so the next bits are actually new stuff

play05:14

let's look at this one first okay so

play05:18

this is the debugger configuration okay

play05:22

and as you can imagine this is an object

play05:25

coming from the sage maker debugger

play05:29

SDK okay and this is how we're going to

play05:33

enable sage maker debugger for this

play05:38

training job okay so let me explain what

play05:41

sage maker debugger does so what it does

play05:44

is as your job is running it's going to

play05:49

save model information so tensors okay

play05:54

tensors are high dimensional arrays that

play05:58

represent the state of the model

play06:01

parameters gradients weights if you use

play06:07

deep learning etc and that model state

play06:12

is going to be saved

play06:14

periodically during the training job and

play06:17

of course it's going to be saved to s3

play06:19

as you can see here okay so the high

play06:22

level idea is we save that stuff

play06:26

periodically to s3 okay and later on

play06:30

we're going to be able to look at it

play06:32

okay and understand what happened in

play06:35

that training job

play06:37

potentially looking for bizarre

play06:39

conditions or just you know plotting

play06:42

metrics and whatnot okay so that's a

play06:45

super easy way to save model state

play06:48

periodically okay and that's what we see

play06:51

here so we define collections for the

play06:56

different tensors so we have metrics and

play06:59

of course these are predefined ones

play07:01

we have feature importance that's a

play07:04

really nice one as we'll see telling us

play07:07

which features in the dataset

play07:09

contribute the most to the predicted

play07:12

outcome by XJ boost and then we pass the

play07:16

save interval okay so do we want to save

play07:19

at every step or every five steps or ten

play07:23

steps so here I'm saving all steps

play07:24

literally saving everything okay so

play07:28

that's what we're doing here

play07:31

and we'll see how this actually happens

play07:33

but this is how you configure it this is

play07:35

how you configure the saving part of the

play07:38

program okay but it's not just about

play07:42

saving okay it would be nice already

play07:45

free if we were able to to look at that

play07:48

model state after training is complete

play07:51

but we can actually define rules okay so

play07:55

we can actually ask sage maker debugger

play08:00

to check for unwanted conditions during

play08:04

the training job so we have a list of

play08:07

built-in rules that you'll find in

play08:10

documentation and you can add your own

play08:12

okay so you could write your own Python

play08:15

code to inspect your tensors and check

play08:19

for you know weird stuff happening there

play08:21

so here I'm using a rule a built in

play08:26

recalled class imbalance because as it

play08:29

turns out this is a very imbalanced data

play08:32

set about eight to one and well building

play08:36

classifiers for imbalanced datasets is

play08:39

more difficult okay so I want to make

play08:42

sure that these training jobs not

play08:44

suffering from that imbalance rule

play08:48

and I could further configure this but

play08:51

you'll find this information in the in

play08:55

the doc okay so here I'm just saying hey

play08:56

keep your keep an eye out for class

play09:00

imbalance weirdness

play09:01

during the training job and and use that

play09:05

stuff to look at metrics basically okay

play09:08

so that's what sage maker debugger does

play09:11

okay one it will save define tensor

play09:18

collections to s3 so that you can

play09:20

inspect them okay two it configures rules

play09:26

that are checked during training okay to

play09:31

make sure your training jobs not

play09:33

suffering for from something undesirable

play09:37

and and just weird

play09:39

okay so just on the top of my mind you

play09:43

can check for you know loss not

play09:45

decreasing and vanishing gradients and

play09:49

exploding tensors and a whole bunch of

play09:53

things yes they all have very funny

play09:55

names okay so please take a look at the

play09:58

doc for more alright so this is my

play10:01

estimator okay

play10:03

the usual part and more stuff for

play10:07

debugger okay next I'm just setting some

play10:13

hyper parameters okay and as you can see

play10:16

I'm actually very reasonable here for

play10:19

once I am NOT setting any crazy ones

play10:21

because once again I am not really

play10:24

trying to get to a high performance

play10:26

model here I'm just trying to show those

play10:30

those new capabilities if you're curious

play10:32

about optimizing hyper parameters etc

play10:37

that's the next session okay so don't

play10:39

miss it okay and then I co fit and my

play10:43

training job starts okay let's take a

play10:47

look at the log so we see the usual

play10:50

stuff start the training job launching

play10:53

instances etc and we see new stuff as

play10:56

well okay we see debugger rule status

play11:01

class imbalance

play11:01

in progress so and then we see pretty

play11:05

much though the training job as usual so

play11:08

what's this bit so what this means is

play11:11

based on the configuration above okay

play11:14

here we can figure one rule well we see

play11:19

sage maker firing up in parallel of the

play11:23

training job another job for that

play11:27

debugging rule okay and we we have one

play11:29

job per rule so if we'd configured let's say

play11:33

five debugging rules then we would see

play11:35

five debugging jobs okay and as you can

play11:38

imagine what these jobs do is they look

play11:42

in real-time as they become available

play11:44

they look at the tensors that are saved

play11:47

in s3 and they check for whatever

play11:51

condition they've been set up for okay

play11:53

so there's code looking at tensors and

play11:55

applying that logic trying to figure out

play11:58

yes or no is class imbalance a problem

play12:01

or it's lost not decreasing etcetera

play12:04

etcetera okay so that's a separate job

play12:07

running in parallel okay alright so we

play12:13

see our job we see very nice savings

play12:18

from spot okay so 66.7% that's very nice

play12:24

we saved two thirds of the on the

play12:28

training cost so make sure you know how

play12:30

to use spot instances and decrease your

play12:32

sage maker bill okay we'll talk

play12:35

about cost optimization some more at the

play12:37

end as promised in the session

play12:41

description but as you can see spot is

play12:45

already a very nice way to save money

play12:47

okay I can check the condition of that

play12:52

debugging job

play12:53

as the training job is going okay so

play12:57

here when I check was in progress now

play12:58

for sure it's done and if something

play13:04

happens which I don't think happen here

play13:06

if the rule is actually triggered then

play13:10

the debugging job stops and your

play13:13

training job stops as well because

play13:15

there's really no point in continuing

play13:17

training if something is not going well

play13:20

it's just a waste of time and money so

play13:23

if a rule is triggered then the

play13:26

debugging job will let you know

play13:27

something went wrong and and the

play13:29

training job the corresponding training

play13:31

job will be stopped

play13:32

okay so no need to if you have vanishing

play13:36

gradients for deep learning job that

play13:38

train for seven days well you know you

play13:42

might as well save that time and money

play13:43

okay so once the job is over so whether

play13:48

it successfully completed or not you can

play13:53

go and explore those tensors in s3 okay

play13:56

and you need to do that using that SDK

play14:00

for sage maker debugger

play14:05

basically find the path for the artifact

play14:10

so basically where did we save the

play14:12

tensors and then create a trial from

play14:15

from that okay and once you have that

play14:19

then you can start exploring your data

play14:25

okay so there's not enough time to cover

play14:27

the the full api of the the trials SDK

play14:32

but again all the documentation is

play14:34

online but basically you can see you can

play14:39

access a specific tensor by a name and

play14:42

you can get all the steps okay remember

play14:45

we save data periodically okay so you

play14:48

can get the tensor values for every step

play14:51

and and then you get here we're building

play14:54

as you can see we're building a Python

play14:56

list and returning a list of steps and a

play15:00

list of values okay so we we have

play15:03

basically only history for that tensor

play15:05

over time okay and we can plot it using

play15:08

matplotlib so here's an example okay

play15:13

where we plot for that trial we plot our

play15:17

metrics so we can see in that metrics

play15:19

collection we actually had two

play15:22

individual values we had the area under

play15:26

curve which was the

play15:28

the metric configured for that training

play15:30

job and we have that for the training

play15:32

set and we have that for the validation

play15:34

set okay so we can see all the values

play15:36

over time and and that's already very

play15:41

useful right pretty easy to do that okay

play15:44

and I guess we could have continued

play15:47

training for a bit more or not but

play15:49

doesn't matter at this point okay so you

play15:53

can easily access just like that okay

play15:55

basically those those two three lines

play15:57

okay access tensors by name and get all

play16:02

their steps and get all the values for

play16:05

the steps okay again full history over

play16:08

those model model values and model

play16:13

states okay so that's pretty cool you

play16:16

don't need to write any bespoke code to

play16:18

do that

play16:20

remember we saved another collection we

play16:23

saved feature importance okay so feature

play16:27

importance once again will tell us which

play16:32

feature contributes most to the

play16:36

prediction okay and our our data set

play16:40

here as sixty plus features because we

play16:45

use one-hot encoding on

play16:47

categorical variables etc so it's

play16:49

actually much more than the 20 columns

play16:51

we have originally and and we can see

play16:55

that okay so if we plot that then we see

play16:59

that feature okay the orange one I

play17:02

believe is this one f1 ok and this one

play17:08

is f5 okay so we see that feature 1 and

play17:15

feature 5 are the important ones okay

play17:20

which is good because that's exactly

play17:21

what the comment says here and so feature

play17:25

one is actually the job it's the number

play17:27

of the column right in the in the CSV

play17:30

file so there's no well there's no magic

play17:33

on those numbers so does that person

play17:36

have a job and

play17:38

how is it housed is it you know is it

play17:43

renting or is that person renting or is

play17:46

that person owning and these are I guess

play17:49

important factors in you know how much

play17:52

money you can generally spend so so not

play17:56

surprising to see them as highly

play17:58

contributing to whether you would accept

play18:00

a marketing offer or not okay so this is

play18:04

really really cool because you certainly

play18:08

heard about model explainability well

play18:11

this is it right so if you train the

play18:14

classification model and you see that

play18:17

okay job and housing are important

play18:19

factors then you can compare that to the

play18:23

to the legacy solution that you have

play18:25

maybe it's an IT application or maybe

play18:28

it's just humans looking at forms and

play18:31

deciding yes or no should they get you

play18:34

know should they get that offer and you

play18:38

know it helps you understand what's

play18:40

going on inside the model okay and this

play18:42

is super easy just save the tensors and

play18:45

and just plot that stuff okay again

play18:49

there are some additional details here

play18:51

on feature importance and it's all in the

play18:55

doc but you can see just copy paste that

play18:57

code it will work out of the box with

play19:00

with your own model okay so there you

play19:03

there you are that's a first example of

play19:09

sage maker debugger okay so you can do

play19:14

much more I'm just gonna give you a

play19:16

taste of what's possible here and I'm

play19:19

gonna jump to the Amazon sage maker

play19:23

examples repository on github which is

play19:27

amazing and it has hundreds and hundreds

play19:30

of notebooks showing you how to use sage

play19:33

maker in all kinds of ways and there's a

play19:36

specific directory here for sage maker

play19:40

debugger and well I have to say these

play19:43

are even more amazing honestly I'm still

play19:45

going through some of those but here you

play19:48

see you know how deep

play19:51

down you can go and literally rip the

play19:54

guts of your model and and there are

play19:59

some really amazing examples on deep

play20:02

learning right so if you want to look at

play20:06

one of them I guess we can maybe take a

play20:08

look at this one really quickly so this

play20:12

one shows you how to do model pruning so

play20:15

model pruning is an advanced technique

play20:17

where during training you look at deep

play20:22

learning connections right so neuron

play20:27

connections and you figure out how much

play20:30

they contribute to the output so it's

play20:32

kind of similar to that feature

play20:34

importance example that we saw except

play20:37

here we go one step further and we say

play20:40

hey if a certain filter if a certain a

play20:46

convolution operation does not

play20:48

contribute to to the outcome then we

play20:51

remove it okay so this is an amazing

play20:53

notebook and I'll just go all the way to

play20:57

the end okay and this this animation

play21:02

here shows you over time over ten

play21:06

iterations that you know we keep

play21:09

removing parameters in that deep

play21:12

learning model okay it's a PI torch

play21:14

model and of course we shrink the model

play21:18

accordingly right we go from 200 plus

play21:22

Meg's to you about 70 Meg's and we

play21:25

hardly lose any accuracy so we shrink

play21:28

the model by a factor of three and we

play21:31

hardly lose any accuracy because we just

play21:33

drop parts of the of the model that just

play21:37

do nothing for us okay so this is a very

play21:41

very advanced example and if you're into

play21:45

deep learning and computer vision I mean

play21:47

yeah you're gonna love this one but not

play21:51

all of them are as hardcore as this we

play21:55

have some XGBoost examples as well

play21:58

we have some you know slightly slightly

play22:02

easier ones but again if you want to

play22:05

really really dive deep on on sage maker

play22:08

debugger just go check those notebooks

play22:10

spend time to read the code and run them

play22:13

and and again you'll be able to tear

play22:16

your model apart and understand

play22:19

exactly what's going on okay so really

play22:22

really cool stuff here okay

play22:26

so that's that's it for sage maker

play22:30

debugger we could spend you know hours

play22:32

on this but hey I'm I'm limited for time

play22:36

ask me your questions happy to answer

play22:38

questions after after the session okay

play22:42

so we saw how to save model state

play22:44

inspect expected etc now let's talk

play22:48

about another capability which is sage

play22:51

maker model monitor okay so sage maker

play22:54

model monitor will help us do two

play22:56

things first it's gonna help us save

play23:00

data that is sent to our model in

play23:04

production okay so let's say we deploy

play23:07

the model on a real-time endpoint and

play23:09

we're able to capture data sent to that

play23:13

endpoint and we're able to capture

play23:16

predictions okay so incoming and

play23:19

outgoing data request response if you

play23:22

will save everything to s3 and

play23:26

look at it run analytics etc and we're

play23:30

gonna be able to do much more but I'm

play23:33

gonna keep some some tension here let's

play23:36

just look at capturing data first so

play23:39

going back to the model that I trained

play23:42

I'm going to call deploy on that

play23:44

estimator as usual okay

play23:46

give an endpoint name give some

play23:49

infrastructure requirements and AHA

play23:53

there's new stuff again we pass the data

play23:56

capture config coming from here okay

play24:00

from the model monitor SDK and what do

play24:04

we say here we say hey please capture

play24:06

okay

play24:07

yes that kind of makes sense please

play24:11

capture everything okay 100 percent of

play24:14

data right you could sample down if you

play24:17

want it

play24:18

by default we are going to capture

play24:21

incoming and outgoing data so there are

play24:23

samples and predictions and we're going

play24:26

to put all that stuff in s3 here okay

play24:30

very simple object okay so as you can

play24:35

see here we are enable data capture at

play24:38

deployment time but if you have an

play24:41

existing endpoint you can do the same

play24:44

okay all you have to do is you create a

play24:47

new endpoint configuration with a data

play24:50

capture config object and you update the

play24:54

endpoint configuration with that new

play24:56

with that you endpoint okay so that's so

play24:59

that's all it takes right all right

play25:02

after a few minutes this is live and

play25:05

it's an endpoint so we can send it some

play25:08

data right so just loading some test

play25:11

samples using the invoke endpoint API

play25:15

from boto 3 sending data getting

play25:19

predictions back okay now this is just

play25:21

an excuse to to log some stuff of course

play25:26

ok and if I look in my capture path I

play25:32

can see files right so I can see our

play25:34

file here ok

play25:37

adjacent lines file containing incoming

play25:40

and outgoing data and I can copy it to

play25:44

my local notebook instance and I can

play25:47

take a look at it ok and well what do we

play25:52

see exactly what we thought we'd see I

play25:54

suppose we see input data which is CSV

play26:00

data right and then I see my my data

play26:03

point here all the way through here ok

play26:08

my features and then I see the output ok

play26:13

and the output is basically the 0 to 1

play26:18

probability for that sample ok remember

play26:21

it's a binary classification model so we

play26:23

get a probability between 0 and 1 ok and

play26:26

we see how a whole bunch of that ok

play26:30

so again this is already very useful

play26:32

because you know if you want to to

play26:37

monitor our data if you want to capture

play26:40

data and replay it right if you want to

play26:43

do back testing you could say well okay

play26:45

let's capture real real life data and we

play26:48

can replay that stuff in a dev or test

play26:52

environment you know no code to write

play26:55

the only thing that we've done was that

play26:57

data capture config object on on the

play27:02

endpoint right so this is already pretty

play27:04

nice okay and then we do batch prediction

play27:08

because why not okay and that's it okay

play27:13

so this is the capture part right but

play27:17

model monitor actually goes a little

play27:21

further than this okay and once again

play27:23

let me show you more examples so in that

play27:28

same repo okay sage maker examples

play27:32

you have a directory for model monitor

play27:35

and you have some examples here and I'm

play27:38

gonna show you a little bit more because

play27:41

I have a little bit more time okay so in

play27:46

this notebook we're actually using a

play27:48

different data set but again it doesn't

play27:50

really matter we can just focus on on

play27:53

the model monitor part so what we do

play27:56

here is actually we take an existing

play27:59

model okay this is a churn prediction

play28:01

model so probably again a binary

play28:04

classifier a model that has already been

play28:07

trained okay and we import it we deploy

play28:12

it on sage maker we set up data capture

play28:16

exactly the same way capture everything

play28:20

we deploy it right and this is a good

play28:24

example of the modularity of sage maker

play28:26

see we're just taking a model that you

play28:29

could have trained on on another machine

play28:31

on your laptop maybe and you can very

play28:33

easily deploy it on sage maker okay then

play28:37

we send it some data okay just like

play28:40

we've done in the previous example we

play28:43

see capture file

play28:44

we can see what's inside those files

play28:47

okay so Jason lines format

play28:52

exactly the same all right so this is

play28:54

really what I've done in my previous

play28:56

example but again we can go further we

play28:59

could say okay so we have data capture

play29:02

we have that stuff ready to ready to run

play29:07

and actually already running so now we

play29:10

can say well it'd be nice if we could

play29:13

compare incoming data

play29:16

okay real life data sent to my endpoint

play29:21

to the data that I use to train the

play29:25

model okay and well you can absolutely

play29:28

do that so the first step is to generate

play29:31

a baseline okay so generating a baseline

play29:34

means you're going to compute some

play29:37

statistics using the training set okay

play29:41

so here we upload the training set to s3

play29:44

and we create a baseline okay so we

play29:51

launched a specific job that will load

play29:56

the training set as you can see here and

play30:00

it's going to compute all kind of

play30:02

statistics on it it's going to figure

play30:04

out feature types feature ranges feature

play30:10

distributions etc etc okay if you're a

play30:13

data scientist you certainly do that

play30:16

manually already okay but here you can

play30:18

automate that okay so we can see that

play30:21

job running here and by the way this is

play30:24

based on another sage maker

play30:26

capability called sage maker processing that

play30:29

makes it easy to run scikit-learn or

play30:32

SPARC processing jobs on data and you

play30:36

can use it in many different ways

play30:37

pre-processing data or computing stats

play30:40

or you know running batches of a batch

play30:45

processing on your data pretty much okay

play30:48

that's a service in itself but hey it's

play30:51

integrated here okay so just compute

play30:55

that baseline and it runs for a bit okay

play30:58

let's not look at that okay and once the

play31:02

job is over we can see some results so

play31:08

we can see statistics on the data and

play31:11

constraints okay and basically what that

play31:15

means if we look at that data here we

play31:22

can see for each feature what type it is

play31:27

okay

play31:28

is it an integer is it is it a float

play31:32

or is it something else

play31:35

we can see if we have missing values for

play31:39

that feature okay so apparently not okay

play31:43

we have all features are present in our

play31:45

examples we can see stats okay so we see

play31:50

distributions using kll if you do that

play31:56

stuff for a living you know what I'm

play31:57

talking about if not I don't worry about

play32:00

it it's just a very fast way to compute

play32:02

distributions and there's a whole lot of

play32:07

stuff here

play32:07

right so if you're into stats you'll love

play32:09

this mean standard deviation etc etc

play32:13

okay all that stuff is just automated

play32:17

away okay

play32:21

and now that we know what clean data

play32:26

looks like okay hopefully the data set

play32:28

is a clean one of course we can compare

play32:33

incoming data to that okay and the way

play32:40

we're gonna do this is we're going to

play32:42

create a monitoring schedule which is

play32:46

going to look at captured data remember

play32:49

okay we're capturing incoming data and

play32:52

it's going to look periodically at that

play32:54

data and it's going to run those same

play32:58

statistics okay and constraints on that

play33:02

incoming data and it's going to look for

play33:04

our discrepancies okay it's going to

play33:06

look for differences so if everything is

play33:10

fine then okay

play33:12

if things are not fine it's gonna tell us

play33:13

and so this is going to alert us to

play33:16

problems like missing features mistyped

play33:22

features or drifting features which are

play33:26

even worse where the distribution of a

play33:28

feature is now different because

play33:31

whatever you know because the real world

play33:34

is ever changing and I think we have

play33:36

good proof at the moment so maybe the

play33:40

hypotheses that were true on your data

play33:43

set a month ago are not true anymore

play33:46

and of course this would mess with your

play33:50

predictions very badly because all those

play33:53

machine learning algorithms use

play33:55

statistics and distributions so if those

play33:59

hypotheses are you know shifting then

play34:03

predictions will shift and and the

play34:06

quality of your predictions are going to

play34:08

degrade over time okay and this is a

play34:11

very nasty problem and it's very

play34:13

difficult to track yes maybe you see

play34:16

your business KPI going down because

play34:18

your predictions are not so relevant but

play34:20

why you know why is that KPI going down

play34:23

well okay this could be one of the

play34:25

reasons okay and of course you could

play34:28

just be bugs right maybe something in

play34:31

your ETL workflow is broken and and all

play34:36

of the sudden you know data is not it's

play34:39

not what it should be or maybe a web app

play34:41

upstream is just you know it's just

play34:45

buggy and and dropping features or

play34:47

adding extra crappy features whatever

play34:50

you know it's software anything can

play34:51

happen and and all of it would impact

play34:54

your models so that's that's not very

play34:56

good okay so that monitoring schedule is

play35:00

what is going to fix that for you okay

play35:04

all right so next we're going to start

play35:09

generating traffic and and of course we

play35:13

break it on purpose okay and we break it

play35:17

on purpose

play35:18

because we're applying buggy

play35:22

pre-processing to to that data

play35:25

okay so so this is buggy code that

play35:28

arbitrarily and randomly breaks incoming

play35:32

data okay so take a look at that and so

play35:35

that traffic is gonna be it's gonna be

play35:38

bad that traffic right it's gonna be

play35:40

garbage and so after a while you know

play35:44

once or monitoring schedule kicks off

play35:48

okay and of course here I think we have

play35:51

an hourly schedule but you can you can

play35:54

configure that so after an hour it's

play35:56

gonna it's gonna fire up and and of

play36:00

course it's going to crunch the data

play36:03

that we captured and again remember this

play36:07

is bad data because we literally broke

play36:10

it for testing purposes and we then see

play36:14

that oh that monitoring schedule

play36:17

did run but and it completed but it

play36:22

detected violations so violations are

play36:24

basically data that doesn't look like

play36:29

the training set okay which is what we

play36:32

want here we broke it okay so we can go

play36:35

and grab the reports for those

play36:39

monitoring schedules and there is a

play36:42

violation report which we can visualize

play36:45

and here well what did we do well I

play36:52

guess we broke yeah so we expected

play36:56

integers and maybe we did pass strings

play36:59

or something so I think we you know we

play37:01

messed up we messed up a number of

play37:04

features yeah in the processing script

play37:09

and these are picked up okay

play37:11

so again this is just one of the

play37:14

violation here we just messed with the

play37:17

datatype but if you had different

play37:20

statistical properties those would be

play37:22

highlighted as well okay so this is a

play37:24

really really cool capability if you ask

play37:28

me because it it just runs in the

play37:30

background you know and it will catch it

play37:32

will catch that stuff and and then you

play37:35

can go and look at those and try to

play37:37

understand okay

play37:38

is that my ETL chain you know messing

play37:43

with my data or did a feature disappear

play37:46

from my data set because maybe you know

play37:50

maybe my web app is not logging it any

play37:52

longer you know it basically points you

play37:55

at the the problem and then you can go

play37:59

and investigate more but at least you

play38:01

know what to look for you know what was

play38:04

wrong in that sample that you received

play38:06

okay and then you can you know you can

play38:10

start and start your schedules and you

play38:12

can delete them if you want just so you

play38:15

know you can't delete an endpoint

play38:18

if it has an active monitoring schedule so

play38:20

you need to make sure if you get that

play38:22

error you need to delete the monitoring

play38:25

schedule first and then you'll be able

play38:27

to delete the endpoint okay all right

play38:31

well I think that's it for for model

play38:33

monitor and again we have more more

play38:35

notebooks here and including

play38:38

visualization etc etc so both these

play38:43

debugger and model monitor notebooks

play38:47

are really really awesome so spend some

play38:50

time you know read documentation first

play38:52

go through the basic examples and then

play38:54

you can dive deep into that and and set

play38:57

this up and both capabilities really

play38:59

will save you so many hours of

play39:03

frustration trying to understand why

play39:06

it's your training job not going right

play39:08

and why is my model not predicting right

play39:10

and these are great great productivity

play39:14

improvements and and we get a really

play39:17

good feedback from customers so well

play39:19

please try them out and let us know okay

play39:22

just just a few more things I promised I

play39:26

would talk about cost optimization for a

play39:28

second so the reason why I'm including

play39:33

this in these sections because usually I

play39:37

see a lot of customers who you know

play39:39

first they try to get a hang of sage

play39:41

maker and then they get productive and

play39:43

they deploy and they really love the

play39:45

fact that they can launch all that

play39:48

infrastructure and amount etc etc and

play39:51

and then you know it scales very nicely

play39:53

but if you don't pay attention then you

play39:57

could end up spending a little more

play40:00

money than you expect it okay so you

play40:03

have to be careful there and and I wrote

play40:06

a blog posts over a year ago already but

play40:10

it's been updated post re:Invent with

play40:13

all the new launches and and this is on

play40:16

my medium blog and i pretty much walk

play40:20

through all the steps so data

play40:23

preparation using managed services like

play40:29

EMR or glue which is a really

play40:33

cool tool for machine learning as well

play40:35

instead of trying to write your bespoke

play40:37

code on ec2 ground truth for labeling

play40:41

which will save you lots of time and

play40:44

money and then you know and working with

play40:49

notebook instances right stopping them

play40:51

when you don't need them right sizing

play40:54

them using the local mode I mean if

play40:56

you've never heard of all those things

play40:58

if local mode means nothing to you then

play41:01

you're probably spending too much money

play41:03

okay so go through that blog post check

play41:07

all the boxes and and send me a tweet

play41:09

telling me how much money you saved okay

play41:13

managed spot training we saw is a

play41:16

fantastic way to save you know easily 60

play41:19

70 percent on training jobs again

play41:21

right-sizing working with your data set

play41:26

in the right format streaming with pipe

play41:29

mode again if you have large data set

play41:31

and you've never heard of pipe mode

play41:33

please take a look I have a really

play41:37

fantastic guest post from Chaim Rand

play41:40

an engineer with Mobileye working at

play41:44

very large scale tensorflow and they

play41:47

this is really you know all the

play41:49

knowledge you need on pipe mode then

play41:52

optimizing models model tuning autopilot

play41:57

optimizing prediction etc etc so as you

play42:02

can see there are so many things you can

play42:04

do

play42:04

optimization elastic inference okay if

play42:08

you deploy on GPU instances and you

play42:10

never looked at elastic inference I can

play42:13

pretty much guarantee that you're you

play42:16

know probably wasting quite a bit of

play42:17

money so take a look here Inferentia is

play42:21

a great it's a great new capability as

play42:24

well with a custom chip for a super high

play42:27

throughput prediction you know much more

play42:31

efficient than GPUs etc etc right the

play42:34

list goes on and I keep updating this

play42:37

post every time so long story short if

play42:42

you've never worried about cost

play42:43

optimization and you know even if you're

play42:46

working at small scale and if you're

play42:49

working with GPU instances etc please

play42:51

take a look at this blog post I got a

play42:54

lot of good feedback on it and you know

play42:57

I want you to spend exactly what you

play42:59

need to spend and not a penny more so so

play43:02

please read this if you have other

play43:05

techniques to share happy to add them to

play43:08

the post again lots of money to save if

play43:12

you do things right here okay all right

play43:16

I think we're almost done so if if you

play43:21

want more content well of course you can

play43:24

go and read the sage maker documentation

play43:26

but I guess you figure that out I have

play43:28

plenty of machine learning blog post on

play43:32

the AWS blog and that's a good way to

play43:34

keep an eye out for new stuff because

play43:36

there will be new stuff all the time my

play43:40

medium blog which I just showed you my

play43:43

youtube channel where there's a quite a

play43:46

bit of sage maker videos and the video

play43:49

version of my podcast as well the audio

play43:52

podcast is on Buzzsprout and I'm always

play43:55

happy to to chat and answer questions on

play43:59

Twitter so feel free to ping me my

play44:01

direct messages are open and and you

play44:05

know don't hesitate if there's anything

play44:06

I can help you with or if you're looking

play44:08

for resources I can quickly point you to

play44:11

that

play44:11

okay so thanks again this was a pretty

play44:15

dense session on sage maker debugger

play44:17

and sage maker model

play44:18

monitor I hope you learned a lot and

play44:21

now we're available to answer your

play44:23

questions and once again thank you very

play44:26

very much for attending and I hope

play44:29

you're safe wherever you are okay see

play44:33

you soon bye bye


Related Tags

SageMaker, Model Training, Debugging Tools, Data Quality, Prediction Issues, Machine Learning, Cost Optimization, Spot Instances, Feature Importance, Model Monitoring