Top 6 ML Engineer Interview Questions (with Snapchat MLE)

Exponent
26 Feb 2024 · 20:05

Summary

TL;DR: In this insightful interview, machine learning engineer Raj from Snapchat discusses fundamental concepts such as training and testing data, hyperparameter tuning, and optimization algorithms like batch gradient descent. He addresses the challenges of non-convex loss functions, the importance of feature scaling, and the distinction between classification and regression. Raj also shares practical insights on model deployment, monitoring for concept drift, and strategies to handle exploding gradients, emphasizing the importance of domain-specific considerations in machine learning.

Takeaways

  • Training data is the portion of data used by a machine learning algorithm to learn patterns, while testing data is unseen by the algorithm and used to evaluate its performance.
  • Hyperparameters, such as the number of layers or learning rate in a neural network, are tuned using a validation set, which is a part of the training data.
  • The final model evaluation is performed on the test data set, which should not influence the learning process or hyperparameter tuning of the model.
  • Gradient descent optimization techniques include batch gradient descent, mini-batch gradient descent, and stochastic gradient descent, each with different approaches to updating model parameters.
  • Batch gradient descent uses the entire training set for each update, mini-batch gradient descent divides the training set into smaller groups, and stochastic gradient descent involves random shuffling and smaller batches.
  • The choice between different gradient descent techniques often depends on memory requirements and the desire to introduce noise to prevent overfitting.
  • Optimization algorithms do not guarantee reaching a global minimum in non-convex loss functions, often settling in a local minimum or saddle point.
  • Feature scaling is important for algorithms that use gradient-based updating, as it helps to stabilize and speed up convergence by normalizing different scales of features.
  • Classification predicts categories, while regression predicts continuous values; the choice depends on the nature of the outcome variable and the problem context.
  • Model refresh in production is triggered by a degradation in performance, which can be monitored through various metrics and by comparing with the training set performance.
  • Concept drift is a common reason for performance degradation in production, where the relationship between input features and outcomes changes over time.
  • Exploding gradients in neural networks can be mitigated by gradient clipping, batch normalization, or architectural changes like reducing layers or using skip connections.

Q & A

  • What is the purpose of training data in machine learning?

    -Training data is used by a machine learning algorithm to learn patterns. It helps in choosing the parameters of the model, such as those of a logistic regression algorithm, to minimize error on the training set.

  • Why is testing data important in machine learning?

    -Testing data is crucial as it is data that the algorithm has not seen before. It is used to evaluate the performance of the model without bias, ensuring that the model's performance is gauged on data other than what it was trained on.

  • What are hyperparameters in the context of machine learning?

    -Hyperparameters are parameters that are not learned from the data but are set prior to the training process. They include aspects like the number of layers in a neural network, the size of the network, or the learning rate. They are tuned using a validation set to maximize performance.

  • How does the validation set differ from the training set and test set?

    -The validation set is a portion of the training data used to tune hyperparameters. It is not used in the learning process of the algorithm but to adjust the model's hyperparameters. The training set is used to learn the model, and the test set is used to evaluate the final model's performance.
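
As a rough illustration of this three-way split (not shown in the interview itself), here is a minimal scikit-learn sketch; the toy data, 60/20/20 proportions, and random seeds are arbitrary choices:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data purely for illustration: 1,000 examples with 10 features.
X = np.random.randn(1000, 10)
y = np.random.randint(0, 2, size=1000)

# Hold out a test set first; it is never used for learning or tuning.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Split what remains into a training set and a validation set.
# The validation set is used only to compare hyperparameter settings.
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```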

  • What is the difference between batch gradient descent, mini-batch gradient descent, and stochastic gradient descent?

    -Batch gradient descent uses the entire training set to compute the gradient and update parameters at once. Mini-batch gradient descent divides the training set into smaller batches and updates parameters using each mini-batch. Stochastic gradient descent shuffles the training set and updates parameters using small random batches, introducing more randomness.
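
To make the three variants concrete, here is a small NumPy sketch of the update loops for a linear model with squared-error loss; the loss function, learning rate, epoch count, and batch size are illustrative assumptions, not details from the interview:

```python
import numpy as np

def grad(X, y, w):
    # Gradient of mean squared error for a linear model y ≈ X @ w.
    return 2.0 / len(X) * X.T @ (X @ w - y)

def train(X, y, mode="mini_batch", lr=0.01, epochs=10, batch_size=32):
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        if mode == "batch":
            # Batch gradient descent: one update per epoch on the full training set.
            w -= lr * grad(X, y, w)
            continue
        order = np.arange(n)
        if mode == "stochastic":
            # Stochastic variant as described above: shuffle before forming batches.
            np.random.shuffle(order)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # Mini-batch update: one parameter step per batch.
            w -= lr * grad(X[idx], y[idx], w)
    return w

# Toy usage with synthetic data (illustrative only).
X = np.random.randn(500, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(500)
print(train(X, y, mode="stochastic"))
```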

  • Why might one choose to use mini-batch gradient descent over batch gradient descent?

    -Mini-batch gradient descent can be chosen over batch gradient descent due to memory requirements, as it allows for the processing of smaller subsets of data that can fit into RAM or a GPU. It also adds noise to the gradient computation, which can act as a regularizer and help prevent overfitting.

  • Are optimization algorithms guaranteed to find a global minimum for non-convex loss functions?

    -No, optimization algorithms are not guaranteed to find a global minimum for non-convex loss functions. They often converge to a local minimum or a saddle point, which may still be a good solution depending on the performance on validation and test sets.
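
Raj also mentions (in the full transcript below) trying different parameter initializations to reach different minima. A hedged sketch of that idea, using random restarts of a small scikit-learn network and keeping whichever run scores best on the validation set, might look like this; the architecture, number of restarts, and synthetic data are purely illustrative:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, purely for illustration.
X = np.random.randn(1000, 20)
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

best_model, best_score = None, -np.inf
for seed in range(5):
    # Each seed gives a different weight initialization, so gradient descent
    # may settle in a different local minimum.
    model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=seed)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_model, best_score = model, score

print(f"Best validation accuracy across restarts: {best_score:.3f}")
```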

  • Why is feature scaling important in machine learning?

    -Feature scaling is important because it helps in normalizing the range of independent variables or features of data. This ensures that the features contribute equally to the result and helps in faster convergence of gradient-based machine learning algorithms.
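
A minimal sketch of standardization with scikit-learn, assuming the common practice of fitting the scaler on the training split only (a detail not spelled out in the answer), could look like this:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (e.g., age in years, income in dollars).
X_train = np.array([[25, 40_000.0], [32, 85_000.0], [47, 120_000.0]])
X_test = np.array([[29, 52_000.0]])

scaler = StandardScaler()
# Fit the scaler on the training data only, then apply the same transform
# to other splits, so no information leaks into training.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # approximately 0 per feature
print(X_train_scaled.std(axis=0))   # approximately 1 per feature
```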

  • What is the difference between classification and regression in machine learning?

    -Classification predicts a discrete outcome, often a category such as yes or no, while regression predicts a continuous numerical value. The choice between them depends on the nature of the problem and the type of outcome variable being predicted.
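
For instance, the height example from the interview can be framed either way; the sketch below bins a continuous target into categories with NumPy, using arbitrary bin edges chosen only for illustration:

```python
import numpy as np

# Heights in centimetres: a continuous target suitable for regression.
heights = np.array([152.0, 168.5, 175.2, 181.0, 193.4])

# The same outcome reframed for classification by binning into ranges.
bins = [160, 180]  # < 160 -> "low", 160-180 -> "medium", >= 180 -> "high"
labels = np.array(["low", "medium", "high"])
classes = labels[np.digitize(heights, bins)]

print(classes)  # ['low' 'medium' 'medium' 'high' 'high']
```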

  • How can you tell when it's time to refresh a machine learning model in production?

    -A model may need to be refreshed when its performance degrades, which can be detected by monitoring metrics like precision, recall, loss, or accuracy. If the performance in production does not match the training performance, it might be time to update the model.
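
One hedged way to operationalize this, assuming ground-truth labels do arrive for some production traffic, is to compare live precision and recall against the training-time baseline with a tolerance; the threshold, metrics, and made-up numbers below are illustrative, not prescribed in the interview:

```python
from sklearn.metrics import precision_score, recall_score

def needs_refresh(y_true, y_pred, baseline_precision, baseline_recall, tolerance=0.05):
    """Flag a refresh when live metrics fall meaningfully below the training baseline.

    Assumes ground-truth labels are available for recent production traffic,
    which, as noted above, is not always the case.
    """
    live_precision = precision_score(y_true, y_pred)
    live_recall = recall_score(y_true, y_pred)
    return (baseline_precision - live_precision > tolerance
            or baseline_recall - live_recall > tolerance)

# Illustrative usage with made-up labels and baselines.
flag = needs_refresh(
    y_true=[1, 0, 1, 1, 0, 1],
    y_pred=[1, 0, 0, 1, 1, 1],
    baseline_precision=0.90,
    baseline_recall=0.85,
)
print(flag)
```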

  • What is concept drift, and how can it affect a machine learning model's performance?

    -Concept drift refers to changes in the relationship between input features and the outcome variable over time. This shift in the underlying data distribution can cause a model's performance to degrade as the assumptions it was trained on no longer hold true.
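
When labels are not available, a common proxy (in the spirit of the distribution monitoring mentioned above, though it detects input or prediction shift rather than concept drift directly) is a two-sample test between a reference window and recent traffic; this SciPy sketch uses synthetic data and an arbitrary significance threshold:

```python
import numpy as np
from scipy.stats import ks_2samp

# Reference window (e.g., the training data) and a recent production window
# for a single input feature; synthetic numbers purely for illustration.
reference = np.random.normal(loc=0.0, scale=1.0, size=5_000)
recent = np.random.normal(loc=0.4, scale=1.0, size=5_000)  # shifted distribution

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the feature's
# distribution has shifted, which is a signal (not proof) of drift.
stat, p_value = ks_2samp(reference, recent)
if p_value < 0.01:
    print(f"Possible drift detected (KS statistic={stat:.3f}, p={p_value:.2e})")
```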

  • How can exploding gradients be managed during the training of neural networks?

    -Exploding gradients can be managed by gradient clipping, which limits the value of gradients to a certain threshold, or by using batch normalization to stabilize the gradients. Additionally, adjusting the network architecture, such as reducing the number of layers or using skip connections, can help mitigate this issue.
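
A minimal PyTorch sketch of the gradient-clipping option, using a placeholder model and an arbitrary clipping threshold, might look like this:

```python
import torch
import torch.nn as nn

# A small placeholder network, optimizer, and data, purely for illustration.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x = torch.randn(64, 10)
y = torch.randn(64, 1)

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Clip the global gradient norm to a threshold (1.0 here, chosen arbitrarily)
    # before the optimizer step, so a single bad batch cannot destabilize training.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```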

Outlines

00:00

Introduction to Machine Learning Fundamentals

The video script begins with an introduction to the concepts of training and testing data in machine learning. Raj, a machine learning engineer at Snapchat, explains that training data is used by algorithms to learn patterns and minimize error, while testing data assesses the algorithm's performance on unseen data. Hyperparameters like the number of layers or learning rate in neural networks are tuned using a validation set, separate from the training data. The importance of using different data sets to avoid biased performance evaluation is highlighted.

05:02

๐Ÿ” Deeper Dive into Training Data and Model Evaluation

This paragraph delves into the specifics of model training and evaluation. It discusses the use of batch gradient descent, mini-batch gradient descent, and stochastic gradient descent as optimization techniques. The differences between these methods in terms of how they handle the training set for parameter updates are explained. The paragraph also touches on the choice of optimization algorithm based on memory requirements and the potential for overfitting when the training set's order influences the model.

10:05

Handling Exploding Gradients and Model Deployment

The script addresses the challenge of exploding gradients in neural networks during backpropagation and offers solutions such as gradient clipping, batch normalization, and architectural choices like reducing the number of layers or using skip connections in architectures like the Transformer. It also covers the considerations for model deployment, including monitoring performance metrics and refreshing models when there's a significant deviation from the training performance.

15:06

๐Ÿ› ๏ธ Model Performance and Concept Drift

The final paragraph discusses reasons for discrepancies in model performance between development and production environments, such as concept drift where the underlying data distribution changes. It emphasizes the importance of monitoring data and prediction distributions, as well as confidence scores, to detect when a model's performance degrades. The conversation wraps up with a reflection on the interview, suggesting the inclusion of a case study for a more applied perspective on the discussed topics.

Keywords

Training Data

Training data refers to the dataset used by a machine learning algorithm to learn patterns and make predictions. It is the foundation upon which the algorithm builds its understanding. In the script, Raj explains that the parameters of algorithms like logistic regression are chosen to minimize error on the training set, emphasizing its importance in the learning process.

Testing Data

Testing data is a separate dataset that the algorithm does not see during training. It is used to evaluate the performance of the trained model to ensure that it can generalize well to new, unseen data. The script mentions the importance of not gauging the model's performance on the same data it was trained on, highlighting the need for an unbiased evaluation.

Hyperparameters

Hyperparameters are configuration settings of a machine learning algorithm that are set prior to the start of the training process. They include parameters like the number of layers in a neural network or the learning rate. The script discusses how a validation set, a subset of the training data, is used to tune these hyperparameters to maximize performance.

Validation Set

A validation set is a portion of the training data used for tuning hyperparameters. It helps in selecting the best model configuration without affecting the learning process of the algorithm itself. The script explains the role of the validation set in the model development process, ensuring that the final model is not influenced by the data used for hyperparameter tuning.

Gradient Descent

Gradient descent is an optimization technique used to minimize a loss function by iteratively moving in the direction of the steepest descent as defined by the negative of the gradient. The script describes different variations of gradient descent, such as batch, mini-batch, and stochastic gradient descent, and their applications in machine learning.

Batch Gradient Descent

Batch gradient descent is a variant of the gradient descent algorithm where the entire training set is used to compute the gradient and update the parameters in one go. The script mentions this method in the context of optimization algorithms, explaining its memory requirements and computational intensity.

Mini-batch Gradient Descent

Mini-batch gradient descent divides the training set into smaller subsets, or mini-batches, and computes the gradient for each batch to update the model parameters. The script explains that this method adds noise to the gradient computation, which can help prevent overfitting and is more memory efficient than batch gradient descent.

Stochastic Gradient Descent

Stochastic gradient descent involves shuffling the training set and computing the gradient using small random batches of data to update the model parameters. The script discusses this method as a way to introduce randomness in the training process, which can help avoid overfitting and is computationally efficient.

Feature Scaling

Feature scaling is a preprocessing step where features of the data are scaled to a common range or unit, often improving the performance of gradient-based machine learning algorithms. The script highlights the importance of feature scaling in stabilizing gradient descent and speeding up the convergence of the algorithm.

Classification

Classification is a type of supervised learning where the algorithm predicts a category or class for the input data. The script differentiates classification from regression by explaining that classification predicts discrete labels, such as 'yes' or 'no', while regression predicts continuous values.

Regression

Regression is another type of supervised learning where the algorithm predicts a continuous numerical value for the input data. The script uses the example of predicting a person's height to illustrate how regression is used to forecast continuous outcomes.

Concept Drift

Concept drift refers to the change in the statistical properties of the input and output variables over time, which can lead to a degradation in the performance of a machine learning model. The script discusses concept drift as a potential reason for model performance differences between development and production environments.

Exploding Gradients

Exploding gradients occur when the gradients in a neural network become excessively large during backpropagation, causing instability in the training process. The script suggests methods to handle this issue, such as gradient clipping, batch normalization, and architectural changes to mitigate the problem.

Model Refresh

Model refresh refers to the process of updating or retraining a machine learning model to maintain its performance as new data becomes available or when the underlying patterns in the data change. The script touches on the importance of monitoring model performance and refreshing the model when its performance degrades.

Highlights

Training data is used by a machine learning algorithm to learn patterns, while testing data evaluates the algorithm's performance without prior exposure.

Hyperparameters like the number of layers or learning rate in a neural network are tuned using a validation set derived from the training data.

Batch gradient descent, mini-batch gradient descent, and stochastic gradient descent differ in how the training set is divided for computing gradients and updating parameters.

Memory requirements often drive the choice between batch, mini-batch, and stochastic gradient descent due to the size of datasets.

Optimization algorithms do not guarantee reaching a global minimum in non-convex loss functions, often converging to local minima or saddle points.

Feature scaling is crucial for algorithms that use gradient-based updating to ensure stability and faster convergence.

The choice between classification and regression depends on the type of outcome predicted, with classification predicting categories and regression predicting continuous values.

A problem can be formulated as either classification or regression, depending on how the outcome variable is treated.

Model refresh in production is triggered by a degradation in performance, benchmarked against the initial training set performance.

Monitoring data distributions and prediction confidence scores can indicate when a model in production needs refreshing.

Concept drift, where the relationship between input features and outcomes changes, is a common reason for model performance decline in production.

Exploding gradients in neural networks can be mitigated by gradient clipping, batch normalization, or architectural changes.

Different models may have varying sensitivity to distribution drift, affecting their robustness in production environments.

Practical machine learning solutions often require domain-specific insights and cannot rely solely on one-size-fits-all approaches.

Incorporating concrete examples or case studies can enhance understanding of machine learning theories and their applications.

The interview covered a broad range of topics in machine learning, providing a comprehensive overview of key concepts and practices.

The discussion on formulation of problems and the differences between classification and regression provided valuable insights into machine learning approaches.

The interview emphasized the importance of considering the specific domain and problem when applying machine learning techniques.

Transcripts

Interviewer: I'd love it if you could tell me about the terms training data and testing data in the context of machine learning.

Interviewer: Okay, thank you so much for being here with us today, Raj. Can you quickly introduce yourself to our viewers?

Raj: Yeah, absolutely, thanks so much for having me. My name is Raj, and I'm currently a machine learning engineer at Snapchat, where I work on a lot of stuff related to generative AI initiatives at the company.

Interviewer: That's really cool. Generative AI is really popular nowadays, and I feel like there are so many cool products being built with it, so I'm really interested to hear the insights you have in our interview today. So to get started, I'd love it if you could tell me about the terms training data and testing data in the context of machine learning.

Raj: Yeah, absolutely. Training data generally refers to the portion of the data that a machine learning algorithm uses to learn patterns. So, for example, the parameters of a logistic regression algorithm can be chosen such that the error is minimized on the training set. The testing set is data that is not seen by the actual algorithm and is used purely to gauge the algorithm's performance.

Interviewer: Yeah, that makes sense, because you don't want to gauge the model's performance on the same data it was trained on. So what I'm curious about then is, you mentioned that you want to minimize the error on the training data set. What if your algorithm involves parameters that you need to tune, for example the number of layers, the size of your neural network, or the learning rate? If you need to tune those parameters, do you tune them on the training data? What do you do instead?

Raj: Yeah, so generally those are called hyperparameters, and what people will typically do is take out a portion of the training data and call it a validation set, and then they will tune those hyperparameters to maximize the performance on the validation set.

Interviewer: Okay, perfect. So now you have a training data set and a validation data set. Then, again, which data set do you actually evaluate the final model on?

Raj: Typically you will evaluate it on your test data set at the final step. Usually you should only be using your validation set to tune these hyperparameters, but you should have this holdout set that never contributes any information used to inform the learning of the actual algorithm or the hyperparameters.

Interviewer: Yeah. So then, moving on, speaking of training a model, there are many different optimization algorithms for doing so, right? Could you tell me about the differences between some of them, specifically between batch gradient descent, mini-batch gradient descent, and stochastic gradient descent?

Raj: Sure, yeah. So firstly, gradient descent is an optimization technique, like you mentioned, that is used to find the minimum of a loss function. Specifically, the gradient can be calculated by taking the derivative of the loss with respect to the parameters of a particular algorithm, and since the gradient represents the direction of the steepest descent, it can be used to take gradual steps towards the minimum of that loss function. Back to your question about the differences: those terms refer to different but related ways of dividing up the training set, computing the gradient, and then performing the parameter updates. Batch gradient descent is when you use the entire training set in one go, compute the gradient, and then do a single step of gradient descent. Mini-batch gradient descent is when you divide the training set into what are called mini-batches, typically choosing a batch size, and then you separately compute the gradients on those mini-batches and take a step in that direction for each one. Stochastic gradient descent is related to both batch and mini-batch gradient descent, but it mostly refers to shuffling the training set randomly; then, similarly, you divide it into smaller batches, compute the gradients on those batches, and perform the respective parameter updates.

Interviewer: Okay, yeah, that generally makes sense. So I'm curious then, when might you choose to use, for example, full-batch gradient descent versus mini-batch versus stochastic? Why are there different ones, and why do people choose a specific one to use?

Raj: Yeah, so people typically choose to split the data into batches because of memory requirements. If you have a data set with millions of data points, for example, you usually cannot fit it all into memory when actually doing gradient descent, so in practice people will divide it up into mini-batches so it can fit into the RAM of a GPU, let's say, and then periodically you can compute these updates and gradually lower the loss function. Mini-batch gradient descent is also used as a regularizer to prevent overfitting on the training set, because it adds a little bit of noise to the gradient computed on these mini-batches. As for the stochastic part of it: let's say you have a particular training set with patterns underlying the order of the training set. You don't want to overfit the training of your model to any ordering that happens to be present in your training set, so people use stochastic gradient descent to make sure that the shuffling removes the order within the training data set as a variable.

Interviewer: Okay, perfect, that makes a lot of sense, because with deep learning in particular there are often very strict memory requirements and these models are typically quite large, so it makes sense that we would have variations that account for that. I'm also curious then: you mentioned that we use these optimization algorithms to try to decrease the loss, so I would assume you want the loss to reach some kind of minimum, but a lot of loss functions encountered nowadays are actually non-convex. Can you tell me, are any of these optimization algorithms that we just talked about guaranteed to reach a global minimum?

Raj: In the case of a non-convex function, they are not guaranteed to reach a global minimum. In fact, they usually don't reach a global minimum. Neural networks usually have a lot of different minima, so training will usually converge to some sort of local minimum or possibly a saddle point.

Interviewer: Okay, yeah. And if the algorithm reaches a local minimum instead, are there issues with that? Is that generally still a good model? What do you think?

Raj: So, you know, it depends; it could still be a good model. You might want to try different parameter initialization techniques to see if you're able to get to different minima with the actual model that you're training. However, that depends on factors such as how it performs on your validation set and ultimately on your test set. It kind of just depends how exactly you'd like to go about it, but one way of potentially getting to a new minimum is using a different parameter initialization technique.

Interviewer: Now that you've started training, you need to actually prepare your data for training, and when you're preparing the data, something that people often do is feature scaling or normalization. Can you tell me a little bit more about the importance of these particular pre-processing steps in machine learning?

Raj: Yeah, so feature scaling is really important for training machine learning algorithms that do gradient-based updating, like we were just discussing. The reasoning behind that is that features often have different orders of magnitude, and so the derivatives of the loss with respect to those inputs will be on different scales as well. When they are on different scales, gradient descent on unnormalized features tends to be unstable and converge slower, so feature scaling can be a way of getting an algorithm to converge faster.

Interviewer: So now that you've prepared your data, let's say you're trying to figure out what type of learning problem you're tackling with your machine learning approach. Some common types of learning problems include classification and regression. Can you tell me a little bit about the differences between those?

Raj: Yeah, so classification and regression refer to the type of outcome predicted by a supervised machine learning algorithm. In the case of classification, it will usually predict some sort of category, in the simplest case a yes or a no, whereas regression will predict some sort of numerical or continuous value, for example a person's height.

Interviewer: Okay. Can you foresee instances where a problem could be either classification or regression, and if so, why might you choose one or the other?

Raj: Sure. So let's say there was a case where the outcome was a numerical variable. Of course you could use regression to formulate that problem; however, you could also bin the different values into different categories. For example, in the case of height, maybe you can bin them based on ranges, so you could have one that says low, one that says medium, one that says high, and then you can turn it into a classification problem. I think the general reasoning is making it easier for the algorithm to distinguish and learn the actual patterns underlying the data. Sometimes, for example in the case of height, the scale is all over the place and there's a wide range you have to be able to predict, so getting the underlying pattern of whether it's in the medium range or the higher range might be easier for the algorithm to learn, and it can also be something that is perhaps more useful for the algorithm to learn. So I think it just depends on the use case you have and what makes the most sense for your particular problem.

Interviewer: Right, and a lot of these intuitive insights you have about the data can be really important when it comes to feature engineering or data pre-processing. So now let's assume your model is fully trained and you've deployed it into production, congratulations. It's been in production for a little while now, you've been monitoring it and measuring various metrics. How might you be able to tell when it's time to actually refresh the model that's in production?

Raj: Yeah, so typically a model will need to be refreshed when there is a degradation in the performance of the algorithm. Generally you will benchmark the performance on some sort of training set, and perhaps at some point you see that the performance in production is not matching up to the performance on the training set. Some of the ways you can tell are basically just using some of the metrics you chose for your initial problem, for example a precision metric or a recall metric, or perhaps the loss or the accuracy. All of that assumes that you have, for example, the ground-truth label for the data coming in in production. That is possible in certain cases, and it can be used as a way to benchmark and see if performance is actually differing from your training performance. It isn't always that straightforward, because you don't always have that source of ground truth in production, so alternative strategies can include monitoring the data distributions of the input features for your model, as well as prediction distributions and confidence scores from the algorithm itself. So it really depends on your use case, and there's no one best way to do it, but there are definitely different ways of helping solve that problem.

Interviewer: Okay, yeah. I like how you mentioned that it's really on a case-by-case basis; you have to really look into the domain specifics of your problem and think about it that way. So can you give me some reasons or insights for why model performance might actually differ in production versus in development?

Raj: Yeah, so there are a lot of possible reasons why this could happen in production, but I can give one example, which is something called concept drift, which is where the relationship between the input features and the actual outcome variable changes. Another way of thinking about it is that typically a supervised machine learning model is represented by the probability distribution of Y given X, so concept drift is when this underlying distribution actually changes, and so all the assumptions from when you trained your model no longer apply. That can often be a common reason why performance isn't matching what you would expect.

Interviewer: Okay, perfect, because if you trained on one set of data and the new set of data is pretty different, it's very possible for your model to just not be trained well on that new distribution. During training itself there can also be a lot of irregularities, so sometimes something known as an exploding gradient, which is when the values of your gradient become really, really large, can cause training instabilities. Can you tell me how you might handle that?

Raj: Yeah, so like you mentioned, exploding gradients really come from backpropagation in a neural network, specifically when there are successive layers in a network for which the gradients need to be computed. Typically those are calculated with the chain rule, which involves multiplying many different gradients together. One way of handling it is to straight up clip the gradients at a certain threshold, kind of like a brute-force approach: if it exceeds this value, it's too much and we don't want it to result in unstable training. That's one way of doing it. You could also use what's become a lot more common in the past few years, which is batch normalization, which is basically applying a type of normalization after a particular layer or activation, using the mean and standard deviation computed on the batch of examples, and this can help scale the gradients to more reasonable, stable values. That's a second way of doing it. People also change or choose their architecture to help mitigate exploding gradients. You could directly reduce the number of hidden layers, which reduces the number of multiplications that need to happen for the chain rule, or you could choose architectures, for example the Transformer, with skip connections. Skip connections are basically pathways from certain layers to layers further down in the network, rather than to the layer that directly follows, and that gives the network a path for the gradient to follow without having to pass through several consecutive layers, which can definitely help mitigate the exploding gradient problem.

Interviewer: Okay, perfect. I like that you suggested a couple of different approaches based both on the model architectures themselves and on the data set. Okay, cool, I think this is a really great place to pause. Thank you so much for answering all these questions today. I'm really curious to hear your insights, though: if you were the interviewer, how did you feel about this interview, what do you think went well, and is there anything you would have done differently?

Raj: I think the interview touched on some really important topics in machine learning, and these are all relevant topics. For example, the exploding gradient is extremely common in training neural networks, and neural networks have obviously become super common in AI; they're very widely used, so I really liked that. I also liked that we touched on some of the basic fundamentals as far as how you formulate a problem. I really liked how we talked about the differences between classification and regression, and also the fact that those can actually be formulated differently; you don't always have to follow a certain format, and it really just differs based on your use case. I think it would have been nice to have some sort of case study, maybe a very mini case study, not something that takes the entire interview, but maybe where we asked a hypothetical scenario and said, okay, what might you do in this case? I think that was present to some extent, but perhaps we could have applied it to a specific domain or a particular company.

Interviewer: Yeah, I agree. Oftentimes I think having that kind of concrete example or case study really helps us understand why the theory applies or why these techniques were invented in the first place. But I do think you actually gave quite a few good, small, concrete examples, for example the height problem we talked about in the case of classification versus regression, and we also talked a little bit about some examples of concept drift, so I thought that was very helpful. And I really liked that a lot of your answers touched on how there's not necessarily a one-size-fits-all machine learning solution; a lot of the time you have to pay attention to your particular domain or the particular problem you're trying to learn, so a lot of it really does depend on looking at your data and thinking about what the model does and what you want it to ideally do for the user. So I thought that was really well done. As for what we might have been able to elaborate more on: for the question where we talked about potential reasons why model performance might differ in production, maybe we could have also talked about how some models are more sensitive than others to distribution drift. Sometimes we also call that OOD generalization, out-of-distribution generalization. Even if a model has only seen, quote unquote, in-distribution data, which is the data you've seen in development, some models may have wider decision boundaries than others, which tends to make them a little bit less sensitive to distribution drift and a little bit more robust in general. So that's something that would have been interesting to touch upon, but in general you covered so many topics so thoroughly that I think we all learned a lot from you today, so thank you for being here.

Raj: Yeah, thanks so much for having me.

Interviewer: Yeah, and thanks, everybody, for watching. If you have any machine learning interviews coming up, good luck. Thank you for watching, bye everyone.

Related Tags
Machine Learning, Model Training, Data Testing, Generative AI, Optimization, Gradient Descent, Feature Scaling, Hyperparameters, Concept Drift, Neural Networks, Production Monitoring