Gradient Descent, Step-by-Step

StatQuest with Josh Starmer
5 Feb 2019 · 23:54

Summary

TLDR: This StatQuest tutorial with Josh Starmer delves into the Gradient Descent algorithm, a fundamental tool in statistics, machine learning, and data science for optimizing parameters. The script guides viewers through the process of using Gradient Descent to find the optimal intercept and slope for a linear regression model, explaining the concept of loss functions and the importance of the learning rate. It also touches on Stochastic Gradient Descent for handling large datasets, providing a comprehensive understanding of the algorithm's application in various optimization scenarios.

Takeaways

  • πŸ“š Gradient Descent is an optimization algorithm used in statistics, machine learning, and data science to find the best parameters for a model.
  • πŸ” It can be applied to various optimization problems, including linear and logistic regression, and even complex tasks like t-SNE for clustering.
  • πŸ“ˆ The algorithm begins with an initial guess for the parameters and iteratively improves upon this guess by adjusting them to minimize the loss function.
  • πŸ“‰ The loss function, such as the sum of squared residuals, measures how well the model fits the data and is crucial for guiding the optimization process.
  • πŸ€” Gradient Descent involves calculating the derivative of the loss function with respect to each parameter to determine the direction and magnitude of the next step.
  • πŸ”’ The learning rate is a hyperparameter that determines the size of the steps taken towards the minimum of the loss function; it's critical for the convergence of the algorithm.
  • πŸšΆβ€β™‚οΈ The process involves taking steps from the initial guess until the step size is very small or a maximum number of iterations is reached, indicating convergence.
  • πŸ“Š Gradient Descent can handle multiple parameters, such as both the slope and intercept in linear regression, by taking the gradient of the loss function.
  • πŸ”¬ The algorithm's efficiency can be improved with Stochastic Gradient Descent, which uses a subset of data to calculate derivatives, speeding up the process for large datasets.
  • πŸ› οΈ Gradient Descent is versatile and can be adapted to different types of loss functions beyond the sum of squared residuals, depending on the nature of the data and the problem.
  • πŸ”š The script provides a step-by-step guide to understanding Gradient Descent, from the initial setup to the iterative process and the conditions for terminating the algorithm.

Q & A

  • What is the main topic of the video script?

    -The main topic of the video script is Gradient Descent, an optimization algorithm used in statistics, machine learning, and data science to estimate parameters by minimizing the loss function.

  • What assumptions does the script make about the viewer's prior knowledge?

    -The script assumes that the viewer already understands the basics of least squares and linear regression.

  • What is the purpose of using a random initial guess in Gradient Descent?

    -The random initial guess provides Gradient Descent with a starting point to begin the optimization process and improve upon iteratively.

  • What is the loss function used in the script to evaluate how well a line fits the data?

    -The loss function used in the script is the sum of the squared residuals, which adds up the squared differences between the observed and predicted values.
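    Written out, with the residual for each point defined as the observed value minus the predicted value:

$$\text{SSR} = \sum_{i=1}^{n}\big(\text{observed}_i - \text{predicted}_i\big)^2$$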

  • How does Gradient Descent differ from the least squares method in finding the optimal value for the intercept?

    -Gradient Descent finds the minimum value by taking iterative steps from an initial guess, whereas least squares solves for the optimal value directly by setting the derivative equal to zero.

  • What is the significance of the learning rate in Gradient Descent?

    -The learning rate determines the size of the steps taken towards the optimal solution. It is crucial for balancing between making large progress quickly and avoiding overshooting the minimum.

  • How does the script illustrate the process of Gradient Descent for finding the optimal intercept?

    -The script illustrates the process by starting with an initial guess for the intercept, calculating the derivative of the loss function, determining the step size using the learning rate, and iteratively updating the intercept until the step size is close to zero.
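    As a minimal Python sketch of that loop, assuming the dataset used in the video (weights 0.5, 2.3, 2.9 and heights 1.4, 1.9, 3.2), the fixed least-squares slope of 0.64, and a learning rate of 0.1 (the variable names are illustrative, not from the video):

```python
# Gradient Descent for the intercept only, with the slope held at 0.64.
weights = [0.5, 2.3, 2.9]
heights = [1.4, 1.9, 3.2]
slope = 0.64          # least-squares estimate, held fixed for this example
learning_rate = 0.1

intercept = 0.0       # initial guess
for step in range(1000):                       # cap on the number of steps
    # Derivative of the sum of squared residuals with respect to the intercept.
    d_intercept = sum(-2 * (h - (intercept + slope * w))
                      for w, h in zip(weights, heights))
    step_size = d_intercept * learning_rate
    intercept -= step_size                     # new intercept = old intercept - step size
    if abs(step_size) < 0.001:                 # step size close to zero: stop
        break

print(intercept)      # converges to roughly 0.95, the least-squares estimate
```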

  • What is the role of the derivative in Gradient Descent?

    -The derivative of the loss function with respect to each parameter indicates the slope of the loss function at the current point, guiding the direction and magnitude of the next step in the optimization process.

  • How does the script explain the concept of a gradient in the context of multiple parameters?

    -The script explains that when there are multiple parameters, the derivatives of the loss function with respect to each parameter form a gradient, which is used to update all parameters simultaneously in the optimization process.

  • What is Stochastic Gradient Descent and why is it used?

    -Stochastic Gradient Descent is a variant of Gradient Descent that uses a randomly selected subset of the data at each step instead of the full dataset. It is used to reduce computation time when dealing with large datasets.

  • How does the script conclude the explanation of Gradient Descent?

    -The script concludes by summarizing the steps of Gradient Descent, mentioning the use of different loss functions, and highlighting the practical aspects of using Gradient Descent with large datasets and the role of Stochastic Gradient Descent.

Outlines

00:00

πŸ“š Introduction to Gradient Descent

This paragraph introduces the concept of Gradient Descent as a method for optimizing parameters in various statistical and machine learning algorithms. It emphasizes the algorithm's versatility, from fitting lines in linear regression to optimizing more complex models. The explanation begins with a simple dataset plotting weight against height and aims to find the optimal intercept and slope for the best fit line using Gradient Descent. The paragraph also explains the initial steps of the algorithm, including selecting a random starting point for the intercept and calculating the sum of squared residuals as a loss function.

05:01

πŸ” Gradient Descent's Efficiency Over Manual Calculation

The second paragraph discusses the inefficiency of manually calculating the optimal intercept by trying numerous values, and introduces Gradient Descent as a more efficient alternative. It describes how Gradient Descent adjusts the steps it takes based on its proximity to the optimal solution, taking larger steps when far from the optimum and smaller steps as it approaches. The explanation includes the process of calculating residuals, the sum of squared residuals, and the derivative of this sum with respect to the intercept, which Gradient Descent uses to iteratively improve its estimate.

10:03

πŸ”’ Understanding Gradient Descent's Step Size and Learning Rate

This paragraph delves into the specifics of how Gradient Descent determines the step size for each iteration, which is based on the derivative (or slope) of the loss function and a small number known as the learning rate. It explains the logic behind taking smaller steps when close to the optimal value to refine the estimate and larger steps when further away. The process of updating the intercept based on the step size is detailed, along with the iterative approach of recalculating the derivative and step size until the change is minimal, indicating convergence.

15:07

πŸ“‰ Deriving Gradient Descent for Both Intercept and Slope

The fourth paragraph extends the Gradient Descent process to include the optimization of both the intercept and the slope of the line. It describes the 3D visualization of the loss function over the intercept and slope parameters and the process of taking partial derivatives with respect to each parameter to create a gradient. The explanation includes the application of the chain rule and treating one parameter as a constant while differentiating with respect to the other. The paragraph outlines the iterative steps of plugging in values for the parameters, calculating step sizes, and updating the parameters until the gradient indicates a minimal loss function value.

20:08

🎯 Conclusion: Gradient Descent's Application and Variants

In the final paragraph, the script wraps up by summarizing the Gradient Descent process for optimizing parameters, emphasizing its sensitivity to the learning rate and the potential for automation in determining an appropriate rate. It also touches on the practical considerations when dealing with large datasets, introducing the concept of Stochastic Gradient Descent, which uses subsets of data to speed up the calculation process. The paragraph concludes with a note on the generalizability of Gradient Descent to various loss functions and a call to action for viewers to subscribe for more content.


Keywords

πŸ’‘Gradient Descent

Gradient Descent is an optimization algorithm used to find the minimum of a function by iteratively moving in the direction of the steepest descent as defined by the negative of the gradient. In the context of the video, it is used to estimate the parameters of a linear regression model, such as the intercept and slope, by minimizing the sum of squared residuals. The script explains how Gradient Descent starts with an initial guess and then refines the parameters step by step.

πŸ’‘Least Squares

Least Squares is a method for finding the line of best fit for a set of data points by minimizing the sum of the squares of the vertical distances (residuals) of the points from the line. The script mentions that the audience should have a basic understanding of least squares before diving into Gradient Descent, as it is a foundational concept for understanding how to optimize parameters in linear regression.

πŸ’‘Linear Regression

Linear Regression is a statistical method for modeling the relationship between a dependent variable and one or more independent variables using a linear equation. The script uses linear regression as an example to demonstrate how Gradient Descent can be applied to optimize the intercept and slope of a line that best fits a given dataset.

πŸ’‘Residual

In the context of regression analysis, a Residual is the difference between the observed values and the values predicted by a model. The script explains how the sum of squared residuals is used as a loss function in Gradient Descent to evaluate the fit of the model to the data and to guide the optimization process.

πŸ’‘Loss Function

A Loss Function is a measure of error used to evaluate how well a model's predictions approximate the actual data points. In the script, the sum of squared residuals is described as a type of loss function, which is used to quantify the performance of the model during the Gradient Descent optimization process.

πŸ’‘Optimization

Optimization in the context of the video refers to the process of finding the best possible values for a model's parameters. The script discusses how Gradient Descent can be used to optimize various aspects of statistical and machine learning models, such as the parameters of a linear regression line.

πŸ’‘Intercept

The Intercept is the point where the line in a linear regression crosses the y-axis. In the script, the process of using Gradient Descent to find the optimal intercept value for a line that fits a dataset is explained, demonstrating how the algorithm iteratively adjusts the intercept to minimize the loss function.

πŸ’‘Slope

The Slope is the rate at which the dependent variable in a linear regression changes with respect to the independent variable. The script describes how, after finding the intercept, Gradient Descent can also be used to find the optimal slope that minimizes the sum of squared residuals.

πŸ’‘Derivative

A Derivative in calculus represents the rate at which a function changes with respect to its variable. The script explains how the derivative of the sum of squared residuals with respect to the intercept is calculated to find the direction and magnitude of the next step in the Gradient Descent process.

πŸ’‘Learning Rate

The Learning Rate is a hyperparameter in Gradient Descent that determines the size of the steps taken towards the minimum of the loss function. The script discusses the importance of selecting an appropriate learning rate, as it can significantly affect the convergence of the Gradient Descent algorithm.

πŸ’‘Stochastic Gradient Descent

Stochastic Gradient Descent is a variant of the Gradient Descent algorithm that uses only a single or a small subset of data points to calculate the gradient at each step, rather than the entire dataset. The script mentions this concept as a way to reduce computation time when dealing with large datasets in the context of Gradient Descent.

Highlights

Gradient Descent is a versatile optimization algorithm used in statistics, machine learning, and data science.

The algorithm optimizes parameters by minimizing a loss function, such as the sum of squared residuals.

Gradient Descent can be applied to various optimization problems beyond linear regression, including logistic regression and t-SNE.

The process begins with selecting a random initial value for the parameter, such as the intercept.

The sum of squared residuals is used as a measure of how well the model fits the data.

The algorithm calculates the residual for each data point, which is the difference between observed and predicted values.

A graph of sum of squared residuals versus different intercept values helps visualize the optimization process.

Gradient Descent is more efficient than manual 'plug and chug' methods for finding minimal loss.

The algorithm takes larger steps when far from the optimal solution and smaller steps as it approaches it.

The derivative of the loss function with respect to the parameter is calculated to guide the optimization steps.

The learning rate, a small number, determines the size of the steps taken by Gradient Descent.

Gradient Descent continues iterating until the step size is very close to zero, indicating proximity to the minimum.

A maximum number of steps can also be set to prevent the algorithm from running indefinitely.

The algorithm can be extended to optimize multiple parameters, such as both intercept and slope in linear regression.

The gradient, a set of derivatives with respect to each parameter, guides the descent in multidimensional parameter spaces.

Stochastic Gradient Descent is a variation that uses a subset of data to speed up the calculation in large datasets.

Gradient Descent's effectiveness is demonstrated through its ability to find the optimal intercept and slope, matching least squares estimates.

The learning rate can significantly affect the performance of Gradient Descent, with different rates needed for different scenarios.

Transcripts

play00:00

Gradient Descent is decent at estimating parameters. StatQuest!

play00:11

Hello!

play00:12

I'm Josh Starmer and welcome to StatQuest.

play00:15

Today we're going to learn about Gradient Descent and we're going to go through the algorithm step by step.

play00:21

Note: this StatQuest assumes you already understand the basics of least squares and

play00:26

linear regression, so if you're not already down with that, check out the Quest.

play00:31

In statistics, machine learning, and other data science fields, we optimize a lot of stuff.

play00:39

When we fit a line with linear regression, we optimize the intercept and the slope.

play00:46

When we use logistic regression, we optimize a squiggle. And when we use t-SNE, we optimize clusters.

play00:55

These are just a few examples of the stuff we optimize, there are tons more.

play01:01

The cool thing is that Gradient Descent can optimize all these things, and much more.

play01:07

So, if we learn how to optimize this line using Gradient Descent, then we'll have

play01:13

learned the strategy that optimizes this squiggle, and these clusters, and many more

play01:18

of the optimization problems we have in statistics, machine learning, and data science.

play01:25

So let's start with a simple data set.

play01:29

On the x-axis we have weight.

play01:32

On the y-axis we have height.

play01:36

If we fit a line to the data and someone tells us that they weigh 1.5, we can use

play01:42

the line to predict that they will be 1.9 tall.

play01:48

So let's learn how Gradient Descent can fit a line to data by finding the optimal values for the intercept and the slope.

play01:56

Actually, we'll start by using Gradient Descent to find the intercept.

play02:01

Then, once we understand how Gradient Descent works, we'll use it to solve for the intercept and the slope.

play02:09

So, for now, let's just plug in the Least Squares estimate for the slope, 0.64, and

play02:17

we'll use Gradient Descent to find the optimal value for the intercept.

play02:22

The first thing we do is pick a random value for the intercept.

play02:27

This is just an initial guess that gives Gradient Descent something to improve upon.

play02:33

In this case, we'll use 0, but any number will do.

play02:38

And that gives us the equation for this line.

play02:42

In this example, we will evaluate how well this line fits the data with the sum of the squared residuals.

play02:49

Note: in machine learning lingo, the sum of the squared residuals is a type of Loss Function.

play02:56

We'll talk more about Loss Functions towards the end of the video.

play03:01

We'll start by calculating this residual.

play03:05

This data point represents a person with weight 0.5 and height 1.4. We get the predicted

play03:14

height, the point on the line, by plugging weight equals 0.5 into the equation for the line.

play03:22

And the predicted height is 0.32.

play03:27

The residual is the difference between the observed height and the predicted height,

play03:32

so we calculate the difference between 1.4 and 0.32, and that gives us 1.1 for the residual.

play03:47

Here's the square of the first residual.

play03:51

The second residual is 0.4, and the third residual is 1.3. In the end, 3.1 is the sum of the squared residuals.
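Putting the rounded residuals quoted above together (the exact unrounded sum comes out essentially the same):

$$\text{SSR} \approx 1.1^2 + 0.4^2 + 1.3^2 \approx 3.1$$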

play04:04

Now, just for fun, we can plot that value on a graph.

play04:10

This graph has the sum of squared residuals on the y-axis, and different values for the intercept on the x-axis.

play04:19

This point represents the sum of the squared residuals when the intercept equals zero.

play04:25

However, if the intercept equals 0.25, then we would get this point on the graph.

play04:32

And if the intercept equals 0.5, then we would get this point.

play04:40

And for increasing values for the intercept we get these points.

play04:45

Of the points that we calculated for the graph, this one has the lowest sum of squared residuals.

play04:52

But is it the best we can do?

play04:55

What if the best value for the intercept is somewhere between these values?

play05:00

A slow and painful method for finding the minimal sum of the squared residuals is

play05:05

to plug and chug a bunch more values for the intercept.

play05:11

Don't despair!

play05:12

Gradient Descent is way more efficient.

play05:15

Gradient Descent only does a few calculations far from the optimal solution, and increases

play05:22

the number of calculations closer to the optimal value.

play05:27

In other words, gradient descent identifies the optimal value by taking big steps

play05:33

when it is far away, and baby steps when it is close.

play05:38

So let's get back to using Gradient Descent to find the optimal value for the intercept, starting from a random value.

play05:44

In this case, the random value was zero.

play05:48

When we calculated the sum of the squared residuals, the first residual was the difference

play05:55

between the observed height, which was 1.4, and the predicted height, which came from the equation for this line.

play06:04

So we replace predicted height with the equation for the line.

play06:09

Since the individual weighs 0.5 we replace weight with 0.5. So, for this individual,

play06:19

this is their observed height and this is their predicted height.

play06:25

Note: we can now plug in any value for the intercept and get a new predicted height.

play06:31

Now let's focus on the second data point.

play06:35

Just like before, the residual is the difference between the observed height, which

play06:40

is 1.9, and the predicted height, which comes from the equation for the line.

play06:47

And since this individual weighs 2.3, we replace weight with 2.3.

play06:55

Now let's focus on the last person.

play06:58

Again, the residual is the difference between the observed height, which is 3.2, and

play07:05

the predicted height, which comes from the equation for the line.

play07:11

And since this person weighs 2.9, we'll replace weight with 2.9.

play07:18

Now we can easily plug in any value for the intercept and get the sum of the squared residuals.

play07:26

Thus, we now have an equation for this curve, and we can take the derivative of this

play07:32

function and determine the slope at any value for the intercept.
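Written out with the three data points and the fixed slope of 0.64, the curve being described is:

$$\text{SSR}(\text{intercept}) = \big(1.4 - (\text{intercept} + 0.64\cdot 0.5)\big)^2 + \big(1.9 - (\text{intercept} + 0.64\cdot 2.3)\big)^2 + \big(3.2 - (\text{intercept} + 0.64\cdot 2.9)\big)^2$$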

play07:37

So let's take the derivative of the sum of the squared residuals with respect to the intercept.

play07:44

The derivative of the sum of the squared residuals with respect to the intercept equals

play07:50

the derivative of the first part, plus the derivative of the second part, plus the derivative of the third part.

play07:59

Let's start by taking the derivative of the first part.

play08:03

First, we'll move this part of the equation up here, so that we have room to work.

play08:09

To take the derivative of this we need to apply the chain rule.

play08:15

So we start by moving the square to the front and multiply that by the derivative of the stuff inside the parentheses.

play08:26

These parts don't contain a term for the intercept, so they go away.

play08:32

Then we simplify by multiplying two by negative one.

play08:37

And this is the derivative of the first part, so we plug it in.

play08:44

Now we need to take the derivative of the next two parts.

play08:48

I'll leave that as an exercise for the viewer.
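For reference, working that exercise out gives the full derivative (with the slope fixed at 0.64):

$$\frac{d\,\text{SSR}}{d\,\text{intercept}} = -2\big(1.4 - (\text{intercept} + 0.64\cdot 0.5)\big) - 2\big(1.9 - (\text{intercept} + 0.64\cdot 2.3)\big) - 2\big(3.2 - (\text{intercept} + 0.64\cdot 2.9)\big)$$

Plugging in intercept = 0 gives approximately negative 5.7, matching the slope quoted a few steps below.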

play08:52

Bam!

play08:55

Let's move the derivative up here, so that it's not taking up half the screen.

play09:00

Now that we have the derivative, Gradient Descent will use it to find where the sum of squared residuals is lowest.

play09:08

Note: if we were using least squares to solve for the optimal value for the intercept,

play09:13

we would simply find where the slope of the curve equals zero.

play09:18

In contrast, Gradient Descent finds the minimum value by taking steps from an initial guess until it reaches the best value.

play09:27

This makes Gradient Descent very useful when it is not possible to solve for where

play09:31

the derivative equals zero. And this is why Gradient Descent can be used in so many different situations.

play09:39

Bam!

play09:41

Remember, we started by setting the intercept to a random number.

play09:45

In this case that was zero.

play09:48

So we plug zero into the derivative and we get negative 5.7. So, when the intercept

play09:55

equals 0, the slope of the curve equals negative 5.7. Note: the closer we get to

play10:03

the optimal value for the intercept, the closer the slope of the curve gets to zero.

play10:09

This means that when the slope of the curve is close to zero, then we should take

play10:14

baby steps, because we are close to the optimal value.

play10:18

And when the slope is far from zero, then we should take big steps because we are far from the optimal value.

play10:27

However, if we take a super, huge step, then we would increase the sum of the squared residuals.

play10:35

So the size of the step should be related to the slope, since it tells us if we should take a baby step or a big step.

play10:43

But we need to make sure the big step is not too big.

play10:47

Gradient Descent determines the step size by multiplying the slope by a small number called the learning rate.

play10:55

Note: we'll talk more about learning rates later.

play11:00

When the intercept equals 0, the step size equals negative 0.57. With the step size, we can calculate a new intercept.

play11:11

The new intercept is the old intercept minus the step size.

play11:16

So we plug in the numbers, and the new intercept equals 0.57.
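With the learning rate of 0.1 that the video uses for this example, the arithmetic for this first step is:

$$\text{step size} = \text{slope} \times \text{learning rate} = -5.7 \times 0.1 = -0.57, \qquad \text{new intercept} = 0 - (-0.57) = 0.57$$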

play11:23

Bam!

play11:25

In one big step, we moved much closer to the optimal value for the intercept.

play11:31

Going back to the original data and the original line, with the intercept equals 0,

play11:36

we can see how much the residuals shrink when the intercept equals 0.57.

play11:43

Now let's take another step closer to the optimal value for the intercept. To take

play11:49

another step, we go back to the derivative and plug in the new intercept, and that

play11:54

tells us the slope of the curve equals negative 2.3.

play12:00

Now let's calculate the step size.

play12:03

By plugging in negative 2.3 for the slope, and 0.1 for the learning rate, ultimately

play12:10

the step size is negative 0.23. And the new intercept equals 0.8.

play12:18

Now we can compare the residuals when the intercept equals 0.57 to when the intercept equals 0.8.

play12:27

Overall the sum of the squared residuals is getting smaller.

play12:33

Notice that the first step was relatively large, compared to the second step.

play12:38

Now let's calculate the derivative at the new intercept: and we get negative 0.9.

play12:44

The step size equals negative 0.09, and the new intercept equals 0.89.

play12:54

Now we increase the intercept from 0.8 to 0.89, then we take another step and the

play13:00

new intercept equals 0.92. And then we take another step, and the new intercept equals

play13:08

0.94. And then we take another step, and the new intercept equals 0.95.

play13:16

Notice how each step gets smaller and smaller the closer we get to the bottom of the curve.

play13:23

After six steps, the Gradient Descent estimate for the intercept is 0.95.

play13:30

Note: the least squares estimate for the intercept is also 0.95.

play13:37

So we know that Gradient Descent has done its job, but without comparing its solution

play13:41

to a gold standard, how does gradient descent know to stop taking steps?

play13:47

Gradient Descent stops when the step size is very close to zero.

play13:52

The step size will be very close to zero when the slope is very close to zero.

play13:58

In practice, the minimum step size equals 0.001 or smaller.

play14:05

So if this slope equals 0.009, then we would plug in 0.009 for the slope and 0.1 for

play14:15

the learning rate and get 0.0009, which is smaller than 0.001, so gradient descent would stop.

play14:27

That said, gradient descent also includes a limit on the number of steps it will take before giving up.

play14:34

In practice, the maximum number of steps equals 1000 or greater.

play14:39

So, even if the step size is large, if there have been more than the maximum number of steps, gradient descent will stop.
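As a tiny sketch, the two stopping conditions just described could be checked like this (0.001 and 1000 are the typical values the video quotes; the function name is illustrative):

```python
MIN_STEP_SIZE = 0.001   # "very close to zero"
MAX_STEPS = 1000        # give up after this many steps

def should_stop(step_size: float, steps_taken: int) -> bool:
    """Stop when the step size is tiny or too many steps have been taken."""
    return abs(step_size) < MIN_STEP_SIZE or steps_taken >= MAX_STEPS
```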

play14:49

Okay, let's review what we've learned so far.

play14:53

The first thing we did is decide to use the sum of the squared residuals as the loss

play14:57

function to evaluate how well a line fits the data.

play15:02

Then, we took the derivative of the sum of the squared residuals. In other words,

play15:07

we took the derivative of the loss function.

play15:10

Then, we picked a random value for the intercept, in this case we set the intercept to be equal to zero.

play15:17

Then, we calculated the derivative when the intercept equals zero, plugged that slope

play15:23

into the step size calculation, and then calculated the new intercept, the difference

play15:28

between the old intercept and the step size.

play15:32

Lastly, we plugged the new intercept into the derivative and repeated everything until step size was close to zero.

play15:40

Double bam!

play15:44

Now that we understand how gradient descent can calculate the intercept, let's talk

play15:49

about how to estimate the intercept and the slope.

play15:53

Just like before, we'll use the sum of the squared residuals as the loss function.

play15:59

This is a 3D graph of the loss function for the different values for the intercept and the slope.

play16:06

This axis is the sum of the squared residuals, this axis represents different values

play16:13

for the slope, and this axis represents different values for the intercept.

play16:19

We want to find the values for the intercept and slope that give us the minimum sum of the squared residuals.

play16:26

So, just like before, we need to take the derivative of this function.

play16:32

And just like before, we'll take the derivative with respect to the intercept.

play16:38

But, unlike before, we'll also take the derivative with respect to the slope.

play16:44

We'll start by taking the derivative with respect to the intercept.

play16:48

Just like before, we'll take the derivative of each part.

play16:53

And, just like before, we'll use the chain rule and move the square to the front,

play16:59

and multiply that by the derivative of the stuff inside the parentheses.

play17:07

Since we are taking the derivative with respect to the intercept, we treat

play17:12

the slope like a constant, and the derivative of a constant is zero.

play17:17

So, we end up with negative one, just like before.

play17:23

Then we simplify by multiplying two by negative one.

play17:28

And this is the derivative of the first part.

play17:32

So we plug it in.

play17:35

Likewise, we replace these terms with their derivatives.

play17:39

So this whole thing is the derivative of the sum of squared residuals with respect to the intercept.

play17:46

Now let's take the derivative of the sum of the squared residuals with respect to the slope.

play17:52

Just like before, we take the derivative of each part and, just like before, we'll

play17:59

use the chain rule to move the square to the front and multiply that by the derivative of the stuff inside the parentheses.

play18:11

Since we are taking the derivative with respect to the slope, we treat the intercept

play18:16

like a constant and the derivative of a constant is zero.

play18:21

So we end up with negative 0.5. Then we simplify by moving the negative 0.5 to the front.

play18:32

Note: I left the 0.5 in bold instead of multiplying it by 2 to remind us that 0.5 is the weight for the first sample.

play18:43

And this is the derivative of the first part.

play18:48

So we plug it in.

play18:50

Likewise, we replace these terms with their derivatives.

play18:56

Again, 2.3 and 2.9 are in bold to remind us that they are the weights of the second and third samples.

play19:05

Here's the derivative of the sum of the squared residuals with respect to the intercept,

play19:10

and here's the derivative with respect to the slope.
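Written out for a dataset of weights and heights, the two derivatives that make up the gradient are:

$$\frac{\partial\,\text{SSR}}{\partial\,\text{intercept}} = \sum_{i} -2\big(\text{height}_i - (\text{intercept} + \text{slope}\cdot\text{weight}_i)\big)$$

$$\frac{\partial\,\text{SSR}}{\partial\,\text{slope}} = \sum_{i} -2\,\text{weight}_i\,\big(\text{height}_i - (\text{intercept} + \text{slope}\cdot\text{weight}_i)\big)$$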

play19:14

Note: when you have two or more derivatives of the same function they are called a gradient.

play19:21

We will use this gradient to descend to the lowest point in the loss function, which,

play19:26

in this case, is the sum of the squared residuals.

play19:30

Thus, this is why the algorithm is called Gradient Descent.

play19:35

Bam!

play19:37

Just like before, we'll start by picking a random number for the intercept. In this

play19:42

case, we'll set the intercept to be equal to zero, and we'll pick a random number for the slope.

play19:48

In this case we'll set the slope to be 1.

play19:52

Thus, this line, with intercept equals 0 and slope equals 1, is where we will start.

play20:00

Now, let's plug in 0 for the intercept and 1 for the slope.

play20:05

And that gives us two slopes.

play20:08

Now, we plug the slopes into the step size formulas, and multiply by the learning rate, which this time we set to 0.01.

play20:19

Note: The larger learning rate that we used in the first example doesn't work this time.

play20:25

Even after a bunch of steps, Gradient Descent doesn't arrive at the correct answer.

play20:30

This means that Gradient Descent can be very sensitive to the learning rate.

play20:35

The good news is that, in practice, a reasonable learning rate can be determined automatically

play20:41

by starting large and getting smaller with each step.

play20:45

So, in theory, you shouldn't have to worry too much about the learning rate.
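One simple way to realize "starting large and getting smaller with each step" is a decay schedule like the sketch below; the 1/(1 + decay·step) form and the function name are just one common choice, not something the video prescribes:

```python
def decayed_learning_rate(initial_rate: float, step: int, decay: float = 0.1) -> float:
    """Start with a large learning rate and shrink it as the steps accumulate."""
    return initial_rate / (1.0 + decay * step)

# Example: starting at 0.1, the rate has dropped to 0.05 by step 10.
```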

play20:51

Anyway, we do the math and get two step sizes.

play20:55

Now we calculate the new intercept and new slope by plugging in the old intercept and the old slope, and the step sizes.

play21:05

And we end up with a new intercept and a new slope.

play21:10

This is the line we started with and this is the new line after the first step.

play21:17

Now we just repeat what we did until all of the step sizes are very small, or we reach the maximum number of steps.

play21:26

This is the best fitting line, with intercept equals 0.95 and slope equals 0.64, the same values we get from least squares.

play21:37

Double bam!

play21:40

We now know how Gradient Descent optimizes two parameters, the slope and the intercept.

play21:46

If we had more parameters then we just take more derivatives and everything else stays the same.

play21:53

Triple bam!

play21:55

Note: the sum of the squared residuals is just one type of Loss Function.

play22:01

However, there are tons of other loss functions that work with other types of data.

play22:07

Regardless of which Loss Function you use, Gradient Descent works the same way.

play22:13

Step 1: take the derivative of the loss function for each parameter in it.

play22:19

In fancy machine learning lingo, take the gradient of the loss function.

play22:24

Step 2: pick random values for the parameters.

play22:29

Step 3: plug the parameter values into the derivatives (ahem, the gradient).

play22:36

Step 4: calculate the step sizes.

play22:40

Step 5: calculate the new parameters.

play22:44

Now go back to step 3 and repeat until step size is very small or you reach the maximum number of steps.
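A compact Python sketch of those five steps for the two-parameter example in this video might look like the following; the data, the 0.01 learning rate, and the starting values of 0 and 1 come from the video, while the stricter 1e-5 stopping threshold (rather than 0.001) is only there so the printed values land right on 0.95 and 0.64:

```python
weights = [0.5, 2.3, 2.9]
heights = [1.4, 1.9, 3.2]
learning_rate = 0.01

# Step 2: pick starting values for the parameters.
intercept, slope = 0.0, 1.0

for step in range(10_000):                     # cap on the number of steps
    # Steps 1 and 3: plug the current parameters into the gradient.
    d_intercept = sum(-2 * (h - (intercept + slope * w))
                      for w, h in zip(weights, heights))
    d_slope = sum(-2 * w * (h - (intercept + slope * w))
                  for w, h in zip(weights, heights))
    # Step 4: calculate the step sizes.
    step_intercept = d_intercept * learning_rate
    step_slope = d_slope * learning_rate
    # Step 5: calculate the new parameters.
    intercept -= step_intercept
    slope -= step_slope
    if max(abs(step_intercept), abs(step_slope)) < 1e-5:
        break

print(intercept, slope)   # approximately 0.95 and 0.64, the least-squares values
```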

play22:52

One last thing before we're done. In our example we only had three data points, so the math didn't take very long.

play23:02

But when you have millions of data points it can take a long time.

play23:06

So there is a thing called Stochastic Gradient Descent that uses a randomly selected

play23:12

subset of the data at every step rather than the full data set.

play23:16

This reduces the time spent calculating the derivatives of the loss function.
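As a rough sketch of that idea (the batch size of 1 and the helper name here are illustrative choices, not something the video specifies), each step would compute the derivatives on a randomly chosen subset of the data instead of every point:

```python
import random

def sgd_step(intercept, slope, weights, heights, learning_rate=0.01, batch_size=1):
    """One Stochastic Gradient Descent step using a random subset of the data."""
    batch = random.sample(range(len(weights)), k=batch_size)
    d_intercept = sum(-2 * (heights[i] - (intercept + slope * weights[i]))
                      for i in batch)
    d_slope = sum(-2 * weights[i] * (heights[i] - (intercept + slope * weights[i]))
                  for i in batch)
    return intercept - learning_rate * d_intercept, slope - learning_rate * d_slope
```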

play23:22

That's all.

play23:24

Stochastic Gradient Descent sounds fancy, but it's no big deal.

play23:29

Hooray!

play23:30

We've made it to the end of another exciting StatQuest.

play23:33

If you like this StatQuest and want to see more, please subscribe.

play23:37

And if you want to support StatQuest, well, consider buying one or two of my original

play23:42

songs, or buying a StatQuest t-shirt or hoodie.

play23:45

The links are in the description below.

play23:48

Alright, until next time.

play23:50

Quest on!


Related Tags
Gradient Descent, Machine Learning, Optimization, Linear Regression, Loss Function, Data Science, Statistical Analysis, Algorithm Tutorial, Learning Rate, Stochastic Method