Machine Learning Tutorial Python - 4: Gradient Descent and Cost Function

codebasics
21 Jul 2018 · 28:25

Summary

TL;DR: This script offers an insightful walkthrough of key machine learning concepts, focusing on the Mean Square Error (MSE) cost function, gradient descent, and the significance of the learning rate. The tutorial aims to demystify the mathematical underpinnings often encountered in machine learning, emphasizing a step-by-step approach to grasp these concepts without being overwhelmed. By the end, viewers are guided to implement a Python program that employs gradient descent to find the best fit line for a given dataset, illustrating the process through visual representations and debugging techniques. The practical exercise involves analyzing the correlation between math and computer science scores of students, applying the gradient descent algorithm to determine the optimal values of m and b, and ceasing iterations once a predefined cost threshold is met, showcasing the algorithm's convergence to the global minimum.

Takeaways

  • 📘 Mean Square Error (MSE) is a popular cost function used in machine learning to measure the average squared difference between the estimated values and the actual values.
  • 🔍 Gradient descent is an optimization algorithm used to find the best fit line by iteratively adjusting the parameters (m and b) to minimize the cost function.
  • 📈 The learning rate is a key parameter in gradient descent that determines the size of the steps taken towards the minimum of the cost function.
  • 📉 MSE is calculated as the sum of the squared differences between the predicted and actual data points, divided by the number of data points (see the sketch just after this list).
  • 📊 To implement gradient descent, one must calculate the partial derivatives of the cost function with respect to each parameter; these point in the direction of steepest ascent, so the algorithm steps in the opposite direction.
  • 🔒 The slope of the tangent at a point on a curve is the derivative at that point, which is used to guide the direction of the step in gradient descent.
  • 🔄 The process of gradient descent involves starting with initial guesses for m and b, then iteratively updating these values based on the partial derivatives and the learning rate.
  • 🔧 Numpy arrays are preferred for this type of computation because of their efficient matrix operations and faster computation compared to regular Python lists.
  • 🔭 Visualizing the gradient descent process helps in understanding how the algorithm moves through parameter space to find the minimum of the cost function.
  • 🔴 The algorithm stops when the cost no longer decreases significantly, indicating that the global minimum has been reached or is very close.
  • 🔬 A little calculus is needed to understand the derivatives and partial derivatives that guide the steps in gradient descent.
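
As a concrete illustration of the MSE takeaway above, here is a minimal sketch in numpy (the sample arrays are placeholders, not data from the video):

```python
import numpy as np

def mse(y_actual, y_predicted):
    # Mean of the squared differences between actual and predicted values.
    return np.mean((y_actual - y_predicted) ** 2)

y_actual = np.array([5, 7, 9, 11, 13])               # placeholder actual values
y_predicted = np.array([4.8, 7.1, 9.2, 10.7, 13.3])  # placeholder predictions
print(mse(y_actual, y_predicted))                    # 0.054
```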

Q & A

  • What is the primary goal of using a mean square error cost function in machine learning?

    -The primary goal of using a mean square error cost function is to determine the best fit line for a given dataset by minimizing the average of the squares of the errors or deviations between the predicted and actual data points.

  • How does gradient descent help in finding the best fit line?

    -Gradient descent is an optimization algorithm that iteratively adjusts the parameters (m and b in the linear equation y = mx + b) to minimize the cost function. It does this by calculating the gradient (partial derivatives) of the cost function with respect to each parameter and updating the parameters in the direction that reduces the cost.
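
Written out, the update applied to both parameters at every iteration is the standard gradient descent rule, where J(m, b) is the MSE cost and the learning rate α is the step-size factor:

$$m \leftarrow m - \alpha\,\frac{\partial J}{\partial m}, \qquad b \leftarrow b - \alpha\,\frac{\partial J}{\partial b}$$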

  • Why is it necessary to square the errors in the calculation of the mean square error?

    -Squaring the errors ensures that all errors are positive and prevents them from canceling each other out when summed. This makes it easier to optimize and find the minimum value of the cost function.

  • What is the role of the learning rate in the gradient descent algorithm?

    -The learning rate determines the size of the steps taken during each iteration of the gradient descent algorithm. It is a crucial parameter that affects the convergence of the algorithm; a learning rate that is too high may overshoot the minimum, while a rate that is too low makes convergence impractically slow.

  • How does the number of iterations affect the performance of the gradient descent algorithm?

    -The number of iterations determines how many times the gradient descent algorithm will update the parameters. More iterations allow for more refinement and potentially a better fit, but they also increase the computational cost. There is a trade-off between accuracy and efficiency that needs to be considered.

  • What is the significance of visualizing the cost function and the path taken by gradient descent?

    -Visualizing the cost function and the path of gradient descent helps in understanding how the algorithm navigates through the parameter space to find the minimum of the cost function. It provides insights into the convergence behavior and can help in diagnosing issues such as getting stuck in local minima or failing to converge.

  • Why is it important to start with an initial guess for m and b in the gradient descent algorithm?

    -Starting with an initial guess for the parameters m and b is necessary because gradient descent refines the parameters iteratively from wherever it starts. For non-convex cost functions, different starting points may lead to different local minima; the MSE cost for linear regression, however, is convex, so gradient descent converges to the single global minimum provided the learning rate is chosen properly.

  • What are partial derivatives and how do they relate to the gradient descent algorithm?

    -Partial derivatives are derivatives of a function with respect to a single variable, keeping all other variables constant. In the context of gradient descent, partial derivatives of the cost function with respect to each parameter (m and b) provide the direction and magnitude of the steepest descent, which is used to update the parameters and 'descend' towards the minimum.
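
For the MSE cost J(m, b) these partial derivatives have a standard closed form (this matches the formulas shown on screen later in the video):

$$\frac{\partial J}{\partial m} = -\frac{2}{n}\sum_{i=1}^{n} x_i\,\bigl(y_i - (m x_i + b)\bigr), \qquad
\frac{\partial J}{\partial b} = -\frac{2}{n}\sum_{i=1}^{n} \bigl(y_i - (m x_i + b)\bigr)$$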

  • How can one determine when to stop the gradient descent algorithm?

    -The algorithm can be stopped when the change in the cost function between iterations falls below a certain threshold, indicating that the parameters have converged to a minimum. Alternatively, one can compare the cost between different iterations and stop when the cost does not significantly decrease, suggesting that further iterations will not yield a better fit.

  • What is the purpose of using numpy arrays over simple Python lists when implementing gradient descent in Python?

    -Numpy arrays are preferred over simple Python lists for their efficiency in performing matrix operations, which are common in gradient descent calculations. Numpy arrays are also faster due to their optimized implementations and support for vectorized operations.
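
A small sketch of the difference (the arrays are placeholders): with numpy, the element-wise arithmetic that gradient descent needs is a single vectorized expression instead of an explicit loop.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])     # placeholder inputs
y = np.array([5, 7, 9, 11, 13])   # placeholder outputs
m, b = 2.0, 3.0

# With a plain Python list you would need an explicit loop or comprehension:
y_pred_list = [m * xi + b for xi in [1, 2, 3, 4, 5]]

# With numpy arrays the same computation is one vectorized expression:
y_pred = m * x + b
errors = y - y_pred               # element-wise subtraction, no loop needed
print(y_pred_list, errors)
```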

  • In the context of the tutorial, what is the exercise problem that the audience is asked to solve?

    -The exercise problem involves finding the correlation between math scores and computer science scores of students using the gradient descent algorithm. The audience is asked to apply the algorithm to determine the best fit line (values of m and b) and to stop the algorithm when the cost between iterations is within a certain threshold.

Outlines

00:00

📚 Introduction to Machine Learning Concepts

The video begins by introducing key concepts in machine learning, such as the mean square error cost function, gradient descent, and learning rate. The speaker aims to demystify the mathematical equations often encountered in machine learning tutorials, encouraging viewers not to be put off by weak math skills. The tutorial's goal is to implement gradient descent in a Python program, not just to solve a machine learning problem, but to understand the underlying processes for better use of libraries like sklearn. The importance of deriving a prediction function from a dataset is emphasized, using the example of predicting home prices based on area.

05:01

📈 Gradient Descent and Its Visualization

The second paragraph delves into how gradient descent operates, a method for finding the best fit line with fewer iterations. A visual representation is provided, plotting m (slope) and b (intercept) against the mean square error to form a 3D surface. Starting from an initial guess, the process involves taking small steps towards the minimum error, adjusting m and b iteratively. The concept of a learning rate is introduced, which, combined with the slope (or derivative), dictates the size of each step towards the optimal values of m and b.

10:02

🔒 Calculus and Derivatives in Gradient Descent

The third paragraph explains the role of calculus in determining the steps in gradient descent, focusing on derivatives as a means to calculate slopes. The basic concept of a derivative is introduced, including how to calculate it for a function at a particular point. The distinction between a regular derivative and a partial derivative is made, with examples provided. The video also references the channel '3Blue1Brown' for a more in-depth understanding of these mathematical concepts. Finally, the partial derivatives for m and b in the context of the mean square error function are derived and simplified for practical use in the algorithm.

15:05

💻 Implementing Gradient Descent in Python

The fourth paragraph outlines the practical implementation of gradient descent using Python. It begins with initializing m and b, defining the number of iterations, and setting up a loop to perform the iterative process. The loop calculates the predicted y values, the derivatives for m and b, and then updates m and b based on these derivatives and a learning rate. The learning rate is introduced as a key parameter that needs to be fine-tuned. The video also emphasizes the importance of tracking the cost at each iteration to ensure the algorithm is converging towards the correct solution.
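
In code, a single step of the loop described above looks roughly like this; a minimal sketch assuming the placeholder data below (the full reconstructed script appears in the transcript section):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])      # placeholder inputs
y = np.array([5, 7, 9, 11, 13])    # placeholder outputs (exactly y = 2x + 3)
m_curr = b_curr = 0.0              # initial guesses
learning_rate = 0.01
n = len(x)

# One gradient descent step, as described in the paragraph above.
y_predicted = m_curr * x + b_curr                 # predictions for current m, b
md = -(2 / n) * np.sum(x * (y - y_predicted))     # partial derivative w.r.t. m
bd = -(2 / n) * np.sum(y - y_predicted)           # partial derivative w.r.t. b
m_curr -= learning_rate * md                      # step against the slope
b_curr -= learning_rate * bd
print(m_curr, b_curr)
```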

20:06

πŸ” Tuning the Learning Rate and Iterations

The fifth paragraph discusses the fine-tuning of the learning rate and the number of iterations for the gradient descent algorithm. It highlights the importance of monitoring the cost at each iteration to ensure it's decreasing, indicating that the algorithm is making progress. The speaker shares their approach to adjusting the learning rate, starting with a smaller value and increasing it to see the effect on cost reduction. The paragraph also demonstrates how an inappropriately large learning rate can cause the algorithm to overshoot the minimum and start increasing the cost. The goal is to find a balance where the cost continues to decrease with each iteration.
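
One way to systematize the trial-and-error described above is to run a few iterations for several candidate rates and check whether the cost decreases monotonically; a sketch, assuming the same placeholder data:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 7, 9, 11, 13])
n = len(x)

def costs_for(learning_rate, steps=10):
    # Run a few gradient descent steps and record the cost after each one.
    m = b = 0.0
    costs = []
    for _ in range(steps):
        y_pred = m * x + b
        costs.append(np.mean((y - y_pred) ** 2))
        m -= learning_rate * (-(2 / n) * np.sum(x * (y - y_pred)))
        b -= learning_rate * (-(2 / n) * np.sum(y - y_pred))
    return costs

for lr in (0.001, 0.01, 0.08, 0.1):
    c = costs_for(lr)
    ok = all(a >= b for a, b in zip(c, c[1:]))   # cost should never go back up
    print(f"lr={lr}: {'cost decreasing' if ok else 'cost increasing -- rate too big'}")
```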

25:07

📉 Stopping Criteria and Application Exercise

The final paragraph introduces the stopping criteria for the gradient descent algorithm, which involves comparing the cost between iterations to a predefined threshold. The video uses a tolerance level to determine when the algorithm has converged sufficiently. It also presents an exercise for the viewer, which involves applying the gradient descent algorithm to find the correlation between math and computer science scores of students. The exercise aims to give practical experience in implementing and adjusting the gradient descent algorithm to achieve the best fit line for the given dataset.

Keywords

💡 Mean Square Error (MSE)

Mean Square Error (MSE) is a measure of the average squared difference between the estimated values and the actual values. It is used as a cost function to quantify the performance of a predictive model. In the video, MSE is central to understanding how to evaluate the accuracy of the best fit line in a machine learning context. The script mentions that MSE is calculated by taking the sum of the squares of the differences between predicted and actual data points, then dividing by the number of data points.

💡 Gradient Descent

Gradient Descent is an optimization algorithm used to minimize the cost function in machine learning models. It iteratively adjusts the parameters (like m and b in a linear model) to find the best fit line. The video explains that gradient descent allows for efficient finding of the best fit line with fewer iterations by taking 'baby steps' towards the minimum error, visualized as a descent towards the lowest point on a cost function surface.

💡 Learning Rate

The learning rate is a parameter used in gradient descent to control the step size during the update of the model's parameters. It's crucial for the algorithm's convergence; a too-large rate may overshoot the minimum, while a too-small rate may lead to a very slow convergence. In the script, the learning rate is adjusted through trial and error to ensure that the cost is being reduced with each iteration, indicating that the algorithm is converging to the optimal solution.

💡 Linear Regression

Linear Regression is a statistical method for modeling the relationship between a dependent variable 'y' and one or more independent variables denoted as 'x'. The video discusses linear regression in the context of predicting home prices using the area as the independent variable. The goal is to derive a linear equation that can predict future values based on past observations.

💡 Cost Function

A cost function is a mathematical function that calculates the error of a model during training in machine learning. The mean square error is a common example of a cost function. The video emphasizes that the cost function evaluates how well the model's predictions match the actual data, and the goal of gradient descent is to minimize this cost.

💡 Partial Derivative

A partial derivative is a derivative of a function with respect to one variable while keeping the other variables constant. In the context of the video, partial derivatives are used to calculate the slope of the cost function with respect to the parameters 'm' (slope of the line) and 'b' (intercept). These derivatives guide the direction of the 'baby steps' taken by the gradient descent algorithm.

💡 Intercept (b)

In the context of a linear equation 'y = mx + b', the intercept 'b' is the point where the line crosses the y-axis. The video script discusses how 'b' is one of the parameters that need to be determined to find the best fit line using gradient descent, starting with an initial guess and iteratively adjusting it.

💡 Slope (m)

The slope 'm' in a linear equation 'y = mx + b' indicates the steepness of the line and the direction of the increase or decrease in 'y' with respect to 'x'. The script explains that 'm' is another parameter to be found using gradient descent, which represents how much 'y' changes for a one-unit change in 'x'.

💡 Training Data Set

A training data set is a collection of observations or data used for training a machine learning model. The video mentions that the training data set includes input-output pairs that the model uses to learn and derive the prediction function. It's the basis for calculating the error and subsequently the mean square error.

💡 Python Programming

Python is a high-level programming language widely used in machine learning for its simplicity and the powerful libraries it offers, such as NumPy and scikit-learn. The video script includes a Python program that implements gradient descent to find the best fit line, demonstrating how theoretical concepts are applied in practice.

💡 Jupyter Notebook

Jupyter Notebook is an open-source web application that allows creation and sharing of documents containing live code, equations, visualizations, and narrative text. The video script refers to using a Jupyter Notebook for visualization purposes, indicating its utility in machine learning for plotting data and observing the model's convergence.

Highlights

Mean square error cost function, gradient descent, and learning rate are fundamental concepts in machine learning.

The tutorial aims to demystify mathematical equations in machine learning and build a Python program for gradient descent.

Machine learning involves creating a prediction function from a dataset, unlike traditional linear algebra where equations are given.

The best fit line is derived from the dataset, representing the optimal equation for predicting future values.

Mean square error (MSE) is calculated by summing squared differences between actual and predicted data points, then dividing by the number of data points.

Gradient descent is an efficient algorithm for finding the best fit line with fewer iterations.

A visualization of the gradient descent process is provided to understand how m and b values are iteratively adjusted to minimize error.

The learning rate is a critical parameter in gradient descent, determining the step size towards the minimum error.

Derivatives and partial derivatives are essential for calculating the slope of the cost function at a given point.

The partial derivatives of the cost function with respect to m and b guide the direction of the next step in gradient descent.

Python code is used to implement gradient descent, using numpy arrays for efficient matrix operations.

The number of iterations and the learning rate are parameters that need to be fine-tuned for optimal performance.

Monitoring the cost at each iteration is crucial to ensure the algorithm is converging towards the global minimum.

The learning rate should be adjusted to ensure the cost is consistently decreasing, avoiding overshooting the minimum.

Gradient descent can be stopped when the cost between iterations falls below a certain threshold, indicating convergence.

A practical exercise is provided to apply the gradient descent algorithm to find the correlation between math and computer science scores.

The exercise involves using the gradient descent algorithm to determine the best fit line for predicting computer science scores from math scores.

The tutorial concludes with a discussion on stopping criteria for the gradient descent algorithm based on cost convergence.

Transcripts

Mean square error cost function, gradient descent, and learning rate: these are some of the important concepts in machine learning, and that's what we are going to cover today. At the end of this tutorial we will have written a Python program to implement gradient descent.

Now, when you start going through machine learning tutorials, the thing that you inevitably come across is mathematical equations, and by looking at them the first thought that jumps into your mind is: "Oh my god, I suck at math. I used to get 4 out of 50 on my math tests. How am I going to deal with this?" Let's not worry too much about it; we can take one step at a time, and things won't seem that hard. Also, you won't be implementing gradient descent yourself when solving an actual machine learning problem, but the reason we are doing this exercise today is that you should know some of the internals, so that while using the sklearn library you know what's going on and can use the library in a better way.

With that, let's get started. During our linear algebra classes in our school days, we would be given an equation and an input x, and we would compute the value of y. For y = 2x + 3 with x = 3, you multiply the 3 by 2, which gives 6, add 3, and that's how you come up with 9. In machine learning, however, you have observations, a training data set of inputs and outputs, and using that you try to derive an equation, also known as a prediction function, so that you can use this equation to predict y for future values of x.

In the case of our problem of predicting home prices, we saw in the initial linear regression tutorials that we have area and price, and using those we came up with this equation. If you don't know this equation quite well, look at my Jupyter notebook; in that notebook you'll see that I have this coefficient and intercept, and that's what I have put here. So here I have the home prices in Monroe Township, and I have plotted them on this chart, and what we are going to look at is how you can derive this equation given this input and output. Our goal is to derive this equation, and that equation is nothing but this blue line, the best fit line going through all these data points. These data points are scattered, so it's not possible to draw a perfect line, but you try to draw a line which is kind of a best fit. The problem here is that there are so many lines that can potentially go through these data points. My data set is very simple here; if you have a very large data set, and it's scattered all over the place, then drawing these lines becomes even more difficult. So how do you know which of these lines is the best fit line? That's the problem we are going to solve today.

One way is this: you draw any random line, then from each actual data point you calculate the error between that data point and the point predicted by your line; call it a delta. You collect all these deltas and square them. The reason you want to square them is that these deltas could also be negative, and if you don't square them and just add them, the result might be skewed. After that you sum them up and divide by n; n here is five, the number of data points you have available. The result is called the mean square error: your actual data point minus the predicted data point, squared, summed up, and then divided by n. This mean square error is also called a cost function; there are different types of cost functions as well, but mean square error is the most popular one. Here, y predicted is replaced by mx plus b, because you know that y is equal to mx plus b. So that's the equation for the mean square error.
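
For reference, the cost function just described can be written compactly (a standard statement of MSE for the line y = mx + b, reconstructed from the narration):

$$\text{MSE}(m, b) \;=\; \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - (m x_i + b)\bigr)^2$$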

play04:37

now we initially saw that there are so

play04:39

many different lines that you can draw

play04:42

you're not going to try every

play04:44

permutation and combination of

play04:46

m and b because that is very inefficient

play04:50

you want to take some efficient approach

play04:52

where in very

play04:53

less iteration you can reach your answer

play04:56

okay and gradient descent is that

play04:59

algorithm that helps you

play05:00

find the best fit line in

play05:04

very less number of iteration or in a

play05:07

very efficient way

play05:09

okay so we are going to look at how

play05:10

gradient descent works now

play05:13

for that i have plotted m b

play05:17

against the mean square error or a cost

play05:20

function

play05:21

uh so here i have drawn different values

play05:24

of

play05:25

m and here there are different values of

play05:28

p

play05:28

which is your intercept and for

play05:32

every value of m and b you will find

play05:34

some cost so if you

play05:35

keep on plotting those points here and

play05:38

if you create a

play05:39

plane out of it it will look like this

play05:41

it will be like a ball

play05:44

and what you want to do is you want to

play05:47

start with

play05:48

some value of m and b people usually

play05:51

start with zero so here you can see

play05:53

this point has m is zero and b is zero

play05:58

and from that point you calculate the

play06:01

cost so let's say the cost is

play06:02

thousand then you reduce the value

play06:06

of m and b by some amount and we'll see

play06:09

what that amount is later on so you take

play06:12

kind of like a mini step

play06:14

you come here and you will see that the

play06:16

error is now reduced to

play06:18

somewhere around 900 something like that

play06:22

and then again you reduce it by taking

play06:25

one more step you keep on taking these

play06:28

steps

play06:28

until you reach this point which is

play06:32

your minima here

play06:35

the error is minimum and once you reach

play06:39

that you have found your answer

play06:40

because you will use that m and b

play06:45

uh in your prediction function

play06:48

i have plotted these different lines and

play06:51

these different lines will have

play06:53

different values of m and b

play06:54

let's say this orange has m1 b1 so that

play06:57

m1 b1 will be here somewhere

play06:59

blue line will have m2 b2 m2 b2 will be

play07:02

this red dot

play07:03

then red line will have m3 b3 so m3 b3

play07:06

will be somewhere here on this plot so

play07:08

you can have like so many numerous lines

play07:11

which can create this uh plot

play07:14

now we just said that you will take this

play07:18

baby step but how exactly you do it

play07:20

because visually it sounds easy but

play07:22

mathematically

play07:23

when you give this task to your computer

play07:26

uh

play07:26

you have to come up with some concrete

play07:29

approach all right

play07:30

so we'll look into that but

play07:34

here is the nice visualization of uh

play07:38

how you can reduce your m and b

play07:41

and reach the best fit line okay all

play07:44

right

play07:45

so let's look at how you're going to

play07:47

take those baby steps

play07:49

so i have these two charts uh if you

play07:52

look at this 3d

play07:53

plot from this direction what you will

play07:57

see is a chart of

play07:58

b against cost and that will be

play08:02

this curvature similarly if you look at

play08:04

this chart

play08:05

from this direction the chart will look

play08:08

something like this

play08:10

and in both the cases you are starting

play08:12

at this point which is this star

play08:15

and then taking these many steps

play08:19

and trying to reach this minimum point

play08:22

which is this red dot

play08:25

now how do you take these steps

play08:28

right so one approach is let's say you

play08:31

take

play08:31

fixed size steps so here if you plot

play08:35

this

play08:36

against b i'm taking fixed size steps on

play08:39

b

play08:40

but the problem that it can create is by

play08:43

taking these steps i

play08:44

might miss the global minima okay i

play08:47

might just miss it

play08:49

and my gradient descent will never

play08:50

converge it

play08:53

from this point it will just start going

play08:55

up and up and i don't know

play08:57

where is my minima all right so this

play08:59

approach is not going to work

play09:02

what can work is if you take steps like

play09:05

this

play09:06

so here in each step i am following the

play09:09

curvature

play09:11

of my chart and also as i

play09:15

reach near to my red point you can see

play09:18

that

play09:19

the step size is reducing you see like

play09:21

this arrow is bigger and these arrows

play09:23

are getting smaller and smaller

play09:26

if i do something like this then

play09:29

i can reach this minima now how do i do

play09:32

that

play09:32

so at each point you need to calculate

play09:36

the slope

play09:38

okay so for example at this point the

play09:41

slope will be this this is a tangent

play09:43

at the curvature and at this point here

play09:46

my slope will be this

play09:49

once i have a slope i know which

play09:52

direction i need to go in for example

play09:54

if you look at this green line and if

play09:56

i'm at this blue dot

play09:58

i know that i need to go in this

play10:00

direction

play10:02

and then there is something called a

play10:04

learning rate which you can use

play10:06

in conjunction with this slope here

play10:09

to take that step and reach the next

play10:12

point

play10:13

now we have to get into calculus a

play10:17

little bit

play10:18

because calculus allows you to

play10:22

figure out these baby steps and

play10:25

when we are talking about these slopes

play10:28

really this slope is nothing but a

play10:30

derivative of

play10:31

b with respect to this cost function

play10:34

okay if you want to go in details i

play10:37

recommend this channel three blue

play10:41

one brown this guy is very good in

play10:43

explaining mathematical concept

play10:45

using a nice visualization so you will

play10:48

really find it very useful

play10:52

and pleasing but if you don't want to go

play10:55

in details then

play10:56

in this tutorial i'll just quickly walk

play10:58

over some of the basic concepts okay

play11:01

let's look at what is derivative so

play11:03

derivatives is

play11:04

all about slope i'm on this uh

play11:07

website called math is fun and these

play11:10

guys have explained it really well

play11:12

so slope is nothing but a change in y

play11:15

divided by change in x

play11:16

okay so if you have line like this and

play11:19

if you want to calculate slope between

play11:21

the two points here

play11:22

it is 24 divided by 15 but what if you

play11:26

want to calculate the slope

play11:28

at a particular point right like in our

play11:32

case if you remember

play11:36

here we want to calculate a slope at a

play11:38

particular point

play11:39

same thing here right so that slope

play11:42

will be nothing but a small change in y

play11:45

divided by small

play11:46

change in x all right

play11:50

we'll say as

play11:53

x shrinks to 0 and y shrinks to 0

play11:57

that's when you get more accurate slope

play12:01

okay so for the equation like x square

play12:04

that slope will be 2x okay

play12:08

this is again called a derivative

play12:10

derivative

play12:11

is mentioned by this notation d by dx

play12:14

and the derivative of x square is 2x

play12:18

so for example for this chart the slope

play12:21

here

play12:21

is 4 because this is x square

play12:24

and and the value here for the slope

play12:28

will be

play12:29

4. okay now let's look at what is
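
A quick numerical sanity check of the rule just quoted; the function and point (x² at x = 2) come from the narration, while the step size h is an arbitrary choice:

```python
# Approximate the derivative of f(x) = x**2 at x = 2 with a central difference.
def numerical_derivative(f, x, h=1e-6):
    # (f(x + h) - f(x - h)) / (2h) approaches f'(x) as h shrinks.
    return (f(x + h) - f(x - h)) / (2 * h)

print(numerical_derivative(lambda x: x ** 2, 2.0))  # ~4.0, matching d/dx x**2 = 2x
```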

Now let's look at what a partial derivative is. When you have an equation where your function depends on two variables, x and y, you calculate the partial derivative with respect to x by treating y as a constant; so here f_x is nothing but the partial derivative of the function with respect to x. Similarly, when you want to calculate the partial derivative of this function with respect to y, you treat x as a constant and then differentiate with respect to y. The general rule is this: say you have y cubed; how do you come up with 3y squared? You put the 3 in front of the y and then subtract 1 from the exponent, and that's how you get 3y squared. So those are some of the basic concepts of derivatives and partial derivatives; again, if you want to go into detail, just follow the 3Blue1Brown YouTube channel, where these concepts are explained in depth.

Okay, just to revise the concept: the derivative of this function will be 3x squared, and this is the notation for your derivative. For functions that depend on two variables, you take partial derivatives; the partial derivative of this function with respect to x will be this, and with respect to y it will be 2y. And this is how you write a partial derivative; the notation looks sort of like a d, but it's a curly d (∂).

So now, going back to our problem of the line: here we want to find the partial derivative with respect to b, and for the other chart we want to find the partial derivative with respect to m. How do you find those? This is your mean square error function, and the partial derivative with respect to m will be this, and the partial derivative with respect to b will be this. Now, I'm not going to go into detail about how we arrived at these; you can follow other resources, or you can just accept these equations, sort of like a rule, you know, like why the earth rotates around the sun or why humans have two eyes. But one hint I can give you: see, this term has a square, and generally for a derivative you put the 2 in front, so the 2 came here, and the exponent becomes 2 minus 1, which is 1, which we don't show. All right. Once you have the partial derivatives, what you have is a direction: the partial derivatives give you a slope, and once you have the direction, you need to take a step. For the step you use something called the learning rate: you take the initial value of m, and then you subtract the learning rate times the slope. So, for example, you are here on this chart at your b1 value; to come up with the b2 value, you subtract the learning rate multiplied by the partial derivative, which is nothing but the slope here.

Now let's write Python code to implement gradient descent. I'm going to use PyCharm today instead of a Jupyter notebook, because I'm planning to use some of the debugging features; the PyCharm Community Edition is freely available to download from the JetBrains website. The problem we are solving here is this: we have x and y vectors, and we want to derive the best fit line, an equation in m and b. So you have x and y, and you want to come up with the correct values of m and b; that's our objective.

Here I'm going to use numpy arrays instead of plain Python lists, because matrix multiplication is very convenient with them, and numpy arrays also tend to be faster than plain Python lists. The first thing we are going to do is start with some values of m_current and b_current. Again, to revise the theory: you start with some value of m and b, and then you take these baby steps to reach the global minimum; as you can see in the chart, we started with m and b values of zero and then took these steps one by one to reach the global minimum. Another thing you need to do is define the number of iterations; you have to define how many baby steps you are going to take. I'm going to start with 1000 and then fine-tune it. Again, all of this is pretty much trial and error: I will start with some parameters, see how my algorithm behaves, and then fine-tune them.

So let's run a simple for loop that just iterates that many times. At each step, the first thing you do is calculate the predicted value of y: y_predicted is nothing but m_current times x plus b_current. Pretty straightforward, y is equal to mx plus b; for the current guesses you have for m and b, you are calculating the predicted y. The next step is to calculate the m derivative and the b derivative. The m derivative, which I'm going to call md, is minus 2 over n multiplied by a sum. Now, what is n? n is the number of data points (I'm assuming x and y have the same length; if that is not the case, you can add the necessary validation and raise an error). And what is that sum? It is the sum of x multiplied by (y minus y_predicted). The b derivative's equation is the same, except that it doesn't have the multiplication by x. Once you have those, you adjust m_current as shown in the equation: your next m will be your current m minus the learning rate times the m derivative. We have the m derivative, but we need a learning rate, so I'm going to define one now. Again, this is a parameter you have to start with at some value; I'm going to start with 0.001, which is where people generally start, and then gradually adjust it. You can remove zeros, you can try other values; this is trial and error: you see how your algorithm behaves, and then you tweak those parameters. Similarly for b: the new value of b is b minus the learning rate times its partial derivative. And then at each iteration I will print their values; let me print the iteration number as well, so that I know what's going on at each iteration.
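
Here is a runnable sketch of the script as narrated so far; the variable names follow the narration, while the toy data is an assumption consistent with it (five points with an exact fit of y = 2x + 3, so the expected answer is m = 2, b = 3):

```python
import numpy as np

def gradient_descent(x, y):
    m_curr = b_curr = 0      # start at m = 0, b = 0, as in the video
    iterations = 1000        # number of "baby steps"; fine-tuned later
    learning_rate = 0.001    # starting value; tuned by trial and error
    n = len(x)               # assumes x and y have the same length

    for i in range(iterations):
        y_predicted = m_curr * x + b_curr             # y = mx + b for current guesses
        md = -(2 / n) * sum(x * (y - y_predicted))    # partial derivative w.r.t. m
        bd = -(2 / n) * sum(y - y_predicted)          # partial derivative w.r.t. b
        m_curr = m_curr - learning_rate * md          # step against the slope
        b_curr = b_curr - learning_rate * bd
        print(f"m {m_curr}, b {b_curr}, iteration {i}")

# Toy data consistent with the narration (five points, exact fit y = 2x + 3).
x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 7, 9, 11, 13])
gradient_descent(x, y)
```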

Okay, all right, so my code looks good enough and I can just run it: right-click and run. So let's see what happened. You can see that we started with some values of m and b, and in the end we are at 2.44 and 1.38. Now, if you want to know how well you are doing, you need to print the cost at each iteration; you should be reducing your cost. If you remember the 3D diagram, at each step you should be reducing your cost; sometimes, if you don't write your program well and you start increasing the cost, you are never going to find the answer. So let's print the cost. What is our cost? If you check the equation, it is 1 divided by n, multiplied by the sum of the squared differences between y and y_predicted. So we need a list here, where each element is one of those squared differences: you run a for loop over the values in y minus y_predicted (I'm using a list comprehension here), and for each of these values you take its square; this is what deals with the negative values. After that I will print the cost at each iteration, and when I run this I can now track the cost: you can see the cost is reducing at each of these steps.
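
The cost computation described above, as it would be added inside the loop (a reconstruction; the list comprehension mirrors the narration, and the placeholder values stand in for the loop's state):

```python
import numpy as np

# Placeholder state standing in for one pass through the loop above.
n = 5
y = np.array([5, 7, 9, 11, 13])
y_predicted = np.array([4.8, 7.1, 9.2, 10.7, 13.3])

# The cost line added inside the loop, using a list comprehension as narrated;
# squaring each difference is what deals with the negative values.
cost = (1 / n) * sum([val ** 2 for val in (y - y_predicted)])
print(cost)
```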

Now, how do I know when I need to stop? I can keep increasing my iterations, and you can see that I am getting closer and closer to my expected m and b values, which are 2 and 3. You can also manipulate your learning rate. What I usually like to do is first keep the iterations low, start with some learning rate, and see whether I am reducing the cost on each iteration. Here, with this learning rate, I can see that I am reducing my cost, so maybe I can take an even bigger step: I will remove a zero and try 0.01. With that, the cost is also decreasing, so I think this is fine. Let me try one more, bigger step, 0.1. Now you can see that the cost started increasing, so this learning rate is too big: I am crossing my global minimum and shooting off in the other direction. So I have to be between 0.01 and 0.1. How about 0.09? There, too, the cost is increasing, so maybe 0.08. Okay, that looks good; here the cost is decreasing. All right, so I will stick with this learning rate and increase my iterations to, let's say, 10,000. You can see that now I have reached my optimal values: the expected value of m was 2 and b was 3, and we are almost at 3, and you can see the cost is very small. This is how you can approach your gradient descent algorithm: stop whenever you reach some cost threshold, or compare the cost between iterations. See, the property of this curvature is that once you reach the global minimum, your cost will roughly stay the same (if you're using a correct learning rate); so here you see that over all these iterations the cost remains almost constant. So you can use a floating point comparison, just compare two consecutive costs, and stop whenever your cost is not reducing by much.

I also have a visual representation of how my m and b move towards the best fit line: we started here, and then gradually we moved closer through those points. Those red points are not quite visible, but they are here and here, and you can see that gradually I am getting closer and closer to those points. You can use this Jupyter notebook for visualization purposes.

Now we'll move into our exercise section. The problem that you have to solve today is this: you are given the math and computer science test scores for a set of students, and you have to find the correlation between the math score and the computer science score. In summary, the math score (column B) is your x, and the computer science score (column C) is your y. Using these, you will find values of m and b by applying the gradient descent algorithm. You have to compare the cost between iterations, and stop when consecutive costs are within a certain threshold; to compare them, we are going to use the math.isclose function with a tolerance of 1e-20. So, if your two costs are within this range, you have to stop your for loop, and you have to tell me how many iterations you needed to figure out the values of m and b.
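
A hedged skeleton for the exercise's stopping criterion: math.isclose and the 1e-20 tolerance come from the narration, while the rel_tol keyword, the learning rate, the cost bookkeeping, and the sample score arrays are illustrative assumptions (the real exercise reads the scores from a CSV):

```python
import math
import numpy as np

def gradient_descent(x, y, learning_rate=0.0001, max_iterations=1_000_000):
    m_curr = b_curr = 0.0
    n = len(x)
    cost_previous = None

    for i in range(max_iterations):
        y_predicted = m_curr * x + b_curr
        cost = (1 / n) * sum((y - y_predicted) ** 2)
        md = -(2 / n) * sum(x * (y - y_predicted))
        bd = -(2 / n) * sum(y - y_predicted)
        m_curr -= learning_rate * md
        b_curr -= learning_rate * bd
        # Stop once two consecutive costs agree within the given tolerance.
        if cost_previous is not None and math.isclose(cost, cost_previous, rel_tol=1e-20):
            return m_curr, b_curr, i
        cost_previous = cost

    return m_curr, b_curr, max_iterations

# Hypothetical scores standing in for the CSV from the video (column B / column C).
math_scores = np.array([92, 56, 88, 70, 80])   # x
cs_scores = np.array([98, 68, 81, 80, 83])     # y
m, b, iterations = gradient_descent(math_scores, cs_scores)
print(f"m = {m}, b = {b}, stopped after {iterations} iterations")
```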


Related Tags
Machine Learning, Gradient Descent, Cost Function, Mean Square Error, Mathematical Concepts, Predictive Modeling, Python Programming, Data Analysis, Algorithm Optimization, Learning Rate, Linear Regression