Neural Networks Demystified [Part 3: Gradient Descent]

Welch Labs
21 Nov 2014 · 06:55

Summary

TL;DR: This script discusses improving neural network predictions by introducing a cost function to quantify prediction errors. It explains the need to minimize this cost function, which is a function of the network's weights and its training examples. It highlights the impracticality of brute-force optimization due to the curse of dimensionality and introduces gradient descent as a far more efficient method. It also touches on the importance of choosing a convex cost function to avoid local minima and mentions stochastic gradient descent as a variant that can cope with non-convex functions.

Takeaways

  • 🧠 **Inaccuracy in Predictions**: The initial neural network predictions were inaccurate, indicating a need for improvement.
  • 🔍 **Cost Function Introduction**: A cost function was introduced to quantify the error in predictions, which is essential for model improvement.
  • 📈 **Cost Function Calculation**: The cost function sums the squares of the errors, multiplied by one half for simplification.
  • 🔧 **Minimizing the Cost**: The process of training a network is essentially minimizing this cost function.
  • 🔗 **Cost Function Dependency**: The cost function depends on the examples and the weights of the synapses.
  • 🚀 **Brute Force Inefficiency**: Trying all possible weights (brute force) is inefficient due to the curse of dimensionality.
  • ⏱️ **Time Complexity**: As the number of weights increases, the time required to find the optimal weights grows exponentially.
  • 📉 **Gradient Descent**: Gradient descent is introduced as a method to minimize the cost function by iteratively moving in the direction that reduces the cost.
  • 📐 **Convexity of Cost Function**: The choice of a quadratic cost function ensures its convexity, which aids gradient descent in finding the global minimum.
  • 🔄 **Stochastic Gradient Descent**: Using examples one at a time (stochastic gradient descent) can sometimes bypass the issue of non-convex loss functions.
  • 🛠️ **Batch Gradient Descent**: The script concludes with the intention to implement batch gradient descent, using all examples at once to keep the function convex.

Q & A

  • What was the initial problem with the neural network's predictions?

    -The initial problem was that the neural network made really bad predictions of a test score based on the number of hours slept and studied the night before.

  • What is a cost function in the context of neural networks?

    -A cost function is a measure of how wrong or costly our model's predictions are given our examples. It quantifies the error between the predicted values and the actual values.

  • How is the overall cost computed in the script?

    -The overall cost is computed by squaring each error value and adding these values together, then multiplying by one half to simplify future calculations.

  • What is meant by minimizing the cost function?

    -Minimizing the cost function means adjusting the weights of the neural network to reduce the error between the predicted and actual values to the smallest possible amount.

  • Why is it not feasible to try all possible weights to find the best combination?

    -Trying all possible weights is not feasible due to the curse of dimensionality, which exponentially increases the number of evaluations required as the number of weights increases.

  • What is the curse of dimensionality?

    -The curse of dimensionality refers to the phenomenon where the volume of the space increases so fast with the addition of each dimension that even a large number of samples provides little information about the overall space.

  • How long would it take to evaluate all possible combinations of nine weights?

    -Evaluating all possible combinations of nine weights, at a thousand candidate values per weight, would take about 1,268,391,679,350,583.5 years, roughly 1.27 quadrillion years.
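That figure can be reproduced with simple arithmetic, assuming a thousand candidate values per weight, 0.04 seconds per thousand evaluations, and a 365-day year:

```python
values_per_weight = 1_000
num_weights = 9
seconds_per_eval = 0.04 / 1_000        # 0.04 s to check 1,000 weight values

evaluations = values_per_weight ** num_weights   # 1,000^9 = 10^27 combinations
total_seconds = evaluations * seconds_per_eval   # 4 * 10^22 seconds
years = total_seconds / (365 * 24 * 3600)

print(f"{years:,.1f} years")  # roughly 1.27 quadrillion years
```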

  • What is numerical estimation and why is it used?

    -Numerical estimation is a method used to approximate the value of a mathematical function. It's used to determine the direction in which the cost function is decreasing, which helps in adjusting the weights.
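A minimal sketch of this left-and-right testing, written as a central difference; `J` here is a stand-in one-weight cost function, not the network's actual cost:

```python
def J(w):
    # Stand-in convex cost with its minimum at w = 3
    return (w - 3.0) ** 2

def numerical_slope(f, w, eps=1e-4):
    """Estimate the slope by testing the cost just left and right of w."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

slope = numerical_slope(J, 1.1)
print(slope)  # negative, so downhill is to the right: increase w
```

A negative estimate means the cost is decreasing as w grows, so the weight should be increased; a positive estimate means the opposite.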

  • What is gradient descent and how does it help in minimizing the cost function?

    -Gradient descent is an optimization algorithm used to find the minimum of a function by iteratively moving in the direction of the steepest descent as defined by the negative of the gradient. It helps in minimizing the cost function by taking steps in the direction that reduces the cost the most.
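A minimal gradient descent loop on a one-dimensional convex cost; the cost function, starting point, and learning rate here are illustrative assumptions:

```python
def J(w):
    return (w - 3.0) ** 2           # convex cost, global minimum at w = 3

def dJ_dw(w):
    return 2.0 * (w - 3.0)          # analytic gradient of J

w = 1.1                             # starting guess
learning_rate = 0.1
for _ in range(100):
    w -= learning_rate * dJ_dw(w)   # step opposite the gradient (downhill)

print(w)  # close to 3.0, the global minimum
```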

  • Why is the cost function chosen to be the sum of squared errors?

    -The cost function is chosen to be the sum of squared errors to exploit the convex nature of quadratic equations, which ensures that the function has a single global minimum and no local minima.

  • What is the difference between batch gradient descent and stochastic gradient descent?

    -Batch gradient descent uses all examples at once to compute the gradient and update the weights, whereas stochastic gradient descent uses one example at a time, which can lead to a noisy but potentially faster optimization process.
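The two update styles can be sketched with a toy one-parameter model `y_hat = w * x`; the data, learning rate, and loop counts are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
y = 2.0 * x                                    # data generated with true w = 2

def grad(w, xi, yi):
    # dJ/dw for J = 0.5 * (yi - w*xi)^2
    return -(yi - w * xi) * xi

# Batch: one update per pass, gradient averaged over all examples
w_batch, lr = 0.0, 0.5
for _ in range(200):
    w_batch -= lr * np.mean(grad(w_batch, x, y))

# Stochastic: one (noisier) update per example
w_sgd = 0.0
for _ in range(200):
    for xi, yi in zip(x, y):
        w_sgd -= lr * grad(w_sgd, xi, yi)

print(w_batch, w_sgd)  # both approach the true weight, 2.0
```

Batch descent takes one smooth step per pass through the data; stochastic descent takes many small, noisy steps, and that noise is what can help it escape shallow local minima in non-convex problems.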

Outlines

00:00

🧠 Improving Neural Network Predictions

The script discusses the initial poor performance of a neural network in predicting test scores based on sleep and study hours. It introduces the concept of a cost function to quantify prediction errors. The cost function is defined as the sum of squared errors, multiplied by one half for simplification. The goal is to minimize this cost function by adjusting the weights of the network's synapses. The script highlights the impracticality of brute-force optimization due to the curse of dimensionality, which exponentially increases the number of evaluations needed as more weights are added. It suggests using calculus to find the direction that minimizes the cost, specifically through the use of partial derivatives to guide weight adjustments.

05:05

📈 The Power of Gradient Descent

This section delves into the efficiency of gradient descent for minimizing cost functions in higher dimensions. It contrasts the impracticality of brute-force methods with the speed and effectiveness of gradient descent, which iteratively moves in the direction that reduces cost. The script also addresses the potential issue of non-convex cost functions that could lead gradient descent to a local minimum rather than a global minimum. It explains the choice of a convex cost function to avoid this problem and introduces stochastic gradient descent, which processes examples one at a time and may be less sensitive to non-convexity. The section concludes by emphasizing the importance of understanding the details of gradient descent for optimizing neural network performance.


Keywords

💡Neural Network

A neural network is a computational model inspired by the human brain that is designed to recognize patterns. It consists of interconnected units or nodes called neurons, which process information through a network. In the video script, the neural network is used to predict test scores based on hours of sleep and study. The script discusses improving the predictions by adjusting the network's weights, which are the parameters that determine the strength of the connections between neurons.

💡Cost Function

The cost function, also known as the objective function or loss function, is a measure of how well the neural network is performing. It quantifies the difference between the predicted values and the actual values. The script describes using the sum of squared errors as the cost function, which is the sum of the squares of the differences between the predicted values (y hat) and the actual values (y). Minimizing this cost function is a key goal in training a neural network.

💡Weights

In the context of neural networks, weights are the numeric values that are used to calculate the strength of the connections between neurons. The script mentions that the cost function is a function of the examples and the weights on the synapses, which are the connections between neurons. Adjusting these weights is essential for improving the accuracy of the network's predictions.

💡Gradient Descent

Gradient descent is an optimization algorithm used to minimize the cost function by iteratively moving in the direction that most decreases the cost. The script explains that instead of trying all possible weight combinations (brute force), gradient descent allows for a more efficient search by taking steps in the direction that reduces the cost. It's a fundamental technique in training neural networks and is particularly effective in high-dimensional spaces.

💡Partial Derivative

A partial derivative is a derivative that measures how one variable of a multivariable function changes when other variables are held constant. In the script, the partial derivative of the cost function with respect to a weight is calculated to determine the direction in which the cost decreases most rapidly for that weight. This is used in gradient descent to update the weights.
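For a single example and a toy one-weight model y_hat = w * x, the cost J = 0.5 * (y - w*x)^2 has the partial derivative dJ/dw = -(y - w*x) * x by the chain rule; a quick check of the analytic result against a numerical estimate (the names and values are assumptions for illustration):

```python
def J(w, x, y):
    return 0.5 * (y - w * x) ** 2

def dJ_dw(w, x, y):
    # Chain rule: d/dw [0.5 * (y - w*x)^2] = -(y - w*x) * x
    return -(y - w * x) * x

w, x, y = 1.1, 2.0, 5.0
eps = 1e-6
numeric = (J(w + eps, x, y) - J(w - eps, x, y)) / (2 * eps)
print(dJ_dw(w, x, y), numeric)  # both about -5.6
```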

💡Curse of Dimensionality

The curse of dimensionality refers to the various problems that arise when working with high-dimensional data, such as the increased computational complexity and the sparsity of the data. The script uses this term to illustrate the impracticality of trying all possible combinations of weights when the number of weights increases, as it would require an exponentially growing number of evaluations.

💡Convex Function

A convex function is a function for which any straight line connecting two points on the graph of the function lies above or on the graph. The script mentions that the chosen cost function, the sum of squared errors, is convex, which ensures that gradient descent will not get stuck in local minima but will find the global minimum.

💡Stochastic Gradient Descent

Stochastic gradient descent is a variant of gradient descent where the gradient is estimated using only one training example (or a small batch) at a time, rather than the entire dataset. The script suggests that even if the cost function is not convex, using stochastic gradient descent can still yield good solutions, as it may not matter if the function is convex when using examples one at a time.

💡Local Minimum

A local minimum is a point where all neighboring points have a higher function value, but it is not the lowest point in the entire function. The script cautions that non-convex cost functions can lead to local minima, which might cause gradient descent to stop at a suboptimal solution instead of finding the global minimum.

💡Global Minimum

The global minimum is the lowest point of a function across the entire input space. In the context of the script, finding the global minimum is the ultimate goal of using gradient descent to minimize the cost function, as it represents the optimal set of weights for the neural network.

Highlights

Building a neural network to predict test scores based on sleep and study hours.

The current model's predictions are inaccurate.

Introducing the concept of a cost function to quantify prediction errors.

Cost function is the sum of squared errors, multiplied by one half for simplicity.

Minimizing the cost function is equivalent to training the network.

The cost function depends on examples and the weights of synapses.

The curse of dimensionality makes brute force optimization impractical.

Checking a thousand values for one weight takes 0.04 seconds.

For nine weights, brute force would take an astronomical amount of time.

Gradient descent is introduced as a more efficient optimization method.

Gradient descent iteratively takes steps downhill to minimize cost.

Gradient descent can find solutions much faster than brute force.

The potential issue of getting stuck in a local minimum with non-convex cost functions.

Choosing a convex cost function to avoid issues with local minima.

Stochastic gradient descent as an alternative when using examples one at a time.

The upcoming coding of gradients in the next session.

Transcripts

play00:01

Last time we built a neural network in Python that made really bad predictions of your score on a test

play00:06

based on how many hours you slept and how many hours you studied the night before.

play00:11

This time we'll focus on the theory of making those predictions better

play00:15

We can initialize the network we built last time and pass in our normalized data X

play00:20

using our forward method, and have a look at our estimate of y, y hat.

play00:25

Right now our predictions are pretty inaccurate. To improve our model,

play00:29

we first need to quantify exactly how wrong our predictions are. We'll do this with a cost function. A

play00:35

cost function allows us to express exactly how wrong, or costly, our model is given our examples.

play00:42

One way to compute an overall cost is to take each error value, square it, and add these values together.

play00:50

Multiplying by one half will make things simpler down the road

play00:53

Now that we have a cost, our job is to minimize it. When

play00:57

someone says they're training a network, what they really mean is that they're minimizing a cost function.

play01:02

Our cost is a function of two things

play01:05

Our examples, and the weights on our synapses

play01:07

We don't have much control over our data, so we'll minimize our cost by changing the weights.

play01:13

Conceptually this is pretty simple: we have a collection of nine individual weights,

play01:18

And we're saying that there is some combination of w's that will make our cost, j, as small as possible.

play01:24

When I first saw this problem in Machine learning

play01:26

I thought, I'll just try all the weights until I find the best one. After all, I have a computer!

play01:33

Enter the curse of dimensionality.

play01:35

Here's the problem: let's pretend for a second that we only have one weight instead of nine.

play01:40

To find the ideal value for our weight that will minimize our cost

play01:44

we need to try a bunch of values for w. Let's say we test a thousand values.

play01:50

That doesn't seem so bad; after all, my computer is pretty fast.

play01:54

It takes about 0.04 seconds to check a thousand different weight values for our neural network.

play02:00

Since we've computed the cost for a wide range of values of w, we can just pick the one with the smallest cost,

play02:06

Let that be our weight, and now we've trained our network

play02:09

So you may be thinking that 0.04 seconds to train a network is not so bad,

play02:13

And we haven't even optimized anything yet. Plus, there are other way faster languages than python out there

play02:20

Before we optimize though, let's consider the full complexity of the problem

play02:25

Remember, the 0.04 seconds required is only for one weight, and we have nine total.

play02:31

Let's next consider two weights for a moment. To maintain the same precision,

play02:35

we now need to check 1,000 times 1,000, or a million, values. This is a lot of work, even for a fast computer.

play02:43

After our million evaluations we found our solution

play02:46

but it took an agonizing 40 seconds. The real curse of dimensionality kicks in as we continue to add dimensions.

play02:53

Searching through three weights would take a billion evaluations, or 11 hours. Searching through all nine weights

play02:59

we need for our simple neural network would take one quadrillion,

play03:03

268 Trillion 391 billion

play03:06

679 million three hundred and fifty thousand five hundred and eighty three and a half years

play03:11

For that reason, the "just try everything" or brute force optimization method is clearly not going to work.

play03:18

Let's return to the one-dimensional case and see if we can be more clever

play03:22

Let's evaluate our cost function for a specific value of w. If w is 1.1, for example,

play03:29

We can run our cost function and see that J is 2.8

play03:33

We haven't learned much yet, but let's try to add a little more information to what we already know.

play03:39

What if we could figure out which way was downhill? If we could, we would know whether to make w smaller or larger to

play03:46

decrease the cost.

play03:47

We could test the cost function immediately to the left and to the right of our test point and see which is smaller

play03:53

This is called numerical estimation, and it is sometimes a good approach, but for us there is a better way.

play03:59

Let's look at our equation so far

play04:02

We have five equations, but we could really think of them as one big equation. And

play04:07

since we have one big equation that uniquely determines our cost J from X, y, W1, and W2,

play04:13

we can use our good friend calculus to find exactly what we're looking for.

play04:18

We want to know which way is downhill; that is, what is the rate of change of J with respect to W,

play04:24

also known as the derivative. And in this case, since we're just considering one weight at a time, this is a partial derivative.

play04:32

We can derive an expression for dJ/dW

play04:35

that will give us the rate of change of J with respect to W for any value of W. If dJ/dW

play04:41

is positive, then the cost function is going uphill; if dJ/dW is negative, the cost function is going downhill.

play04:48

Now we can really speed things up, since we know in which direction the cost decreases.

play04:52

We can save all the time that we would have spent searching in the wrong direction

play04:57

We can save even more computational time by iteratively taking steps downhill and stopping when the cost stops getting smaller.

play05:05

This method is known as gradient descent and although it may not seem so impressive in one dimension

play05:10

it is capable of incredible speedups in higher dimensions. In

play05:14

fact, in our final video we'll show that what would have taken 10 to the 27th function evaluations with our brute force method

play05:21

will take less than a hundred evaluations with gradient descent.

play05:25

Gradient descent allows us to find needles in very very very large haystacks

play05:31

Now, before we celebrate too much, there is a restriction.

play05:34

What if our cost function doesn't always go in the same direction? What if it goes up and then back down?

play05:40

The mathematical name for this is non-convex,

play05:43

and it could really throw off our gradient descent algorithm by getting it stuck in a local minimum instead of our ideal global minimum.

play05:51

One of the reasons we chose our cost function to be the sum of squared errors was to exploit the convex nature of quadratic equations.

play05:59

We know that the graph of y equals x squared is a nice convex parabola, and it turns out that higher-dimensional versions are too.

play06:07

Another piece of the puzzle

play06:08

here is that, depending on how we use our data, it might not matter if our function is convex or not. If

play06:14

we use our examples one at a time instead of all at once,

play06:18

sometimes it won't matter whether our function is convex; we will still find a good solution.

play06:23

This is called stochastic gradient descent.

play06:26

So maybe we shouldn't be afraid of non-convex loss functions, as neural network wizard Yann LeCun says in his excellent

play06:32

talk, "Who is Afraid of Non-Convex Loss Functions?"

play06:36

The details of gradient descent are a deep topic for another day. For now,

play06:40

We're going to do our gradient descent batch style

play06:43

where we use all our examples at once, and the way we've set up our cost function will keep things nice and convex.

play06:49

Next time we'll compute and code up our gradients


Related Tags
Neural Networks, Machine Learning, Cost Function, Gradient Descent, Optimization, Data Science, Predictive Modeling, Algorithms, Convex Functions, Stochastic