Neural Networks Demystified [Part 4: Backpropagation]

Welch Labs
5 Dec 2014 · 07:56

Summary

TL;DR: This script explains the process of training a neural network using gradient descent to predict test scores based on sleep and study hours. It covers the computation of gradients for two weight matrices, applying the sum rule and chain rule in differentiation, and the importance of the sigmoid function's derivative. The script also details backpropagation, calculating the gradient with respect to each weight, and updating weights to minimize cost. It sets the stage for implementing these concepts in Python and hints at future discussions on gradient checking.

Takeaways

  • 🧠 **Gradient Descent for Neural Networks**: The script discusses using gradient descent to train a neural network to predict test scores based on sleep and study hours.
  • 🔍 **Separation of Gradient Calculation**: The computation of the gradient \( \partial J/\partial W \) is divided into two parts corresponding to the two weight matrices, \( W(1) \) and \( W(2) \).
  • 📏 **Gradient Dimensionality**: The gradient matrices \( \partial J/\partial W(1) \) and \( \partial J/\partial W(2) \) are the same size as the weight matrices they correspond to.
  • 🔄 **Sum Rule in Differentiation**: The derivative of the sum in the cost function is the sum of the derivatives, allowing for the summation to be moved outside of the differentiation.
  • 🌊 **Backpropagation as Chain Rule Application**: The script emphasizes the chain rule's importance in backpropagation, suggesting it's essentially an ongoing application of the chain rule.
  • 📉 **Derivative of the Sigmoid Function**: A Python method for the derivative of the sigmoid function, `sigmoidPrime`, is introduced to calculate \( \partial \hat{y}/\partial z(3) \).
  • 🔗 **Chain Rule for Weight Derivatives**: The chain rule is applied again to break down \( \partial \hat{y}/\partial W(2) \) into \( \partial \hat{y}/\partial z(3) \times \partial z(3)/\partial W(2) \); the resulting gradient expressions are summarized right after this list.
  • 🎯 **Error Backpropagation**: The calculus backpropagates the error to each weight: weights attached to synapses with larger activations contribute more to the error and receive larger gradient values.
  • 📊 **Matrix Operations for Gradients**: Matrix multiplication is used to handle the summation of \( \partial J/\partial W \) terms across examples, effectively aggregating the gradient.
  • 📈 **Gradient Descent Direction**: The direction to decrease cost is found by computing \( \partial J/\partial W \), which indicates the uphill direction in the optimization space.
  • 🔍 **Numerical Gradient Checking**: The script ends with a note on the importance of numerical gradient checking to verify the correctness of the mathematical derivations.
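
For reference, here is a compact, editorial summary of the gradient expressions the video derives, in the series' notation (⊙ denotes element-wise multiplication, X is the input matrix, and f' is the derivative of the sigmoid activation):

\[
\delta^{(3)} = -\bigl(y - \hat{y}\bigr) \odot f'\bigl(z^{(3)}\bigr), \qquad
\frac{\partial J}{\partial W^{(2)}} = \bigl(a^{(2)}\bigr)^{\top} \delta^{(3)}
\]

\[
\delta^{(2)} = \delta^{(3)} \bigl(W^{(2)}\bigr)^{\top} \odot f'\bigl(z^{(2)}\bigr), \qquad
\frac{\partial J}{\partial W^{(1)}} = X^{\top} \delta^{(2)}
\]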

Q & A

  • What is the purpose of using gradient descent in training a neural network?

    -Gradient descent is used to train a neural network by iteratively adjusting the weights to minimize the cost function, which represents the error in predictions.

  • How are the weights of a neural network distributed in the context of the script?

    -The weights are distributed across two matrices, W(1) and W(2), which are used in the computation of the gradient.

  • Why is it necessary to compute gradients ∂J/∂W(1) and ∂J/∂W(2) separately?

    -They are computed separately to simplify the process and because they are associated with different layers of the neural network, allowing for independent updates.

  • According to the script, why is it important that the gradient matrices ∂J/∂W(1) and ∂J/∂W(2) be the same size as W(1) and W(2)?

    -This ensures that each weight in the matrices W(1) and W(2) has a corresponding gradient value, which is necessary for updating the weights during gradient descent.

  • What rule of differentiation does the script mention for handling the sum in the cost function?

    -The script mentions the sum rule of differentiation, which states that the derivative of a sum is the sum of the derivatives.

  • How does the script suggest simplifying the computation of the gradient for a single example before considering the entire dataset?

    -The script suggests temporarily ignoring the summation and focusing on the derivative of the inside expression for a single example, then adding the individual derivative terms together later.

  • What is the role of the chain rule in the context of backpropagation as described in the script?

    -The chain rule is crucial in backpropagation as it allows for the computation of the gradient with respect to the weights by multiplying the derivative of the outer function by the derivative of the inner function.

  • Why is the derivative of the test scores (y) with respect to W considered to be zero?

    -The test scores (y) are constants and do not change with respect to the weights (W), hence their derivative is zero.

  • How is the derivative of ŷ with respect to W(2) computed according to the script?

    -The derivative of ŷ with respect to W(2) is computed by applying the chain rule, breaking it into ∂ŷ/∂z(3) times ∂z(3)/∂W(2), and using the derivative of the sigmoid activation function.

  • What does the term ∂z(3)/∂W(2) represent in the context of the script?

    -∂z(3)/∂W(2) represents the change in the third layer activity (z(3)) with respect to the weights in the second layer (W(2)), indicating the contribution of each weight to the activity.

  • How does the script explain the process of backpropagating the error to each weight?

    -The script explains that by multiplying the backpropagating error (δ(3)) by the activity on each synapse, the weights that contribute more to the error will have larger activations and thus larger gradient values, leading to more significant updates during gradient descent.

  • What is the significance of matrix multiplication in the context of computing gradients as described in the script?

    -Matrix multiplication is used to efficiently handle the summation of gradient terms across all examples, effectively aggregating the gradient information for all training samples.

  • How does the script suggest implementing the gradient descent update?

    -The script suggests updating the weights by subtracting the gradient from the weights, which moves the network in the direction that reduces the cost function.

Outlines

00:00

🧠 Gradient Descent for Neural Network Training

This paragraph explains the process of using gradient descent to train a neural network for predicting test scores based on sleep and study hours. It details the computation of gradients for two weight matrices, W(1) and W(2). The explanation involves the use of the sum rule in differentiation to simplify the gradient computation, the power rule, and the chain rule. The paragraph also discusses the derivative of the sigmoid activation function and how to apply it to compute the gradient with respect to the weights. It concludes with the concept of backpropagating the error through the network, emphasizing the importance of considering the dimensionality of the matrices involved.

05:03

📉 Batch Gradient Descent and Derivative Computation

The second paragraph delves into the concept of batch gradient descent, where each training example contributes to determining the direction of the weight update. It describes the coding of gradients using Python and Numpy methods. The paragraph continues with the computation of ∂J/∂W(1), explaining the derivative across synapses and the relationship between z(3) and a(2). It also covers the computation of ∂a(2)/∂z(2) using the derivative of the activation function. The final part of the paragraph discusses the computation of ∂z(2)/∂W(1) and how it relates to the input values. The paragraph concludes with the idea of stacking operations for deeper neural networks and the fundamental step of gradient descent in reducing cost, setting the stage for numerical gradient checking in future discussions.

Keywords

💡Gradient Descent

Gradient Descent is an optimization algorithm used to train machine learning models by iteratively moving in the direction that minimizes the cost function. In the context of the video, it is used to adjust the weights of a neural network to better predict test scores based on hours of sleep and study. The script mentions that gradient descent involves computing the derivative of the cost function with respect to the weights, which guides the update of the weights.

💡Cost Function

The Cost Function, also known as the objective function or loss function, measures the performance of the model. It calculates the difference between the predicted values and the actual values. In the script, the cost function is used to assess how well the neural network's predictions match the test scores, with the goal of minimizing this function through gradient descent.
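
As a concrete illustration, here is a minimal sketch of such a cost function in Python, assuming the squared-error cost used in this series, J = Σ ½(y − ŷ)²; the numbers below are made up, and the normalized-score format is only an assumption about the data:

```python
import numpy as np

def cost(y, y_hat):
    # Squared-error cost: half the sum of squared prediction errors.
    return 0.5 * np.sum((y - y_hat) ** 2)

# Three (made-up) normalized test scores vs. three predictions.
y = np.array([[0.75], [0.82], [0.93]])
y_hat = np.array([[0.70], [0.80], [0.90]])
print(cost(y, y_hat))  # 0.5 * (0.0025 + 0.0004 + 0.0009) ≈ 0.0019
```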

💡Weights

Weights in a neural network are the parameters that are adjusted during training to minimize the cost function. The script discusses how weights are spread across matrices W(1) and W(2), and how their gradients are computed to update the weights and improve the model's predictions.

💡Derivative

A derivative in calculus represents the rate at which a function's output changes as its input changes. In the video, derivatives are used to calculate the gradient of the cost function with respect to the weights, which is essential for the gradient descent algorithm to update the weights correctly.

💡Chain Rule

The Chain Rule is a fundamental calculus technique used to find the derivative of a composite function. The script emphasizes its importance in backpropagation, a method used to calculate the gradient of the cost function with respect to each weight by applying the chain rule iteratively through the layers of the neural network.

💡Backpropagation

Backpropagation is the process of computing the gradient of the loss function with respect to the weights by the chain rule, moving backward through the network. The script describes backpropagation as a key part of training neural networks, where the error is propagated back to each weight to update it.

💡Activation Function

An Activation Function introduces non-linearity into the model, allowing it to learn complex patterns. The script mentions the sigmoid activation function, which is used to transform the input of a neuron into an output that represents the probability of the neuron firing.

💡Sigmoid Function

The Sigmoid Function is a specific type of activation function that squashes the input values into a range between 0 and 1. The script discusses the derivative of the sigmoid function, which is crucial for backpropagation as it helps to calculate the gradient of the cost function.
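
For reference, the sigmoid and its derivative as used in the transcript's sigmoidPrime discussion (an editorial rendering of the formulas):

\[
f(z) = \frac{1}{1 + e^{-z}}, \qquad f'(z) = \frac{e^{-z}}{\bigl(1 + e^{-z}\bigr)^{2}}
\]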

💡Matrix Multiplication

Matrix Multiplication combines two matrices by taking sums of products of rows and columns, computing a whole set of weighted sums in one operation. In the script, matrix multiplication is used to multiply the activities from one layer by the weights to obtain the input for the next layer in the neural network.

💡Element-wise Multiplication

Element-wise Multiplication, also known as Hadamard product, is an operation that multiplies corresponding elements of two matrices. The script uses this concept when discussing how to apply the derivative of the sigmoid function to the error term in the backpropagation process.
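
A quick NumPy illustration of the difference between the two operations (a minimal sketch with arbitrary arrays):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[10.0, 20.0],
              [30.0, 40.0]])

print(np.multiply(A, B))  # element-wise (Hadamard) product: [[10, 40], [90, 160]]
print(np.dot(A, B))       # matrix product: [[70, 100], [150, 220]]
```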

💡Batch Gradient Descent

Batch Gradient Descent refers to the process of updating the weights of a model using the average gradient over the entire dataset. The script explains that during batch gradient descent, the gradient from each example is summed to determine the direction in which to update the weights.

Highlights

Using gradient descent to train a Neural Network for predicting test scores based on sleep and study hours.

Gradient descent requires an equation for the gradient ∂J/∂W and corresponding code.

Weights are spread across two matrices, W(1) and W(2), requiring separate gradient computations.

Gradient matrices ∂J/∂W(1) and ∂J/∂W(2) will match the size of W(1) and W(2).

The sum rule in differentiation is used to simplify the gradient computation.

Derivative of the cost function is computed for a single example before summing up.

The power rule and chain rule are applied to compute the first derivative.

The chain rule is essential for backpropagation, emphasizing its continuous application.

Derivative of y with respect to W is zero since y is a constant.

The derivative of ŷ with respect to W(2) involves the chain rule and the sigmoid activation function.

A sigmoidPrime function is introduced for the derivative of the sigmoid function.

∂z(3)/∂W(2) represents the change of the third layer activity with respect to W(2).

The calculus backpropagates the error to each weight, affecting the gradient descent update.

Careful attention to dimensionality is required when computing gradients.

The backpropagating error δ(3) is a 3x1 matrix resulting from element-wise multiplication.

Matrix multiplication with transposed a(2) accounts for the omitted summation.

Gradient descent updates are influenced by the gradient direction for each example.

Batch gradient descent aggregates the gradient 'votes' from all examples.

The gradient ∂J/∂W(1) is computed similarly to ∂J/∂W(2) but for the first layer.

Deep Neural Networks can be built by stacking these gradient computation operations.

Gradient descent moves weights in the direction that reduces the cost function.

Numerical gradient checking will be performed in the next session to verify the correctness of the math.

Transcripts

play00:00

Last time we decided to use gradient descent to train our Neural Network

play00:04

so it could make better predictions of your score on a test

play00:06

based on how many hours you slept and how many hours you studied the night before.

play00:11

To perform gradient descent we need an equation and some code for our gradient ∂J/∂W.

play00:18

Our weights W are spread across two matrices.

play00:21

W(1) and W(2)

play00:23

We'll separate our ∂J/∂W computation in the same way, by computing ∂J/∂W(1) and ∂J/∂W(2)

play00:32

independently.

play00:33

We should have just as many gradient values as weight values, so when we're done our matrices

play00:39

∂J/∂W(1) and ∂J/∂W(2) will be the same size as W(1) and W(2)

play00:47

Let's work on ∂J/∂W(2) first

play00:50

The sum in our cost function adds the error from each example to create an overall cost.

play00:56

We'll take advantage of the sum rule in differentiation which says that the derivative of the sum equals the sum of the derivatives.

play01:04

We can move our sigma (Σ) outside and just worry about the derivative of the inside expression first.

play01:11

To keep things simple, we'll temporarily forget about our summation.

play01:15

Once we've computed ∂J/∂W for a single example,

play01:18

we'll go back and add all our individual derivative terms together

play01:23

We can now evaluate our first derivative. The power rule tells us to bring down our exponent 2 and multiply.

play01:30

To finish our derivative we need to apply the chain rule.

play01:33

The chain rule tells us how to take the derivative of a function inside of a function

play01:38

and generally says that we take the derivative of the outside function and multiply it by the derivative of the inside function

play01:45

One way to express the chain rule is as the product of derivatives; this will come in very handy as we progress through backpropagation

play01:53

In fact a better name for backpropagation might be: Don't stop doing the chain rule, ever.

play02:00

We've taken the derivative of the outside of our cost function. Now we need to multiply it by the derivative of the inside.

play02:07

y is just our test scores which won't change so the derivative of y, a constant, with respect to W is zero

play02:15

y-hat (ŷ), on the other hand, does change with respect to W(2)

play02:19

So we'll apply the chain rule and multiply our results by minus ∂ŷ/∂W(2)

play02:25

We now need to think about the derivative of ŷ with respect to W(2)

play02:30

Equation (4) tells us that ŷ is our activation function of z(3) so it will be helpful to apply the chain rule again

play02:38

to break ∂ŷ/∂W(2) into

play02:41

∂ŷ/∂z(3) times ∂z(3)/∂W(2)

play02:45

to find the rate of change of ŷ with respect to z(3) we need to differentiate our sigmoid activation function

play02:52

with respect to z

play02:54

Now is a good time to add a new Python method for our derivative of our sigmoid function, sigmoidPrime

play03:00

Our derivative should be largest where the sigmoid function is steepest, at the value z=0
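
A minimal sketch of what such a sigmoidPrime method might look like; this is an editorial reconstruction from the description above, not the video's exact code:

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation: squashes z into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def sigmoidPrime(z):
    # Derivative of the sigmoid with respect to z.
    return np.exp(-z) / (1.0 + np.exp(-z)) ** 2

# The derivative is largest where the sigmoid is steepest, at z = 0.
print(sigmoidPrime(0.0))                         # 0.25
print(sigmoidPrime(np.array([-2.0, 0.0, 2.0])))
```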

play03:07

we can now replace ∂ŷ/∂z(3) with f-prime of z(3)

play03:13

Our final piece of the puzzle is ∂z(3)/∂W(2)

play03:17

This term represents the change of z, our third layer activity, with respect to the weights in the second layer.

play03:24

z(3) is the matrix product of our activities a(2) and our weights W(2).

play03:29

The activities from layer 2 are multiplied by their corresponding weights and added together to yield z(3)

play03:36

If we focus on a single synapse for a moment, we see a simple linear relationship between W and z where a is the slope.

play03:44

So for each synapse, ∂z/∂W(2) is just the activation a, on that synapse

play03:50

Another way to think about what the calculus is doing here is that it is backpropagating the error to each weight.

play03:57

By multiplying by the activity on each synapse, the weights that contribute more to the overall error will have larger activations,

play04:05

yield larger ∂J/∂W(2) values,

play04:07

and will be changed more when we perform gradient descent.

play04:11

We need to be careful with our dimensionality here, and if we're clever, we can take care of that summation we got rid of earlier.

play04:18

The first part of our equation, y-ŷ, is of the same dimension as our output data, 3x1.

play04:25

f-prime of z(3) is of the same size and our first operation is a scalar multiplication.

play04:31

Our resulting 3x1 matrix is referred to as the backpropagating error, δ(3)

play04:37

We determined that ∂z(3)/∂W(2) is equal to the activity of each synapse.

play04:42

Each value in δ(3) needs to be multiplied by each activity.

play04:47

We can achieve this by transposing a(2) and matrix multiplying by δ(3)

play04:52

What's cool here is that the matrix multiplication also takes care of our earlier omission.

play04:57

It adds up the ∂J/∂W terms across all our examples

play05:02

Another way to think about what's happening here is that each example our algorithm sees has a certain cost and a certain gradient.

play05:10

The gradient with respect to each example pulls our gradient descent algorithm in a certain direction.

play05:15

It's like every example gets a vote on which way is downhill and when we perform batch gradient descent,

play05:22

we just add together everyone's vote, call it downhill, and move in that direction.

play05:27

We'll code up our gradients in Python, in a new method, costFunctionPrime

play05:32

Numpy's .multiply() method performs element-wise multiplication and the .dot() method performs matrix multiplication
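
As a rough sketch of how those two NumPy methods combine to give ∂J/∂W(2) — a reconstruction in the spirit of the costFunctionPrime described above, not the video's verbatim code; the standalone function name and arguments are editorial, and a forward pass is assumed to have already produced yHat, z3, and a2:

```python
import numpy as np

def sigmoidPrime(z):
    # Derivative of the sigmoid activation, as sketched earlier.
    return np.exp(-z) / (1.0 + np.exp(-z)) ** 2

def dJdW2_for_batch(y, yHat, z3, a2):
    # Backpropagating error: element-wise product of -(y - yHat) and f'(z3),
    # one row per training example.
    delta3 = np.multiply(-(y - yHat), sigmoidPrime(z3))
    # a2.T @ delta3 applies dz(3)/dW(2) = a2 and sums over all examples,
    # yielding a gradient the same shape as W(2).
    return np.dot(a2.T, delta3)
```

For the 2-3-1 network in this series, a2 has one column per hidden unit, so the result comes out 3x1, matching W(2).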

play05:40

We now have one final term to compute, ∂J/∂W(1)

play05:44

The derivation begins the same way as before, by computing the derivative through our final layer,

play05:50

first ∂J/∂ŷ, then ∂ŷ/∂z(3).

play05:54

We now take the derivative across our synapses,

play05:56

which is a little different from our job last time, which was computing the derivative with respect to the weights on our synapses.

play06:03

There's still a nice linear relationship along each synapse,

play06:06

but now we're interested in the rate of change of z(3) with respect to a(2)

play06:11

Now the slope is just equal to the weight value for that synapse.

play06:15

We can achieve this mathematically by multiplying by W(2) transposed

play06:20

Our next term to work on is ∂a(2)/∂z(2).

play06:24

This step is just like the derivative across our layer 3 neurons, so we can just multiply by f-prime of z(2).

play06:32

Our final computation here is ∂z(2)/∂W(1)

play06:36

This is very similar to our ∂z(3)/∂W(2) computation. There is a simple linear

play06:41

relationship on the synapses between z and W(1). In this case though, the slope is the input value, x.

play06:48

We can use the same technique as last time and multiply by x transposed,

play06:53

effectively applying the derivative and adding our ∂J/∂W(1)s together

play06:57

across all our examples. All that's left is to code this equation in Python.
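
A sketch of that last piece in NumPy, continuing from the δ(3) computation above; the function name and arguments are editorial, not the video's verbatim code:

```python
import numpy as np

def sigmoidPrime(z):
    # Derivative of the sigmoid activation, as sketched earlier.
    return np.exp(-z) / (1.0 + np.exp(-z)) ** 2

def dJdW1_for_batch(X, delta3, W2, z2):
    # Backpropagate delta3 across the synapses (multiply by W2 transposed),
    # then through the layer-2 activation (element-wise f'(z2)).
    delta2 = np.dot(delta3, W2.T) * sigmoidPrime(z2)
    # X.T @ delta2 applies dz(2)/dW(1) = x and sums over all examples,
    # yielding a gradient the same shape as W(1).
    return np.dot(X.T, delta2)
```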

play07:02

What's cool here is that if we want to make a deeper neural network, we could just stack a bunch of these operations together.
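
One hedged way to write that stacking pattern for a deeper network, generalizing the two layers above (an editorial extrapolation, not something stated explicitly in the video): with a(1) = X, each layer repeats

\[
\delta^{(l)} = \delta^{(l+1)} \bigl(W^{(l)}\bigr)^{\top} \odot f'\bigl(z^{(l)}\bigr), \qquad
\frac{\partial J}{\partial W^{(l)}} = \bigl(a^{(l)}\bigr)^{\top} \delta^{(l+1)}
\]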

play07:09

So how should we change our Ws to decrease our cost?

play07:13

We can now compute ∂J/∂W, which tells us which way is uphill in our 9-dimensional optimization space.

play07:21

If we move this way by adding a scalar times our derivative to all of our weights,

play07:26

our cost will increase. And if we do the opposite,

play07:29

subtract our gradient from our weights, we will move downhill and reduce our cost.
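
A minimal sketch of that update step; the weight shapes match the 2-3-1 network in this series, the gradients here are random placeholders, and the learning-rate value is arbitrary:

```python
import numpy as np

np.random.seed(1)
W1, W2 = np.random.randn(2, 3), np.random.randn(3, 1)        # current weights
dJdW1, dJdW2 = np.random.randn(2, 3), np.random.randn(3, 1)  # placeholder gradients

learning_rate = 0.1  # arbitrary step size
W1 = W1 - learning_rate * dJdW1  # step against the gradient...
W2 = W2 - learning_rate * dJdW2  # ...moving downhill in cost (for a small enough step)
```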

play07:34

This simple step downhill is the core of gradient descent

play07:38

and a key part of how even very sophisticated learning algorithms are trained.

play07:43

Next time we'll perform numerical gradient checking to make sure our math is correct.

Related Tags
Machine Learning, Gradient Descent, Neural Networks, Cost Function, Backpropagation, Derivative Rules, Sigmoid Function, Chain Rule, Matrix Multiplication, Optimization