Backpropagation calculus | Chapter 4, Deep learning

3Blue1Brown
3 Nov 2017 · 10:17

Summary

TL;DR: This script delves into the formal aspects of backpropagation, emphasizing the chain rule from calculus and its application in neural networks. It explains how to compute the sensitivity of the cost function to changes in weights and biases, using a simple network model with a single neuron per layer for clarity. The explanation includes the computation of the relevant derivatives and the concept of propagating error backwards through the network, highlighting the iterative process that allows neural networks to learn and minimize cost.

Takeaways

  • 🧠 The video script provides a formal introduction to the backpropagation algorithm with a focus on the chain rule from calculus.
  • 📚 It emphasizes the importance of understanding how the cost function is sensitive to changes in weights and biases within a neural network.
  • 🌟 The example used is a simple network with one neuron per layer to illustrate the concepts clearly.
  • 🔄 Backpropagation involves computing the gradient of the cost function with respect to the network's parameters by applying the chain rule iteratively.
  • 📈 The derivative of the cost with respect to the weight is calculated by considering the impact of weight changes on the weighted sum (z), activation (a), and ultimately the cost (c).
  • 🤔 The script encourages viewers to pause and ponder the concepts, acknowledging that the material can be confusing.
  • 🎯 The goal is to efficiently minimize the cost function by adjusting weights and biases based on their sensitivity to the cost.
  • 🔢 For a network with multiple neurons per layer, the process involves keeping track of additional indices for the activations and weights.
  • 🌐 The derivative of the cost with respect to an activation in a previous layer is influenced by the weights connecting to the next layer.
  • 📊 The script explains how the cost function's sensitivity to the previous layer's activations is propagated backwards through the network.
  • 🚀 Understanding these chain rule expressions is crucial for grasping the learning mechanism of neural networks.

Q & A

  • What is the main goal of the video script?

    -The main goal of the video script is to explain how the chain rule from calculus is applied in the context of neural networks, particularly focusing on the backpropagation algorithm and how it helps in understanding the sensitivity of the cost function to various parameters.

  • What is the 'mantra' mentioned in the script and when should it be applied?

    -The 'mantra' mentioned in the script is to regularly pause and ponder. It should be applied when learning about complex topics like the chain rule and backpropagation, especially when they are initially confusing.

  • How is the network described in the beginning of the script?

    -The network described in the beginning of the script is extremely simple, consisting of layers with a single neuron each, determined by three weights and three biases.

  • What does the script mean when it refers to 'indexing' with superscripts and subscripts?

    -In the script, 'indexing' with superscripts and subscripts is used to differentiate between various elements such as the layers and neurons within the network. For example, the superscript L indicates the layer a neuron is in, while subscripts like k and j would be used to index specific neurons within layers L-1 and L, respectively.

  • What is the cost function for a single training example in the context of the script?

    -The cost function for a single training example is given by the formula (AL - y)^2, where AL is the activation of the last neuron and y is the desired output value.

  • How does the script describe the computation of the cost from the weighted sum (z) and the activation (a)?

    -The script describes the computation of the cost as a process where the weighted sum (z) is first computed, then passed through a nonlinear function to get the activation (a), and finally, this activation is used along with a constant y to compute the cost.
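That z → a → cost pipeline can be sketched in a few lines of Python. This is a minimal illustration, assuming a sigmoid nonlinearity; the function name `forward` is not from the video:

```python
import math

def forward(w, b, a_prev, y):
    """One link of the single-neuron-per-layer chain described in the video."""
    z = w * a_prev + b           # weighted sum
    a = 1 / (1 + math.exp(-z))   # sigmoid activation (one possible nonlinearity)
    c = (a - y) ** 2             # cost for this single training example
    return z, a, c
```

For example, with w = 0, b = 0, a_prev = 1 and desired output y = 0, the weighted sum is 0, the sigmoid gives an activation of 0.5, and the cost is 0.25.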

  • What is the chain rule as applied to the backpropagation algorithm?

    -The chain rule as applied to the backpropagation algorithm involves multiplying together the derivatives of the cost function with respect to each parameter (like weights and activations) to find the sensitivity of the cost to small changes in those parameters.
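That product of three derivatives can be sketched directly, again assuming a sigmoid nonlinearity (the function names are illustrative, not from the video):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def dcost_dweight(w, b, a_prev, y):
    """Chain rule: dC/dW = dZ/dW * dA/dZ * dC/dA, multiplied together."""
    z = w * a_prev + b
    a = sigmoid(z)
    dz_dw = a_prev           # since z = w * a_prev + b
    da_dz = a * (1 - a)      # derivative of the sigmoid
    dc_da = 2 * (a - y)      # derivative of the squared-error cost
    return dc_da * da_dz * dz_dw
```

At w = 0, b = 0, a_prev = 1, y = 0 this evaluates to 1 × 0.25 × 1 = 0.25.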

  • How does the derivative of the cost with respect to the weight (WL) depend on the previous neuron's activation (AL-1)?

    -The derivative of the cost with respect to the weight (WL) depends on the previous neuron's activation (AL-1) because the derivative of ZL with respect to WL is exactly AL-1. In other words, how much a small nudge to the weight influences the last layer depends on how strongly the previous neuron fires: a larger AL-1 means the same nudge to WL has a larger effect on the cost.

  • What is the significance of the term 'neurons-that-fire-together-wire-together' in the context of the script?

    -The term 'neurons-that-fire-together-wire-together' refers to the concept that when neurons in the network are activated together, the connections (weights) between them are strengthened. This is significant because it helps explain how the sensitivity of the cost function to the previous layer's activation influences the learning process.

  • How does the script describe the process of backpropagation?

    -The script describes backpropagation as a process where the sensitivity of the cost function to the activations of the previous layers is calculated by iterating the same chain rule idea backwards, starting from the output layer and moving towards the input layer.

  • What changes when moving from the simple network with one neuron per layer to a more complex network with multiple neurons per layer?

    -When moving from a simple network with one neuron per layer to a more complex one with multiple neurons per layer, the main change is the need to keep track of additional indices for the activations and weights. The equations essentially remain the same, but they appear more complex due to the increased number of indices.

  • How does the script explain the computation of the cost function for a network with multiple neurons per layer?

    -For a network with multiple neurons per layer, the cost function is computed by summing up the squares of the differences between the activations of the last layer (ALj) and the desired output (Yj). Each weight is now indexed by two indices (WLjk), indicating the connection between the kth neuron of the previous layer and the jth neuron of the current layer.
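A small sketch of those two formulas, with the weight layout as a list of rows (the function names and the list-of-lists representation are assumptions for illustration):

```python
def cost(a_last, y):
    """Multi-neuron cost: sum over j of (AL_j - y_j)^2."""
    return sum((aj - yj) ** 2 for aj, yj in zip(a_last, y))

def layer_z(w, b, a_prev):
    """Weighted sums z_j = sum_k w[j][k] * a_prev[k] + b[j],
    where w[j][k] connects neuron k of layer L-1 to neuron j of layer L."""
    return [sum(wjk * ak for wjk, ak in zip(w_row, a_prev)) + bj
            for w_row, bj in zip(w, b)]
```

Note the index order w[j][k] matches the WLjk convention from the script: row j for the current layer's neuron, column k for the previous layer's.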

Outlines

00:00

🧠 Understanding Backpropagation and the Chain Rule

This paragraph introduces the formal approach to understanding backpropagation through calculus, specifically focusing on the chain rule in the context of neural networks. It emphasizes the importance of pausing to ponder and clarifies that the chain rule is applied differently in machine learning compared to traditional calculus courses. The explanation begins with a simple network model, detailing the relationship between the cost function, weights, biases, and activations. The goal is to determine how changes in weights affect the cost function, and the paragraph outlines the process of computing derivatives with respect to these variables. The concept of 'neurons-that-fire-together-wire-together' is introduced to explain the impact of previous layer activations on the cost function.

05:03

📉 Calculating Derivatives and Propagating Errors

The second paragraph delves into the specifics of calculating derivatives for the cost function with respect to weights and biases, particularly focusing on the weight WL. It explains how the full cost function is derived by averaging the cost across multiple training examples. The paragraph also discusses the propagation of errors from the output layer back through the network, highlighting the iterative nature of backpropagation. The concept of the gradient vector and its components, the partial derivatives, are introduced. The paragraph simplifies the explanation by considering a network with multiple neurons per layer, explaining the indexing of weights and the computation of the cost function. It concludes by encouraging the viewer to appreciate the complexity of backpropagation and the learning process of neural networks.

Keywords

💡backpropagation

Backpropagation is a fundamental algorithm used in training artificial neural networks. It involves the calculation of the gradient of the loss function with respect to the weights by the chain rule, allowing for the efficient adjustment of weights to minimize the cost. In the context of the video, backpropagation is the core process that enables neural networks to learn from data by iteratively adjusting weights to reduce the error between predicted outputs and actual desired outputs.

💡chain rule

The chain rule is a fundamental concept in calculus that allows the computation of the derivative of a composite function. In the context of the video, the chain rule is used to calculate the sensitivity of the cost function to the weights and biases in a neural network. It involves breaking down the overall derivative into a product of several smaller derivatives, each representing the sensitivity of an intermediate variable to its preceding variables.

💡cost function

In machine learning, a cost function is a mathematical function that measures the difference between the predicted output and the actual desired output. The goal is to minimize this function to improve the accuracy of the model. In the video, the cost function is represented as the sum of squared errors between the network's output and the target values for a given training example.

💡neuron

A neuron in the context of neural networks represents a basic computational unit that processes inputs, weights them, adds a bias, and then typically passes the result through an activation function to produce an output. The video discusses a simple network with single neurons in each layer to illustrate the concepts of backpropagation and the chain rule.

💡activation function

An activation function is a mathematical function applied to the weighted sum of inputs (plus bias) in a neuron to introduce non-linearity into the neural network. Common activation functions include sigmoid and ReLU. In the video, the activation function is used to transform the weighted sum (z) to produce the activation of a neuron, which is then used in the calculation of the cost function.

💡weights and biases

In neural networks, weights are the numerical values that determine the strength of the connection between neurons, while biases are additional terms added to the weighted sum before the activation function to help the model fit the data better. The video focuses on understanding how changes in weights and biases affect the cost function and uses this understanding to perform backpropagation.

💡sigmoid function

The sigmoid function is a type of activation function that maps any real-valued number into (0,1). It is an S-shaped curve that helps in converting the input into a form that can be interpreted as a probability. In the context of the video, the sigmoid function is mentioned as one of the possible nonlinear functions that can be used in the activation of a neuron.

💡ReLU (Rectified Linear Unit)

ReLU is another popular activation function used in neural networks, defined as f(x) = max(0, x). It has the advantage of being computationally efficient and helping to mitigate the vanishing gradient problem. The video mentions ReLU as an example of the special nonlinear function that can be used in place of the sigmoid function.
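For concreteness, the two nonlinearities mentioned, together with the derivatives that play the role of dA/dZ in the chain rule, can be written as follows (a sketch, not code from the video):

```python
import math

def sigmoid(z):
    """S-shaped squashing function mapping any real number into (0, 1)."""
    return 1 / (1 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1 - s)

def relu(z):
    """Rectified Linear Unit: f(z) = max(0, z)."""
    return max(0.0, z)

def relu_prime(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0
```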

💡gradient

In the context of optimization, a gradient is a vector of partial derivatives that points in the direction of the greatest increase of a function. In machine learning, the gradient of the cost function with respect to the model parameters (weights and biases) is used to update these parameters in the direction that minimizes the cost. The video explains how the chain rule expressions derived during backpropagation contribute to calculating each component of the gradient.
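How the gradient gets used can be sketched in one line, assuming plain gradient descent with a hypothetical learning rate `lr` (the video does not specify an update rule at this level of detail):

```python
def gradient_step(params, grads, lr=0.1):
    """One step downhill: nudge each weight/bias against its partial derivative."""
    return [p - lr * g for p, g in zip(params, grads)]
```

For example, a parameter of 1.0 with a partial derivative of 2.0 and lr = 0.5 steps to 0.0.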

💡training example

A training example in machine learning is a single data point or instance in the training dataset. It typically consists of input features and the corresponding desired output. The video script discusses the cost of the network for a single training example, which is calculated based on the difference between the network's output and the desired output for that example.

💡nonlinear function

A nonlinear function is a mathematical function that does not have a linear relationship between its input and output. In neural networks, nonlinear functions like the sigmoid or ReLU are used to introduce the ability to model complex relationships between inputs and outputs. The video emphasizes the importance of the choice of nonlinearity in the activation function of neurons.

Highlights

The main goal is to understand the chain rule in the context of networks, which differs from standard calculus teaching.

We begin with a simple network model, each layer consisting of a single neuron.

The network's cost sensitivity to variables like weights and biases is explored to optimize these parameters.

Activation of the last neuron is labeled with a superscript L, indicating its layer.

The cost for a single training example is defined as (AL - y)^2.

The weighted sum input to the activation function is denoted as z, with the same superscript as the activation.

The derivative of the cost with respect to the weight WL is determined by the chain rule.

The derivative of the cost with respect to the last activation AL is 2(AL - y).

The derivative of AL with respect to ZL is the derivative of the chosen nonlinearity function.

The derivative of ZL with respect to WL is AL-1, indicating the influence of the previous neuron's strength.

The full cost function involves averaging the cost across all training examples.

The gradient vector is composed of partial derivatives of the cost function with respect to all weights and biases.

The sensitivity to the bias is simpler: the derivative of ZL with respect to the bias comes out to be 1.

One term in the chain rule expression captures the cost function's sensitivity to the previous layer's activation: the derivative of ZL with respect to AL-1 is the weight WL.

In networks with multiple neurons per layer, the equations remain similar but require more indices for clarity.

The derivative of the cost with respect to activations in the second-to-last layer is influenced by multiple paths.

Understanding the cost sensitivity to the second-to-last layer allows for the application of the same process to previous layers.

The chain rule expressions provide the derivatives that form the gradient used for minimizing the network's cost.

Backpropagation is the mechanism behind neural network learning, and grasping its core concepts is challenging but essential.

Transcripts

play00:04

The hard assumption here is that you've watched part 3,

play00:06

giving an intuitive walkthrough of the backpropagation algorithm.

play00:11

Here we get a little more formal and dive into the relevant calculus.

play00:14

It's normal for this to be at least a little confusing,

play00:17

so the mantra to regularly pause and ponder certainly applies as much here

play00:20

as anywhere else.

play00:21

Our main goal is to show how people in machine learning commonly think about

play00:25

the chain rule from calculus in the context of networks,

play00:28

which has a different feel from how most introductory calculus courses

play00:32

approach the subject.

play00:34

For those of you uncomfortable with the relevant calculus,

play00:37

I do have a whole series on the topic.

play00:39

Let's start off with an extremely simple network,

play00:43

one where each layer has a single neuron in it.

play00:46

This network is determined by three weights and three biases,

play00:49

and our goal is to understand how sensitive the cost function is to these variables.

play00:55

That way, we know which adjustments to those terms will

play00:58

cause the most efficient decrease to the cost function.

play01:01

And we're just going to focus on the connection between the last two neurons.

play01:05

Let's label the activation of that last neuron with a superscript L,

play01:10

indicating which layer it's in, so the activation of the previous neuron is Al-1.

play01:16

These are not exponents, they're just a way of indexing what we're talking about,

play01:20

since I want to save subscripts for different indices later on.

play01:23

Let's say that the value we want this last activation to be for

play01:28

a given training example is y, for example, y might be 0 or 1.

play01:32

So the cost of this network for a single training example is (AL minus y) squared.

play01:40

We'll denote the cost of that one training example as c0.

play01:45

As a reminder, this last activation is determined by a weight,

play01:50

which I'm going to call WL, times the previous neuron's activation plus some bias,

play01:55

which I'll call BL.

play01:57

And then you pump that through some special nonlinear function like the sigmoid or ReLU.

play02:01

It's actually going to make things easier for us if we give a special name to

play02:05

this weighted sum, like z, with the same superscript as the relevant activations.

play02:10

This is a lot of terms, and a way you might conceptualize it is that the weight,

play02:15

previous activation and the bias all together are used to compute z,

play02:19

which in turn lets us compute a, which finally, along with a constant y,

play02:23

lets us compute the cost.

play02:27

And of course Al-1 is influenced by its own weight and bias and such,

play02:31

but we're not going to focus on that right now.

play02:35

All of these are just numbers, right?

play02:38

And it can be nice to think of each one as having its own little number line.

play02:41

Our first goal is to understand how sensitive the

play02:45

cost function is to small changes in our weight WL.

play02:49

Or phrased differently, what is the derivative of c with respect to WL?

play02:55

When you see this del W term, think of it as meaning some tiny nudge to W,

play03:00

like a change by 0.01, and think of this del c term as meaning

play03:05

whatever the resulting nudge to the cost is.

play03:08

What we want is their ratio.

play03:11

Conceptually, this tiny nudge to WL causes some nudge to ZL,

play03:15

which in turn causes some nudge to AL, which directly influences the cost.

play03:23

So we break things up by first looking at the ratio of a tiny change to

play03:28

ZL to this tiny change W, that is, the derivative of ZL with respect to WL.

play03:33

Likewise, you then consider the ratio of the change to AL to

play03:37

the tiny change in ZL that caused it, as well as the ratio

play03:40

between the final nudge to c and this intermediate nudge to AL.

play03:45

This right here is the chain rule, where multiplying together these

play03:50

three ratios gives us the sensitivity of c to small changes in WL.
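One way to sanity-check this product of ratios is to compare it against the literal "nudge ratio" described above, using a tiny finite nudge. The specific numbers below (w = 0.7, sigmoid nonlinearity, b = 0, a_prev = 1, y = 0) are arbitrary choices for illustration:

```python
import math

def cost(w, b=0.0, a_prev=1.0, y=0.0):
    a = 1 / (1 + math.exp(-(w * a_prev + b)))
    return (a - y) ** 2

w, eps = 0.7, 1e-6
# "Nudge ratio": resulting nudge to the cost divided by the tiny nudge to w
numeric = (cost(w + eps) - cost(w - eps)) / (2 * eps)

# Chain rule: dC/dA * dA/dZ * dZ/dW
z = w * 1.0 + 0.0
a = 1 / (1 + math.exp(-z))
analytic = 2 * (a - 0.0) * (a * (1 - a)) * 1.0
```

The two values agree to within the accuracy of the finite-difference approximation, which is exactly what the chain rule promises.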

play03:56

So on screen right now, there's a lot of symbols,

play03:59

and take a moment to make sure it's clear what they all are,

play04:02

because now we're going to compute the relevant derivatives.

play04:07

The derivative of c with respect to AL works out to be 2(AL minus y).

play04:13

Notice this means its size is proportional to the difference between the

play04:18

network's output and the thing we want it to be, so if that output was very different,

play04:22

even slight changes stand to have a big impact on the final cost function.

play04:27

The derivative of AL with respect to ZL is just the derivative of our sigmoid function,

play04:33

or whatever nonlinearity you choose to use.

play04:37

And the derivative of ZL with respect to WL comes out to be AL-1.

play04:45

Now I don't know about you, but I think it's easy to get stuck head down in the

play04:49

formulas without taking a moment to sit back and remind yourself of what they all mean.

play04:53

In the case of this last derivative, the amount that the small nudge to the

play04:58

weight influenced the last layer depends on how strong the previous neuron is.

play05:03

Remember, this is where the neurons-that-fire-together-wire-together idea comes in.

play05:09

And all of this is the derivative with respect to WL

play05:12

only of the cost for a specific single training example.

play05:16

Since the full cost function involves averaging together all

play05:19

those costs across many different training examples,

play05:23

its derivative requires averaging this expression over all training examples.

play05:28

And of course, that is just one component of the gradient vector,

play05:31

which itself is built up from the partial derivatives of the

play05:35

cost function with respect to all those weights and biases.

play05:40

But even though that's just one of the many partial derivatives we need,

play05:43

it's more than 50% of the work.

play05:46

The sensitivity to the bias, for example, is almost identical.

play05:50

We just need to change out this del z del w term for a del z del b.

play05:58

And if you look at the relevant formula, that derivative comes out to be 1.
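A sketch of that near-identical computation, where the only change from the weight case is swapping dZ/dW = AL-1 for dZ/dB = 1 (function name and sigmoid choice are illustrative):

```python
import math

def grads(w, b, a_prev, y):
    """Return (dC/dW, dC/dB) for one link of the chain, single training example."""
    a = 1 / (1 + math.exp(-(w * a_prev + b)))
    dc_da = 2 * (a - y)
    da_dz = a * (1 - a)
    dc_dw = dc_da * da_dz * a_prev  # dZ/dW = a_prev
    dc_db = dc_da * da_dz * 1.0     # dZ/dB = 1
    return dc_dw, dc_db
```

Notice that when a_prev happens to be 1, the two sensitivities coincide.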

play06:06

Also, and this is where the idea of propagating backwards comes in,

play06:10

you can see how sensitive this cost function is to the activation of the previous layer.

play06:15

Namely, this initial derivative in the chain rule expression,

play06:20

the sensitivity of z to the previous activation, comes out to be the weight WL.

play06:26

And again, even though we're not going to be able to directly influence

play06:30

that previous layer activation, it's helpful to keep track of,

play06:33

because now we can just keep iterating this same chain rule idea backwards

play06:38

to see how sensitive the cost function is to previous weights and previous biases.

play06:43

And you might think this is an overly simple example, since all layers have one neuron,

play06:47

and things are going to get exponentially more complicated for a real network.

play06:51

But honestly, not that much changes when we give the layers multiple neurons,

play06:55

really it's just a few more indices to keep track of.

play06:59

Rather than the activation of a given layer simply being AL,

play07:02

it's also going to have a subscript indicating which neuron of that layer it is.

play07:07

Let's use the letter k to index the layer L-1, and j to index the layer L.

play07:15

For the cost, again we look at what the desired output is,

play07:18

but this time we add up the squares of the differences between these last layer

play07:23

activations and the desired output.

play07:26

That is, you take a sum over (ALj minus Yj) squared.

play07:33

Since there's a lot more weights, each one has to have a couple

play07:36

more indices to keep track of where it is, so let's call the weight

play07:41

of the edge connecting this kth neuron to the jth neuron, WLjk.

play07:45

Those indices might feel a little backwards at first,

play07:48

but it lines up with how you'd index the weight matrix I talked about in the part 1 video.

play07:53

Just as before, it's still nice to give a name to the relevant weighted sum,

play07:57

like z, so that the activation of the last layer is just your special function,

play08:02

like the sigmoid, applied to z.

play08:04

You can see what I mean, where all of these are essentially the same equations we had

play08:09

before in the one-neuron-per-layer case, it's just that it looks a little more

play08:13

complicated.

play08:15

And indeed, the chain-ruled derivative expression describing how

play08:19

sensitive the cost is to a specific weight looks essentially the same.

play08:23

I'll leave it to you to pause and think about each of those terms if you want.

play08:28

What does change here, though, is the derivative of the

play08:32

cost with respect to one of the activations in the layer L-1.

play08:37

In this case, the difference is that the neuron influences

play08:40

the cost function through multiple different paths.

play08:44

That is, on the one hand, it influences AL0, which plays a role in the cost function,

play08:50

but it also has an influence on AL1, which also plays a role in the cost function,

play08:55

and you have to add those up.
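Adding up those paths can be sketched as follows, where dc_dz[j] stands for the already-computed sensitivity of the cost to ZLj for each neuron j of the last layer (a hypothetical helper, not code from the video):

```python
def dcost_daprev(w, dc_dz):
    """dC/dAL-1_k = sum over j of w[j][k] * dC/dZL_j.
    Each output neuron j contributes one path from the previous
    activation k to the cost, and the paths add."""
    n_prev = len(w[0])
    return [sum(w[j][k] * dc_dz[j] for j in range(len(w)))
            for k in range(n_prev)]
```

For instance, with two output neurons each having unit sensitivity, a previous activation's total influence is just the sum of the weights leaving it.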

play08:59

And that, well, that's pretty much it.

play09:03

Once you know how sensitive the cost function is to the

play09:06

activations in this second-to-last layer, you can just repeat

play09:09

the process for all the weights and biases feeding into that layer.

play09:13

So pat yourself on the back!

play09:15

If all of this makes sense, you have now looked deep into the heart of backpropagation,

play09:20

the workhorse behind how neural networks learn.

play09:23

These chain rule expressions give you the derivatives that determine each component in

play09:28

the gradient that helps minimize the cost of the network by repeatedly stepping downhill.

play09:34

If you sit back and think about all that, this is a lot of layers of complexity to

play09:38

wrap your mind around, so don't worry if it takes time for your mind to digest it all.


Related Tags
Machine Learning, Backpropagation, Calculus, Neural Networks, Cost Function, Chain Rule, Sensitivity Analysis, Weight Adjustment, Learning Process, Deep Learning