What is backpropagation really doing? | Chapter 3, Deep learning

3Blue1Brown
3 Nov 2017, 12:47

Summary

TLDR: This script delves into backpropagation, the core algorithm behind how neural networks learn. It offers an intuitive explanation without formulas and points to a follow-up video on the underlying calculus for those interested. The video covers the process of adjusting weights and biases to minimize a cost function, using the example of recognizing handwritten digits. It also touches on the practical aspects of training with large datasets, like MNIST, and the use of mini-batches for efficient stochastic gradient descent.

Takeaways

  • 🧠 Backpropagation is the algorithm that enables neural networks to learn by calculating the gradient of the cost function with respect to the weights and biases.
  • 📈 The cost function measures the error of the network's predictions against the actual target values, and minimizing this function is the goal of learning.
  • 🔍 An intuitive understanding of backpropagation involves recognizing how each training example influences the adjustments of weights and biases to decrease the cost.
  • 🤖 The script emphasizes the importance of understanding the role of each component in the backpropagation process, despite the complexity of the notation and calculations.
  • 🔒 The magnitude of the gradient components indicates the sensitivity of the cost function to changes in the corresponding weights and biases.
  • 📉 The negative gradient points in the direction that will most efficiently reduce the cost, guiding the adjustments made during gradient descent.
  • 🔄 The process of backpropagation involves moving backwards through the network, calculating the desired changes in activations and weights from the output layer to the input layer.
  • 🔗 The concept of Hebbian theory is mentioned as a loose analogy to how weights are adjusted in neural networks, with stronger connections forming between co-active neurons.
  • 💡 The adjustments to weights and biases are proportional to the influence they have on the cost function, with larger adjustments made to more influential weights.
  • 📚 The script suggests that a deep understanding of backpropagation requires both an intuitive grasp of the process and a mathematical understanding of the underlying calculus.
  • 💻 Practical implementation of backpropagation often uses mini-batches of training data for computational efficiency, a technique known as stochastic gradient descent.

Q & A

  • What is the core algorithm behind how neural networks learn?

    -The core algorithm behind how neural networks learn is backpropagation.

  • What is the purpose of backpropagation in neural networks?

    -The purpose of backpropagation is to compute the gradient of the cost function with respect to the weights and biases, which indicates how these parameters should be adjusted to minimize the cost and improve the network's performance.

  • How is the cost function defined for a single training example in the context of neural networks?

    -The cost for a single training example is defined as the sum of the squares of the differences between the network's output and the desired output, for each component.

  • What is gradient descent and how does it relate to learning in neural networks?

    -Gradient descent is an optimization algorithm used to minimize a cost function by iteratively moving in the direction of the negative gradient. In neural networks, learning involves finding the weights and biases that minimize the cost function using gradient descent.

  • How does the magnitude of the gradient vector components relate to the sensitivity of the cost function to changes in weights and biases?

    -The magnitude of each component in the gradient vector indicates the sensitivity of the cost function to changes in the corresponding weight or bias. A larger magnitude means the cost function is more sensitive to changes in that particular weight or bias.

  • What is the role of the activation function in determining the output of a neuron in a neural network?

    -The activation function, such as sigmoid or ReLU, is used to transform the weighted sum of inputs to a neuron, along with a bias, into the neuron's output. It introduces non-linearity into the network, allowing it to learn complex patterns.

  • What is the Hebbian theory in neuroscience, and how is it related to the learning process in neural networks?

    -The Hebbian theory in neuroscience suggests that neurons that fire together wire together, meaning that connections between co-active neurons strengthen. This concept loosely relates to the learning process in neural networks, where weights between active neurons are increased during training.

  • Why is it necessary to adjust the weights and biases in proportion to their influence on the cost function?

    -Adjusting the weights and biases in proportion to their influence on the cost function ensures that the most significant changes are made to the parameters that have the greatest impact on reducing the cost, leading to more efficient learning.

  • What is the concept of 'propagating backwards' in the context of backpropagation?

    -The concept of 'propagating backwards' refers to the process of moving through the network from the output layer to the input layer, calculating the desired changes in weights and biases based on the error at each layer, which is then used to update the parameters.

  • What is a mini-batch in the context of stochastic gradient descent, and why is it used?

    -A mini-batch is a subset of the training data used to compute a step in stochastic gradient descent. It is used to approximate the gradient of the cost function and to provide computational efficiency by not needing to process the entire dataset for each update.

  • What is the significance of having a large amount of labeled training data in machine learning?

    -Having a large amount of labeled training data is crucial for machine learning because it allows the model to learn from a diverse set of examples, improving its ability to generalize and perform well on unseen data.

Outlines

00:00

🧠 Understanding Backpropagation and Neural Networks

This paragraph introduces the concept of backpropagation, which is the fundamental learning algorithm for neural networks. It emphasizes the importance of understanding the neural network structure, the process of feeding forward information, and the concept of gradient descent. The explanation begins with an intuitive walkthrough of the algorithm without delving into complex formulas, and it sets the stage for a deeper mathematical exploration in a subsequent video. The paragraph also explains the cost function, which is used to measure the performance of the network, and how learning involves adjusting weights and biases to minimize this cost. The negative gradient of the cost function, which indicates the direction for efficient cost reduction, is highlighted as a key component of backpropagation. The explanation aims to clarify the intuition behind the algorithm, emphasizing the sensitivity of the cost function to changes in weights and biases, and how the desired nudge to each output activation is proportional to how far that activation is from its target value.

05:00

🔍 Deep Dive into Backpropagation Mechanics

This paragraph delves deeper into the mechanics of backpropagation, focusing on how a single training example influences the adjustment of weights and biases within a neural network. It discusses the three avenues for increasing a neuron's activation: adjusting the bias, modifying the weights, and changing the activations from the previous layer. The paragraph highlights the differential impact of weights based on the activation levels of the preceding layer, drawing a parallel to Hebbian theory in neuroscience. The concept of propagating desired changes backward through the network is introduced, where the adjustments from the output layer are cascaded back to the earlier layers. The process of averaging these adjustments over multiple training examples to approximate the negative gradient of the cost function is explained, setting the groundwork for the practical application of backpropagation in training neural networks.

10:03

📈 Stochastic Gradient Descent and Training Data Importance

The final paragraph discusses the practical implementation of backpropagation through stochastic gradient descent, a technique that uses mini-batches of training data to approximate the gradient of the cost function. This method is presented as a computationally efficient alternative to using the entire dataset for each gradient descent step. The paragraph emphasizes the importance of having a large amount of labeled training data, using the MNIST database of handwritten digits as an example, and touches on the common challenge of acquiring sufficient training data in machine learning. It concludes by summarizing the key points of backpropagation, its role in adjusting the weights and biases of a neural network, and the iterative process of convergence towards a local minimum of the cost function.

Keywords

💡 Backpropagation

Backpropagation is a fundamental algorithm used in training neural networks. It involves the calculation of gradients of a loss function with respect to the weights of the network. The script explains backpropagation as the process of determining how a single training example influences the adjustment of weights and biases in the network. It is crucial for understanding how neural networks learn from data, as it allows the network to efficiently decrease the cost function by adjusting parameters in the direction that most effectively reduces the error.
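
The backward pass described here can be sketched for a very small network. This is a minimal illustration, not the video's own code: it assumes sigmoid activations, the squared-difference cost, one hidden layer, and the standard chain-rule expressions that the video defers to its follow-up. All names (`forward`, `backprop`, `W1`, `b1`, and so on) and all numbers are made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Feed x forward through one hidden layer and one output layer."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y = [sigmoid(sum(w * hi for w, hi in zip(row, h)) + b) for row, b in zip(W2, b2)]
    return h, y

def cost(y, target):
    """Squared-difference cost for a single training example."""
    return sum((yi - ti) ** 2 for yi, ti in zip(y, target))

def backprop(x, target, W1, b1, W2, b2):
    """Gradient of the single-example cost for every weight and bias."""
    h, y = forward(x, W1, b1, W2, b2)
    # Output layer: how sensitive the cost is to each output neuron's weighted sum.
    delta2 = [2.0 * (yi - ti) * yi * (1.0 - yi) for yi, ti in zip(y, target)]
    # Propagate backwards: each hidden neuron adds up what every output neuron
    # "wants" from it, in proportion to the connecting weight.
    delta1 = [
        sum(W2[k][j] * delta2[k] for k in range(len(W2))) * h[j] * (1.0 - h[j])
        for j in range(len(h))
    ]
    # A weight's gradient is the downstream delta times the upstream activation.
    grad_W2 = [[d * hj for hj in h] for d in delta2]
    grad_W1 = [[d * xi for xi in x] for d in delta1]
    # The bias gradients are the deltas themselves.
    return grad_W1, delta1, grad_W2, delta2

# Hypothetical toy values: 2 inputs, 2 hidden neurons, 2 outputs.
x, target = [0.5, -0.2], [1.0, 0.0]
W1, b1 = [[0.1, -0.3], [0.4, 0.2]], [0.0, 0.1]
W2, b2 = [[0.3, -0.1], [-0.2, 0.5]], [0.05, -0.05]
grad_W1, grad_b1, grad_W2, grad_b2 = backprop(x, target, W1, b1, W2, b2)
```

Stepping each parameter a small amount against its gradient entry is what one gradient descent update for this single example would look like.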

💡 Neural Networks

Neural networks are a set of algorithms designed to recognize patterns. They are inspired by the human brain and are composed of interconnected nodes or 'neurons'. In the script, the focus is on a neural network with multiple layers, including input, hidden, and output layers, used for recognizing handwritten digits. The network's architecture and the flow of information through it are central to the video's theme of explaining how the network learns through backpropagation.
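
As a small worked count, the layer sizes given in the transcript (784 inputs, two hidden layers of 16 neurons, 10 outputs) account for the "13,000 dimensions" mentioned there. The sketch assumes every neuron is connected to every neuron in the next layer, which is how the earlier videos in the series present the network.

```python
# Layer sizes as described in the transcript.
layer_sizes = [784, 16, 16, 10]

# One weight per connection between consecutive layers (assumed fully connected).
num_weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# One bias per neuron in every layer except the input layer.
num_biases = sum(layer_sizes[1:])

num_parameters = num_weights + num_biases
print(num_weights, num_biases, num_parameters)  # 12960 42 13002
```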

💡 Gradient Descent

Gradient descent is an optimization algorithm used to minimize a function by iteratively moving in the direction of the steepest descent, as defined by the negative of the gradient. In the context of the video, gradient descent is used to minimize the cost function of a neural network by adjusting the weights and biases. The script emphasizes the importance of finding the negative gradient of the cost function to efficiently decrease it, which is a key part of the backpropagation process.
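
A one-variable sketch may make the iteration concrete. The cost function, starting point, learning rate, and step count below are all hypothetical choices for illustration; a real network applies the same update to every weight and bias at once.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient of a one-variable function."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)  # move in the direction of the negative gradient
    return x

# Toy cost C(x) = (x - 3)^2, whose derivative is 2 * (x - 3) and whose minimum is at x = 3.
minimum = gradient_descent(lambda x: 2.0 * (x - 3.0), start=0.0)
print(round(minimum, 4))  # 3.0
```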

💡 Cost Function

A cost function, in the context of machine learning, measures the performance of a model. It calculates the error between the predicted outputs and the actual target values. The script describes how the cost function is used in training neural networks, specifically by squaring the differences between the network's output and the desired output for each training example and averaging these values to get the total cost.
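
The two steps described above, squaring the differences for one example and then averaging over examples, can be written out directly. The function names and the three-component outputs are illustrative only; the digit network in the video has ten output components.

```python
def example_cost(output, target):
    """Sum of squared differences between output and target, component by component."""
    return sum((o - t) ** 2 for o, t in zip(output, target))

def total_cost(outputs, targets):
    """Average of the per-example costs over all training examples."""
    costs = [example_cost(o, t) for o, t in zip(outputs, targets)]
    return sum(costs) / len(costs)

# Hypothetical outputs for two training examples and their desired outputs.
outputs = [[0.5, 0.5, 0.0], [1.0, 0.0, 0.0]]
targets = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
print(total_cost(outputs, targets))  # 0.25
```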

💡 Activation

Activation in a neural network refers to the output of a neuron, which is determined by a weighted sum of inputs plus a bias, passed through an activation function like sigmoid or ReLU. The script discusses how the activations in the output layer are influenced by the weights and biases, and how changes in these parameters can adjust the activations to better match the target values during training.
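
The definition above translates almost word for word into code. In this sketch the weights, previous-layer activations, and bias are made-up numbers chosen so that the weighted sum comes out to exactly zero; `neuron_activation` and `squish` are illustrative names.

```python
import math

def sigmoid(z):
    """Squish any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Pass positive values through unchanged and clamp negatives to zero."""
    return max(0.0, z)

def neuron_activation(weights, prev_activations, bias, squish=sigmoid):
    """Weighted sum of the previous layer's activations, plus a bias, then squished."""
    z = sum(w * a for w, a in zip(weights, prev_activations)) + bias
    return squish(z)

weights = [2.0, -1.0, 0.5]
prev_activations = [1.0, 0.5, 0.0]
bias = -1.5
# Weighted sum: 2.0 * 1.0 - 1.0 * 0.5 + 0.5 * 0.0 - 1.5 = 0.0
print(neuron_activation(weights, prev_activations, bias))               # 0.5
print(neuron_activation(weights, prev_activations, bias, squish=relu))  # 0.0
```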

💡 Weights

In a neural network, weights are the numeric values that represent the strength of the connections between neurons. The script explains how the adjustment of weights is a critical part of the learning process, as they determine the influence of each neuron's activation on the subsequent layer. The backpropagation algorithm calculates how these weights should be changed to minimize the cost function.
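
The transcript's point that weights attached to brighter neurons give "the most bang for your buck" can be checked for a single neuron. The sketch below assumes a sigmoid neuron with the squared-difference cost and uses the standard chain-rule expression, which the video leaves to its follow-up. The name `weight_nudges` and all numbers are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weight_nudges(weights, prev_activations, bias, target):
    """Negative gradient of (a - target)^2 with respect to each weight of one sigmoid neuron."""
    z = sum(w * a for w, a in zip(weights, prev_activations)) + bias
    a = sigmoid(z)
    # Chain rule: dC/dw_i = 2 * (a - target) * sigmoid'(z) * prev_activation_i,
    # where sigmoid'(z) = a * (1 - a). Only the last factor differs between weights.
    shared = 2.0 * (a - target) * a * (1.0 - a)
    return [-shared * a_prev for a_prev in prev_activations]

# One bright, one dim, and one inactive neuron in the previous layer.
prev_activations = [0.9, 0.1, 0.0]
nudges = weight_nudges([0.0, 0.0, 0.0], prev_activations, 0.0, target=1.0)
print(nudges)
```

Because every weight shares the same factor apart from its incoming activation, the nudge for the bright neuron's weight is nine times the nudge for the dim one, and the weight from the inactive neuron is not nudged at all.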

💡 Biases

Biases are parameters in a neural network that allow the activation function to be shifted, providing an additional degree of freedom. The script mentions biases as part of the network's learnable parameters, which, along with weights, are adjusted during backpropagation to minimize the cost function and improve the network's predictions.

💡 Hebbian Theory

Hebbian theory, often summarized as 'neurons that fire together, wire together,' is a neuroscientific theory that suggests that connections between neurons strengthen when they are both activated at the same time. In the script, this theory is mentioned as a loose analogy to how neural networks strengthen connections between active neurons during learning, particularly when discussing weight adjustments in backpropagation.

💡 Mini-Batches

Mini-batches are subsets of the training dataset used in the training process to improve computational efficiency. The script describes how, instead of using the entire dataset to calculate the gradient for each step of gradient descent, mini-batches are used to approximate the gradient. This technique, known as stochastic gradient descent, allows for faster computation while still converging towards the minimum of the cost function.
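
The shuffle-then-divide step is short enough to sketch. The batch size of 100 follows the transcript; the function name, the `seed` parameter, and the use of plain integers as stand-ins for training examples are assumptions made for illustration.

```python
import random

def make_mini_batches(examples, batch_size, seed=None):
    """Shuffle a copy of the training examples and cut it into mini-batches."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]

# 1,000 placeholder examples split into batches of 100.
batches = make_mini_batches(range(1000), batch_size=100, seed=0)
print(len(batches), len(batches[0]))  # 10 100
```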

💡 Stochastic Gradient Descent

Stochastic gradient descent is a variant of the gradient descent algorithm that uses mini-batches to approximate the gradient of the cost function. The script explains that this method is employed to speed up the training process by taking quicker, less precise steps towards minimizing the cost function, as opposed to calculating the exact gradient using all training examples, which would be computationally expensive.
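
To show the whole loop in one place, here is a sketch that swaps the digit network for a deliberately tiny model with a single weight, so the gradient can be written by hand. The model, the noiseless data, and every hyperparameter are hypothetical; only the structure (shuffle, split into mini-batches, step once per mini-batch) follows the description above.

```python
import random

def sgd_fit_slope(data, batch_size=10, learning_rate=0.05, epochs=50, seed=0):
    """Fit w in the toy model y = w * x by mini-batch stochastic gradient descent."""
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # reshuffle before each pass through the data
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average gradient of (w * x - y)^2 over this mini-batch only,
            # an approximation of the gradient over the full data set.
            grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * grad
    return w

# Noiseless toy data generated from y = 2x.
data = [(x / 100.0, 2.0 * x / 100.0) for x in range(1, 101)]
w = sgd_fit_slope(data)
print(round(w, 3))  # 2.0
```

Each step uses only ten of the hundred examples, so individual steps are slightly off, but the weight still settles on the slope that generated the data.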

💡 MNIST Database

The MNIST database is a large dataset of handwritten digits that is commonly used for training various image processing systems. The script points out the importance of having a large, labeled dataset like MNIST for training neural networks. It emphasizes the challenge of obtaining sufficient training data, which is essential for the success of machine learning models.

Highlights

Backpropagation is the core algorithm for how neural networks learn.

An intuitive walkthrough of the algorithm without formulas is provided for better understanding.

Neural networks are explained with the classic example of recognizing handwritten digits.

The importance of understanding gradient descent for learning in neural networks is emphasized.

Learning involves finding weights and biases that minimize a cost function.

The cost function is calculated by averaging the squared differences between network output and desired output.

The negative gradient of the cost function indicates how to change weights and biases to decrease the cost efficiently.

Backpropagation computes the gradient for adjusting weights and biases.

The magnitude of each component of the gradient vector represents the sensitivity of the cost function to the corresponding weight or bias.

Notation and index chasing can be confusing in understanding backpropagation.

Each training example's effect on weight and bias adjustment is explained.

Desired changes in output activations are proportional to how far they are from target values.

Adjustments in weights and biases are influenced by the activation levels of the previous layer.

Hebbian theory is mentioned as a loose analogy to how artificial neural networks learn.

The process of propagating desired changes backward through the network is described.

Averaging desired changes from all training examples gives the negative gradient of the cost function.

Stochastic gradient descent is used for computational efficiency by using mini-batches of training data.

The necessity of a large amount of training data for machine learning algorithms is highlighted.

The MNIST database is cited as an example of a valuable resource for training data in machine learning.

Transcripts

play00:04

Here, we tackle backpropagation, the core algorithm behind how neural networks learn.

play00:09

After a quick recap for where we are, the first thing I'll do is an intuitive walkthrough

play00:13

for what the algorithm is actually doing, without any reference to the formulas.

play00:17

Then, for those of you who do want to dive into the math,

play00:20

the next video goes into the calculus underlying all this.

play00:23

If you watched the last two videos, or if you're just jumping in with the appropriate

play00:27

background, you know what a neural network is, and how it feeds forward information.

play00:31

Here, we're doing the classic example of recognizing handwritten digits whose pixel

play00:36

values get fed into the first layer of the network with 784 neurons,

play00:39

and I've been showing a network with two hidden layers having just 16 neurons each,

play00:44

and an output layer of 10 neurons, indicating which digit the network is choosing

play00:48

as its answer.

play00:50

I'm also expecting you to understand gradient descent,

play00:53

as described in the last video, and how what we mean by learning is

play00:56

that we want to find which weights and biases minimize a certain cost function.

play01:02

As a quick reminder, for the cost of a single training example,

play01:05

you take the output the network gives, along with the output you wanted it to give,

play01:10

and add up the squares of the differences between each component.

play01:15

Doing this for all of your tens of thousands of training examples and

play01:18

averaging the results, this gives you the total cost of the network.

play01:22

And as if that's not enough to think about, as described in the last video,

play01:26

the thing that we're looking for is the negative gradient of this cost function,

play01:30

which tells you how you need to change all of the weights and biases,

play01:34

all of these connections, so as to most efficiently decrease the cost.

play01:43

Backpropagation, the topic of this video, is an

play01:45

algorithm for computing that crazy complicated gradient.

play01:49

And the one idea from the last video that I really want you to hold

play01:52

firmly in your mind right now is that because thinking of the gradient

play01:56

vector as a direction in 13,000 dimensions is, to put it lightly,

play01:59

beyond the scope of our imaginations, there's another way you can think about it.

play02:04

The magnitude of each component here is telling you how

play02:07

sensitive the cost function is to each weight and bias.

play02:11

For example, let's say you go through the process I'm about to describe,

play02:15

and you compute the negative gradient, and the component associated with the weight on

play02:20

this edge here comes out to be 3.2, while the component associated with this edge here

play02:25

comes out as 0.1.

play02:26

The way you would interpret that is that the cost function is 32 times

play02:30

more sensitive to changes in that first weight,

play02:33

so if you were to wiggle that value just a little bit,

play02:35

it's going to cause some change to the cost, and that change is 32 times greater

play02:40

than what the same wiggle to that second weight would give.
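
The "32 times" figure is just the ratio of the two components. A quick check, using the two values quoted above and a hypothetical wiggle size, with the usual first-order estimate that the change in cost is roughly the component times the wiggle:

```python
# Negative-gradient components quoted in the transcript.
component_first_weight = 3.2
component_second_weight = 0.1

def cost_change(component, wiggle):
    """First-order estimate of how much the cost moves when one weight is wiggled."""
    return abs(component) * abs(wiggle)

wiggle = 0.001  # the same small change applied to either weight (illustrative size)
ratio = cost_change(component_first_weight, wiggle) / cost_change(component_second_weight, wiggle)
print(round(ratio, 6))  # 32.0
```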

play02:48

Personally, when I was first learning about backpropagation,

play02:51

I think the most confusing aspect was just the notation and the index chasing of it all.

play02:56

But once you unwrap what each part of this algorithm is really doing,

play02:59

each individual effect it's having is actually pretty intuitive,

play03:02

it's just that there's a lot of little adjustments getting layered on top of each other.

play03:07

So I'm going to start things off here with a complete disregard for the notation,

play03:11

and just step through the effects each training example has on the weights and biases.

play03:17

Because the cost function involves averaging a certain cost per example over

play03:21

all the tens of thousands of training examples,

play03:24

the way we adjust the weights and biases for a single gradient descent step also

play03:29

depends on every single example.

play03:31

Or rather, in principle it should, but for computational efficiency we'll do a little

play03:35

trick later to keep you from needing to hit every single example for every step.

play03:39

In any case, right now, all we're going to do is focus

play03:42

our attention on one single example, this image of a 2.

play03:46

What effect should this one training example have

play03:49

on how the weights and biases get adjusted?

play03:52

Let's say we're at a point where the network is not well trained yet,

play03:56

so the activations in the output are going to look pretty random,

play03:59

maybe something like 0.5, 0.8, 0.2, on and on.

play04:02

We can't directly change those activations, we

play04:04

only have influence on the weights and biases.

play04:07

But it's helpful to keep track of which adjustments

play04:10

we wish should take place to that output layer.

play04:13

And since we want it to classify the image as a 2,

play04:16

we want that third value to get nudged up while all the others get nudged down.

play04:22

Moreover, the sizes of these nudges should be proportional

play04:25

to how far away each current value is from its target value.

play04:30

For example, the increase to that number 2 neuron's activation

play04:33

is in a sense more important than the decrease to the number 8 neuron,

play04:37

which is already pretty close to where it should be.
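
As a sketch of those wished-for nudges: the first three activations below echo the values mentioned above, and the rest are made up. The target for an image of a 2 is 1.0 for the digit 2 neuron and 0.0 everywhere else, and each nudge is taken as the gap between target and current value.

```python
# Output activations of a not-yet-trained network; indices 0-2 follow the
# transcript, the remaining values are hypothetical.
output = [0.5, 0.8, 0.2, 0.6, 0.4, 0.6, 1.0, 0.0, 0.2, 0.1]

label = 2
target = [1.0 if digit == label else 0.0 for digit in range(10)]

# Positive means "nudge up", negative means "nudge down"; the size is the
# distance from the current value to its target.
nudges = [t - a for a, t in zip(output, target)]

for digit, nudge in enumerate(nudges):
    print(digit, round(nudge, 2))
```

With these numbers the digit 2 neuron asks for the largest increase, while the digit 8 neuron, already near zero, asks for only a small decrease.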

play04:42

So zooming in further, let's focus just on this one neuron,

play04:45

the one whose activation we wish to increase.

play04:48

Remember, that activation is defined as a certain weighted sum of all

play04:52

the activations in the previous layer, plus a bias,

play04:55

which is all then plugged into something like the sigmoid squishification function,

play05:00

or a ReLU.

play05:01

So there are three different avenues that can

play05:04

team up together to help increase that activation.

play05:07

You can increase the bias, you can increase the weights,

play05:10

and you can change the activations from the previous layer.

play05:14

Focusing on how the weights should be adjusted,

play05:17

notice how the weights actually have differing levels of influence.

play05:21

The connections with the brightest neurons from the preceding layer have the

play05:25

biggest effect since those weights are multiplied by larger activation values.

play05:31

So if you were to increase one of those weights,

play05:33

it actually has a stronger influence on the ultimate cost function than increasing

play05:38

the weights of connections with dimmer neurons,

play05:40

at least as far as this one training example is concerned.

play05:44

Remember, when we talk about gradient descent,

play05:46

we don't just care about whether each component should get nudged up or down,

play05:50

we care about which ones give you the most bang for your buck.

play05:55

This, by the way, is at least somewhat reminiscent of a theory in

play05:58

neuroscience for how biological networks of neurons learn, Hebbian theory,

play06:02

often summed up in the phrase, neurons that fire together wire together.

play06:07

Here, the biggest increases to weights, the biggest strengthening of connections,

play06:11

happens between neurons which are the most active,

play06:14

and the ones which we wish to become more active.

play06:17

In a sense, the neurons that are firing while seeing a 2 get more

play06:21

strongly linked to those firing when thinking about a 2.

play06:25

To be clear, I'm not in a position to make statements one way or another about

play06:29

whether artificial networks of neurons behave anything like biological brains,

play06:33

and this fires together wire together idea comes with a couple meaningful asterisks,

play06:37

but taken as a very loose analogy, I find it interesting to note.

play06:41

Anyway, the third way we can help increase this neuron's

play06:45

activation is by changing all the activations in the previous layer.

play06:49

Namely, if everything connected to that digit 2 neuron with a positive

play06:53

weight got brighter, and if everything connected with a negative weight got dimmer,

play06:57

then that digit 2 neuron would become more active.

play07:02

And similar to the weight changes, you're going to get the most bang for your buck

play07:06

by seeking changes that are proportional to the size of the corresponding weights.

play07:12

Now of course, we cannot directly influence those activations,

play07:15

we only have control over the weights and biases.

play07:17

But just as with the last layer, it's helpful

play07:20

to keep a note of what those desired changes are.

play07:24

But keep in mind, zooming out one step here, this

play07:26

is only what that digit 2 output neuron wants.

play07:29

Remember, we also want all the other neurons in the last layer to become less active,

play07:33

and each of those other output neurons has its own thoughts about

play07:37

what should happen to that second to last layer.

play07:42

So, the desire of this digit 2 neuron is added together with the

play07:46

desires of all the other output neurons for what should happen to this

play07:51

second to last layer, again in proportion to the corresponding weights,

play07:56

and in proportion to how much each of those neurons needs to change.

play08:01

This right here is where the idea of propagating backwards comes in.

play08:05

By adding together all these desired effects, you basically get a

play08:09

list of nudges that you want to happen to this second to last layer.

play08:14

And once you have those, you can recursively apply the same process to the

play08:17

relevant weights and biases that determine those values,

play08:20

repeating the same process I just walked through and moving backwards through the network.

play08:28

And zooming out a bit further, remember that this is all just how a single

play08:33

training example wishes to nudge each one of those weights and biases.

play08:37

If we only listened to what that 2 wanted, the network would

play08:40

ultimately be incentivized just to classify all images as a 2.

play08:44

So what you do is go through this same backprop routine for every other training example,

play08:49

recording how each of them would like to change the weights and biases,

play08:53

and average together those desired changes.

play09:01

This collection here of the averaged nudges to each weight and bias is,

play09:05

loosely speaking, the negative gradient of the cost function referenced

play09:10

in the last video, or at least something proportional to it.

play09:14

I say loosely speaking only because I have yet to get quantitatively precise

play09:18

about those nudges, but if you understood every change I just referenced,

play09:22

why some are proportionally bigger than others,

play09:24

and how they all need to be added together, you understand the mechanics for

play09:28

what backpropagation is actually doing.

play09:33

By the way, in practice, it takes computers an extremely long time to

play09:38

add up the influence of every training example every gradient descent step.

play09:43

So here's what's commonly done instead.

play09:45

You randomly shuffle your training data and then divide it into a whole

play09:48

bunch of mini-batches, let's say each one having 100 training examples.

play09:52

Then you compute a step according to the mini-batch.

play09:56

It's not going to be the actual gradient of the cost function,

play10:00

which depends on all of the training data, not this tiny subset,

play10:03

so it's not the most efficient step downhill, but each mini-batch does give

play10:07

you a pretty good approximation, and more importantly,

play10:09

it gives you a significant computational speedup.

play10:12

If you were to plot the trajectory of your network under the relevant cost surface,

play10:17

it would be a little more like a drunk man stumbling aimlessly down a hill but taking

play10:21

quick steps, rather than a carefully calculating man determining the exact downhill

play10:25

direction of each step before taking a very slow and careful step in that direction.

play10:31

This technique is referred to as stochastic gradient descent.

play10:35

There's a lot going on here, so let's just sum it up for ourselves, shall we?

play10:40

Backpropagation is the algorithm for determining how a single training

play10:44

example would like to nudge the weights and biases,

play10:47

not just in terms of whether they should go up or down,

play10:50

but in terms of what relative proportions of those changes cause the

play10:53

most rapid decrease to the cost.

play10:56

A true gradient descent step would involve doing this for all your tens of

play11:00

thousands of training examples and averaging the desired changes you get.

play11:04

But that's computationally slow, so instead you randomly subdivide the

play11:08

data into mini-batches and compute each step with respect to a mini-batch.

play11:14

Repeatedly going through all of the mini-batches and making these adjustments,

play11:17

you will converge towards a local minimum of the cost function,

play11:21

which is to say your network will end up doing a really good job on the training examples.

play11:27

So with all of that said, every line of code that would go into implementing backprop

play11:32

actually corresponds with something you have now seen, at least in informal terms.

play11:37

But sometimes knowing what the math does is only half the battle,

play11:40

and just representing the damn thing is where it gets all muddled and confusing.

play11:44

So for those of you who do want to go deeper, the next video goes through the same

play11:48

ideas that were just presented here, but in terms of the underlying calculus,

play11:52

which should hopefully make it a little more familiar as you see the topic in other

play11:55

resources.

play11:57

Before that, one thing worth emphasizing is that for this algorithm to work,

play12:00

and this goes for all sorts of machine learning beyond just neural networks,

play12:04

you need a lot of training data.

play12:06

In our case, one thing that makes handwritten digits such a nice example is that

play12:10

there exists the MNIST database, with so many examples that have been labeled by humans.

play12:15

So a common challenge that those of you working in machine learning will be familiar with

play12:19

is just getting the labeled training data you actually need,

play12:21

whether that's having people label tens of thousands of images,

play12:24

or whatever other data type you might be dealing with.


Related Tags
Backpropagation, Neural Networks, Machine Learning, Gradient Descent, Cost Function, MNIST Database, Hebbian Theory, Stochastic Descent, Artificial Intelligence, Learning Algorithm