But what is a neural network? | Chapter 1, Deep learning
Summary
TL;DR: This video script digs into the basic concepts and workings of neural networks, in particular how they recognize handwritten digits. It opens with a low-resolution image of the digit 3 and points out how effortlessly the human brain recognizes different digits even when their pixel values differ wildly. The script then introduces the challenge of writing a program that takes in a 28x28 pixel grid and outputs which digit (0 through 9) it shows, which motivates machine learning and neural networks. The goal of the video is to help viewers understand the structure of a neural network and visualize the math behind it, rather than treat it as a buzzword. Building up a simple network, the script explains how layers of neurons (each holding a number called its activation) come to recognize handwritten digits. Each output neuron represents one digit, and the network "learns" by adjusting the weights and biases on the connections between neurons. The video also discusses how the hidden layers might pick up on subcomponents of a digit, such as edges and loops. Finally, the script covers the role of the sigmoid function and ReLU (rectified linear unit) in neuron activations, and previews the next video, which will look at how the network learns appropriate weights and biases from data.
Takeaways
- 🧠 The brain recognizes digits in low-resolution images effortlessly, showing how remarkable human visual recognition is.
- 📈 Writing a program that recognizes the digit in a 28x28 pixel image is a daunting task, which highlights the importance of machine learning and neural networks.
- 🌟 The video presents the structure and learning process of a neural network from a mathematical point of view to help viewers understand it.
- 🔍 The design of neural networks is inspired by the brain, in particular by how neurons are linked together.
- 🔢 The input layer has one neuron per pixel, each holding that pixel's grayscale value.
- 📊 The output layer has 10 neurons, one per digit; each activation indicates how strongly the system thinks the image corresponds to that digit.
- 🤔 What the hidden layers do is not yet clear, but they play a key role in recognizing digits.
- 🔗 Activations in one layer determine the activations of the next; this is the heart of the network as an information-processing mechanism.
- 📉 Through training, the network can learn to combine pixels into edges, edges into patterns, and patterns into digits.
- 🎯 Weights and biases are the key parameters of the network; they determine how it responds to an input image.
- 🤝 Learning means adjusting many thousands of weights and biases to find a setting that solves the problem at hand.
- 📚 Linear algebra, especially matrix and vector multiplication, is key to understanding neural networks.
Q & A
Why can the human brain so easily recognize a 3 across different resolutions and styles?
-The visual cortex is remarkably good at recognizing and resolving visual patterns: even though the pixel values, and the light-sensitive cells that fire, differ greatly from one image to the next, the brain still resolves them all as the same digit 3.
Why is it so hard to write a program that recognizes handwritten digits on a 28x28 pixel grid?
-Because it requires an algorithm that can make sense of pixel patterns and map them onto the digit they represent, which involves tuning an enormous number of parameters and draws on ideas from deep learning.
How is the structure of a neural network inspired by the brain?
-It mimics networks of neurons in the brain: each neuron holds a number between 0 and 1, and information is processed through layers, loosely analogous to how groups of biological neurons firing cause others to fire.
How many neurons are in the first layer of the network, and what do they represent?
-The first layer has 784 neurons, one for each pixel of the 28x28 input image; each holds that pixel's grayscale value.
What do the 10 neurons in the output layer mean?
-Each output neuron represents one digit, from 0 to 9. Its activation indicates how strongly the system thinks the given image corresponds to that digit.
What role do the hidden layers play in the network?
-The hidden layers sit between the input and output layers; they turn the raw input into higher-level feature representations that help the network recognize more complex patterns.
How is the influence of one layer's activations on the next expressed mathematically?
-As a matrix-vector product of a weight matrix with the activation vector, plus a bias vector, with the sigmoid function then squishing each component into the range between 0 and 1.
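In symbols (a standard way of writing it, consistent with the transcript below):

```latex
a^{(1)} = \sigma\!\left(W a^{(0)} + b\right)
```

where $a^{(0)}$ is the column vector of activations in one layer, $W$ is the weight matrix, $b$ is the bias vector, and $\sigma$ is applied to each component of the result.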
Why do modern neural networks rarely use the sigmoid function?
-Because sigmoid networks turned out to be difficult to train; modern networks favor ReLU (rectified linear unit), which is easier to train and works better for deep networks.
How do the weights and biases determine a neuron's activation?
-The weights determine which pattern of activations in the previous layer the neuron responds to, and the bias sets how high the weighted sum must be before the neuron becomes meaningfully active. Adjusting the weights and biases trains the network to pick up on specific patterns.
Why is training described as finding the right setting of weights and biases?
-Because the network's behavior depends entirely on these parameters; adjusting them against training data makes the network better at recognizing and predicting its targets, and so at solving the problem at hand.
Why is understanding matrices and matrix-vector multiplication important for learning machine learning?
-Matrix-vector multiplication is the basis for how information is passed and transformed between layers; it is central to representing and computing layer transitions, and many machine learning libraries heavily optimize matrix operations.
How can the network's complicated mathematical expressions be simplified for understanding and for code?
-By writing the weight matrix and bias vector as single symbols, the full transition of activations from one layer to the next becomes one compact expression: a matrix-vector product plus a bias vector, wrapped in the sigmoid function.
Outlines
😀 Basic concepts and structure of a neural network
The first part describes how the brain effortlessly recognizes a 3 across very different pixel values, which motivates machine learning and neural networks. The author then lays out the structure of the network, including the input layer, hidden layers, and output layer, and the connections between them. In particular, every neuron holds a number between 0 and 1 called its activation, representing either a pixel's grayscale value or how strongly the image corresponds to a particular digit. Finally, the author notes the network's inspiration, the brain, and that the network is trained to recognize handwritten digits.
🧠 How the network operates and what we hope it does
The second part digs into how the network operates, and in particular how the hidden layers might plausibly process information. The author lays out the hope that each neuron in a middle layer corresponds to a subcomponent of a digit, such as the loop at the top of a 9 or an 8. It also discusses how recognizing edges and small pieces of the image could build up, step by step, to recognizing the whole digit. The author stresses why this kind of layered structure matters, and its potential for intelligent tasks beyond image recognition.
🔍 Adjusting the network's parameters and how training works
The third part details how the weights and biases in the network are used to pick up on specific patterns in an image. Weights are numbers attached to the connections between neurons, and the bias is a number added to the weighted sum before the activation function is applied. The author explains how the sigmoid function squishes the weighted sum into the range between 0 and 1, and how adding a bias shifts the threshold at which the neuron becomes active. The section also shows how this large collection of parameters can be packaged into matrices and vectors, making it easier to reason about and to implement in code.
📚 Mathematical foundations and a look ahead
The fourth part stresses the mathematical foundations of neural networks, particularly the importance of linear algebra in machine learning. The author describes the network as one elaborate function that takes in 784 input values and spits out 10 numbers. The question of how appropriate weights and biases can be learned just by looking at data is raised and left for the next video. The section also notes that modern networks rarely use the sigmoid function anymore, favoring ReLU (rectified linear unit) because it is easier to train. Finally, the author thanks the Patreon supporters and previews the upcoming video on how the network learns.
Keywords
💡Neural network
💡Activation
💡Hidden layer
💡Weight
💡Bias
💡Sigmoid function
💡ReLU
💡Matrix-vector multiplication
💡Backpropagation
💡Feature extraction
💡Parameter
Highlights
The brain effortlessly recognizes a 3 in a low-resolution image, showing off the power of the human visual cortex.
Even when the pixel values differ wildly, the brain recognizes that different images represent the same digit 3.
Writing a program to recognize the digit on a 28x28 pixel grid turns the task from comically trivial to dauntingly difficult.
Machine learning and neural networks are relevant and important to the present and to the future.
The video introduces the actual structure of a neural network, treating it as a piece of math rather than just a buzzword.
Over two introductory videos, a neural network that can recognize handwritten digits is put together.
Neural networks are inspired by the brain, but how exactly do they work?
The network starts from the 784 pixels of the input image, one neuron per pixel.
The output layer has 10 neurons, one for each digit.
The hidden layers are the unknown part of the network; how they help recognize digits is, for now, a mystery.
The network operates by having the activations in one layer determine the activations of the next.
A trained network means that an input image triggers a very specific pattern of activations.
The middle layers might pick up on subcomponents of digits, such as the upper loop of a 9 or the two loops of an 8.
The hope is that the network detects edges and patterns, which would also be very useful for other image-recognition tasks.
Designing a neural network means choosing the weights and biases on the connections between layers.
The weights and biases are the adjustable parameters of the network, roughly 13,000 in total.
Learning refers to the computer finding a setting of all the weights and biases that solves the problem at hand.
In modern neural networks the sigmoid function is used less; ReLU (rectified linear unit) is more common.
Transcripts
This is a 3.
It's sloppily written and rendered at an extremely low resolution of 28x28 pixels,
but your brain has no trouble recognizing it as a 3.
And I want you to take a moment to appreciate how
crazy it is that brains can do this so effortlessly.
I mean, this, this and this are also recognizable as 3s,
even though the specific values of each pixel is very different from one
image to the next.
The particular light-sensitive cells in your eye that are firing when
you see this 3 are very different from the ones firing when you see this 3.
But something in that crazy-smart visual cortex of yours resolves these as representing
the same idea, while at the same time recognizing other images as their own distinct
ideas.
But if I told you, hey, sit down and write for me a program that takes in a grid of
28x28 pixels like this and outputs a single number between 0 and 10,
telling you what it thinks the digit is, well the task goes from comically trivial to
dauntingly difficult.
Unless you've been living under a rock, I think I hardly need to motivate the relevance
and importance of machine learning and neural networks to the present and to the future.
But what I want to do here is show you what a neural network actually is,
assuming no background, and to help visualize what it's doing,
not as a buzzword but as a piece of math.
My hope is that you come away feeling like the structure itself is motivated,
and to feel like you know what it means when you read,
or you hear about a neural network quote-unquote learning.
This video is just going to be devoted to the structure component of that,
and the following one is going to tackle learning.
What we're going to do is put together a neural
network that can learn to recognize handwritten digits.
This is a somewhat classic example for introducing the topic,
and I'm happy to stick with the status quo here,
because at the end of the two videos I want to point you to a couple good
resources where you can learn more, and where you can download the code that
does this and play with it on your own computer.
There are many many variants of neural networks,
and in recent years there's been sort of a boom in research towards these variants,
but in these two introductory videos you and I are just going to look at the simplest
plain vanilla form with no added frills.
This is kind of a necessary prerequisite for understanding any of the more powerful
modern variants, and trust me it still has plenty of complexity for us to wrap our minds
around.
But even in this simplest form it can learn to recognize handwritten digits,
which is a pretty cool thing for a computer to be able to do.
And at the same time you'll see how it does fall
short of a couple hopes that we might have for it.
As the name suggests neural networks are inspired by the brain, but let's break that down.
What are the neurons, and in what sense are they linked together?
Right now when I say neuron all I want you to think about is a thing that holds a number,
specifically a number between 0 and 1.
It's really not more than that.
For example the network starts with a bunch of neurons corresponding to
each of the 28x28 pixels of the input image, which is 784 neurons in total.
Each one of these holds a number that represents the grayscale value of the
corresponding pixel, ranging from 0 for black pixels up to 1 for white pixels.
This number inside the neuron is called its activation,
and the image you might have in mind here is that each neuron is lit up when its
activation is a high number.
So all of these 784 neurons make up the first layer of our network.
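As a minimal sketch of this first layer in code (assuming NumPy; the array name image is a hypothetical stand-in for an actual digit image):

```python
import numpy as np

# Hypothetical 28x28 grayscale image, values from 0.0 (black) up to 1.0 (white).
image = np.zeros((28, 28))

# The first layer is just those 784 pixel values read off as one long vector of activations.
first_layer_activations = image.reshape(784)  # shape: (784,)
```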
Now jumping over to the last layer, this has 10 neurons,
each representing one of the digits.
The activation in these neurons, again some number that's between 0 and 1,
represents how much the system thinks that a given image corresponds with a given digit.
There's also a couple layers in between called the hidden layers,
which for the time being should just be a giant question mark for
how on earth this process of recognizing digits is going to be handled.
In this network I chose two hidden layers, each one with 16 neurons,
and admittedly that's kind of an arbitrary choice.
To be honest I chose two layers based on how I want to motivate the structure
in just a moment, and 16, well that was just a nice number to fit on the screen.
In practice there is a lot of room for experiment with a specific structure here.
The way the network operates, activations in one
layer determine the activations of the next layer.
And of course the heart of the network as an information processing mechanism comes down
to exactly how those activations from one layer bring about activations in the next layer.
It's meant to be loosely analogous to how in biological networks of neurons,
some groups of neurons firing cause certain others to fire.
Now the network I'm showing here has already been trained to recognize digits,
and let me show you what I mean by that.
It means if you feed in an image, lighting up all 784 neurons of the input layer
according to the brightness of each pixel in the image,
that pattern of activations causes some very specific pattern in the next layer
which causes some pattern in the one after it,
which finally gives some pattern in the output layer.
And the brightest neuron of that output layer is the network's choice,
so to speak, for what digit this image represents.
And before jumping into the math for how one layer influences the next,
or how training works, let's just talk about why it's even reasonable
to expect a layered structure like this to behave intelligently.
What are we expecting here?
What is the best hope for what those middle layers might be doing?
Well, when you or I recognize digits, we piece together various components.
A 9 has a loop up top and a line on the right.
An 8 also has a loop up top, but it's paired with another loop down low.
A 4 basically breaks down into three specific lines, and things like that.
Now in a perfect world, we might hope that each neuron in the second
to last layer corresponds with one of these subcomponents,
that anytime you feed in an image with, say, a loop up top,
like a 9 or an 8, there's some specific neuron whose activation is going to be close to 1.
And I don't mean this specific loop of pixels,
the hope would be that any generally loopy pattern towards the top sets off this neuron.
That way, going from the third layer to the last one just requires
learning which combination of subcomponents corresponds to which digits.
Of course, that just kicks the problem down the road,
because how would you recognize these subcomponents,
or even learn what the right subcomponents should be?
And I still haven't even talked about how one layer influences the next,
but run with me on this one for a moment.
Recognizing a loop can also break down into subproblems.
One reasonable way to do this would be to first
recognize the various little edges that make it up.
Similarly, a long line, like the kind you might see in the digits 1 or 4 or 7,
is really just a long edge, or maybe you think of it as a certain pattern of several
smaller edges.
So maybe our hope is that each neuron in the second layer of
the network corresponds with the various relevant little edges.
Maybe when an image like this one comes in, it lights up all of the
neurons associated with around 8 to 10 specific little edges,
which in turn lights up the neurons associated with the upper loop
and a long vertical line, and those light up the neuron associated with a 9.
Whether or not this is what our final network actually does is another question,
one that I'll come back to once we see how to train the network,
but this is a hope that we might have, a sort of goal with the layered structure
like this.
Moreover, you can imagine how being able to detect edges and patterns
like this would be really useful for other image recognition tasks.
And even beyond image recognition, there are all sorts of intelligent
things you might want to do that break down into layers of abstraction.
Parsing speech, for example, involves taking raw audio and picking out distinct sounds,
which combine to make certain syllables, which combine to form words,
which combine to make up phrases and more abstract thoughts, etc.
But getting back to how any of this actually works,
picture yourself right now designing how exactly the activations in one layer
might determine the next.
The goal is to have some mechanism that could conceivably combine pixels into edges,
or edges into patterns, or patterns into digits.
And to zoom in on one very specific example, let's say the
hope is for one particular neuron in the second layer to pick
up on whether or not the image has an edge in this region here.
The question at hand is what parameters should the network have?
What dials and knobs should you be able to tweak so that it's expressive
enough to potentially capture this pattern, or any other pixel pattern,
or the pattern that several edges can make a loop, and other such things?
Well, what we'll do is assign a weight to each one of the
connections between our neuron and the neurons from the first layer.
These weights are just numbers.
Then take all of those activations from the first layer
and compute their weighted sum according to these weights.
I find it helpful to think of these weights as being organized into a
little grid of their own, and I'm going to use green pixels to indicate positive weights,
and red pixels to indicate negative weights, where the brightness of
that pixel is some loose depiction of the weight's value.
Now if we made the weights associated with almost all of the pixels zero
except for some positive weights in this region that we care about,
then taking the weighted sum of all the pixel values really just amounts
to adding up the values of the pixel just in the region that we care about.
And if you really wanted to pick up on whether there's an edge here,
what you might do is have some negative weights associated with the surrounding pixels.
Then the sum is largest when those middle pixels
are bright but the surrounding pixels are darker.
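A small sketch of such a weighted sum (assuming NumPy; this hand-picked weight grid is only an illustration, not the trained network's actual weights):

```python
import numpy as np

pixels = np.random.rand(28, 28)   # hypothetical input image, grayscale values in [0, 1]

weights = np.zeros((28, 28))      # weight 0 for almost every pixel
weights[14, 8:20] = 1.0           # positive weights in the region we care about
weights[13, 8:20] = -1.0          # negative weights on the surrounding rows, so the
weights[15, 8:20] = -1.0          # sum is largest when the middle row is bright and
                                  # its neighbors are dark, i.e. a horizontal edge

weighted_sum = float(np.sum(weights * pixels))
```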
When you compute a weighted sum like this, you might come out with any number,
but for this network what we want is for activations to be some value between 0 and 1.
So a common thing to do is to pump this weighted sum into some function
that squishes the real number line into the range between 0 and 1.
And a common function that does this is called the sigmoid function,
also known as a logistic curve.
Basically very negative inputs end up close to 0,
positive inputs end up close to 1, and it just steadily increases around the input 0.
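Written out, the sigmoid function is:

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}
```

so very negative inputs land near 0, very positive inputs land near 1, and $\sigma(0) = 1/2$.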
So the activation of the neuron here is basically a
measure of how positive the relevant weighted sum is.
But maybe it's not that you want the neuron to
light up when the weighted sum is bigger than 0.
Maybe you only want it to be active when the sum is bigger than say 10.
That is, you want some bias for it to be inactive.
What we'll do then is just add in some other number like negative 10 to this
weighted sum before plugging it through the sigmoid squishification function.
That additional number is called the bias.
So the weights tell you what pixel pattern this neuron in the second
layer is picking up on, and the bias tells you how high the weighted
sum needs to be before the neuron starts getting meaningfully active.
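Putting the weights, bias, and sigmoid together, the activation of this one second-layer neuron is:

```latex
a = \sigma\!\left(w_1 a_1 + w_2 a_2 + \cdots + w_{784} a_{784} + b\right)
```

where $a_1, \dots, a_{784}$ are the first-layer activations, the $w_i$ are the weights on the connections, and $b$ is the bias (the negative 10 in the example above).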
And that is just one neuron.
Every other neuron in this layer is going to be connected to
all 784 pixel neurons from the first layer, and each one of
those 784 connections has its own weight associated with it.
Also, each one has some bias, some other number that you add
on to the weighted sum before squishing it with the sigmoid.
And that's a lot to think about!
With this hidden layer of 16 neurons, that's a total of 784 times 16 weights,
along with 16 biases.
And all of that is just the connections from the first layer to the second.
The connections between the other layers also have
a bunch of weights and biases associated with them.
All said and done, this network has almost exactly 13,000 total weights and biases.
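That count breaks down as follows for the layer sizes 784, 16, 16, and 10:

```latex
\underbrace{784 \times 16 \;+\; 16 \times 16 \;+\; 16 \times 10}_{12{,}960 \text{ weights}}
\;+\;
\underbrace{16 + 16 + 10}_{42 \text{ biases}}
\;=\; 13{,}002
```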
13,000 knobs and dials that can be tweaked and
turned to make this network behave in different ways.
So when we talk about learning, what that's referring to is
getting the computer to find a valid setting for all of these
many many numbers so that it'll actually solve the problem at hand.
One thought experiment that is at once fun and kind of horrifying is to imagine sitting
down and setting all of these weights and biases by hand,
purposefully tweaking the numbers so that the second layer picks up on edges,
the third layer picks up on patterns, etc.
I personally find this satisfying rather than just treating the network as a total
black box, because when the network doesn't perform the way you anticipate,
if you've built up a little bit of a relationship with what those weights and biases
actually mean, you have a starting place for experimenting with how to change the
structure to improve.
Or when the network does work but not for the reasons you might expect,
digging into what the weights and biases are doing is a good way to challenge
your assumptions and really expose the full space of possible solutions.
By the way, the actual function here is a little cumbersome to write down,
don't you think?
So let me show you a more notationally compact way that these connections are represented.
This is how you'd see it if you choose to read up more about neural networks.
Organize all of the activations from one layer into a column as a vector.
Then organize all of the weights as a matrix, where each row of that matrix corresponds
to the connections between one layer and a particular neuron in the next layer.
What that means is that taking the weighted sum of the activations in
the first layer according to these weights corresponds to one of the
terms in the matrix vector product of everything we have on the left here.
By the way, so much of machine learning just comes down to having a
good grasp of linear algebra, so for any of you who want a nice visual
understanding for matrices and what matrix vector multiplication means,
take a look at the series I did on linear algebra, especially chapter 3.
Back to our expression, instead of talking about adding the bias to each one of
these values independently, we represent it by organizing all those biases into a vector,
and adding the entire vector to the previous matrix vector product.
Then as a final step, I'll wrap a sigmoid around the outside here,
and what that's supposed to represent is that you're going to apply the
sigmoid function to each specific component of the resulting vector inside.
So once you write down this weight matrix and these vectors as their own symbols,
you can communicate the full transition of activations from one layer to the next in an
extremely tight and neat little expression, and this makes the relevant code both a lot
simpler and a lot faster, since many libraries optimize the heck out of matrix
multiplication.
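As a rough sketch of what that compact expression looks like in code (assuming NumPy; the names W, b, and a are placeholders, not part of any particular library):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One layer-to-layer transition: W has a row per neuron in the next layer,
# b is that layer's bias vector, a is the previous layer's activation vector.
def next_layer(W, b, a):
    return sigmoid(W @ a + b)
```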
Remember how earlier I said these neurons are simply things that hold numbers?
Well of course the specific numbers that they hold depends on the image you feed in,
so it's actually more accurate to think of each neuron as a function,
one that takes in the outputs of all the neurons in the previous layer and spits out a
number between 0 and 1.
Really the entire network is just a function, one that takes in
784 numbers as an input and spits out 10 numbers as an output.
It's an absurdly complicated function, one that involves 13,000 parameters
in the forms of these weights and biases that pick up on certain patterns,
and which involves iterating many matrix vector products and the sigmoid
squishification function, but it's just a function nonetheless.
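Chaining that layer transition gives the whole network as a single function, a sketch of which (again assuming NumPy, with randomly initialized placeholder weights rather than trained ones) could look like:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

layer_sizes = [784, 16, 16, 10]

# One placeholder weight matrix and bias vector per layer transition.
weights = [np.random.randn(n_out, n_in) for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
biases = [np.random.randn(n_out) for n_out in layer_sizes[1:]]

def network(x):
    """Takes in 784 numbers, spits out 10 numbers."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

output = network(np.random.rand(784))  # 10 activations, one per digit
```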
And in a way it's kind of reassuring that it looks complicated.
I mean if it were any simpler, what hope would we have
that it could take on the challenge of recognizing digits?
And how does it take on that challenge?
How does this network learn the appropriate weights and biases just by looking at data?
Well that's what I'll show in the next video, and I'll also dig a little
more into what this particular network we're seeing is really doing.
Now is the point I suppose I should say subscribe to stay notified
about when that video or any new videos come out,
but realistically most of you don't actually receive notifications from YouTube, do you?
Maybe more honestly I should say subscribe so that the neural networks
that underlie YouTube's recommendation algorithm are primed to believe
that you want to see content from this channel get recommended to you.
Anyway, stay posted for more.
Thank you very much to everyone supporting these videos on Patreon.
I've been a little slow to progress in the probability series this summer,
but I'm jumping back into it after this project,
so patrons you can look out for updates there.
To close things off here I have with me Lisha Li who did her PhD work on the
theoretical side of deep learning and who currently works at a venture capital
firm called Amplify Partners who kindly provided some of the funding for this video.
So Lisha one thing I think we should quickly bring up is this sigmoid function.
As I understand it early networks use this to squish the relevant weighted
sum into that interval between zero and one, you know kind of motivated
by this biological analogy of neurons either being inactive or active.
Exactly.
But relatively few modern networks actually use sigmoid anymore.
Yeah.
It's kind of old school right?
Yeah or rather ReLU seems to be much easier to train.
And ReLU, ReLU stands for rectified linear unit?
Yes it's this kind of function where you're just taking a max of zero
and a where a is given by what you were explaining in the video and
what this was sort of motivated from I think was a partially by a
biological analogy with how neurons would either be activated or not.
And so if it passes a certain threshold it would be the identity function but if it did
not then it would just not be activated so it'd be zero so it's kind of a simplification.
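In symbols, the function being described here is just:

```latex
\mathrm{ReLU}(a) = \max(0, a)
```

where $a$ is the weighted sum plus bias that would otherwise be fed through the sigmoid.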
Using sigmoids didn't help training or it was very difficult
to train at some point and people just tried ReLU and it happened
to work very well for these incredibly deep neural networks.
All right thank you Lisha.