The Most Important Algorithm in Machine Learning
Summary
TLDR This video takes a deep dive into the backpropagation algorithm, the core algorithm of machine learning. By explaining in detail how it works, tracing its history, and showing how to build it from scratch, the video underscores backpropagation's importance in solving a wide range of problems. It also raises the question of how backpropagation differs from the way biological brains learn, and previews the next part, which will explore synaptic plasticity in biological brains and what these differences suggest for machine learning algorithms.
Takeaways
- 🧠 Backpropagation is the core algorithm of machine learning; it is what enables artificial neural networks to learn from data and solve problems.
- 📈 The basic concept behind backpropagation is gradient descent: adjusting parameters to minimize a loss function.
- 🔄 The backpropagation procedure involves building a computational graph and using the chain rule to compute each node's influence on the loss.
- 📚 The history of backpropagation traces back to Leibniz in the 17th century, but the modern form of the algorithm was first published by Seppo Linnainmaa in 1970.
- 🌟 The 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams brought backpropagation into widespread use.
- 🧬 Although artificial neural networks differ from biological brains in structure and training data, backpropagation offers a valuable reference point for understanding how brains learn.
- 🔍 The core of backpropagation is the efficient computation of derivatives of complex functions, achieved by building a computational graph and applying the chain rule.
- 🛠️ With backpropagation we can optimize every parameter of a neural network to improve its performance on a given task.
- 📊 The loss function is a key concept in machine learning: it measures the discrepancy between a model's predictions and the actual values.
- 🎯 Backpropagation lets us find the minimum of the loss function via gradient descent, thereby optimizing the model's parameters.
- 🤔 The video also raises the question of whether biological brains use a mechanism similar to backpropagation or an entirely different algorithm.
Q & A
What role does backpropagation play in machine learning?
-Backpropagation is the foundational algorithm of machine learning: it is what allows artificial neural networks to learn from training data. The algorithm computes the gradient of the loss function with respect to every parameter and uses it to guide parameter adjustments, minimizing the loss and improving the model's predictive ability.
What is the basic principle behind backpropagation?
-The basic principle is gradient descent. Backpropagation computes the partial derivatives (the gradient) of the loss function with respect to the network's parameters, and the parameters are then updated in the direction opposite to the gradient to minimize the loss. The computation unrolls step by step via the chain rule, propagating from the output layer back to the input layer.
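As a minimal illustration of this update rule (a toy sketch, not code from the video), here is one-dimensional gradient descent on a simple loss whose derivative is known in closed form:

```python
# Minimize f(k) = (k - 3)^2 + 1, whose derivative is f'(k) = 2 * (k - 3).
# Repeatedly stepping against the derivative drives k toward the minimizer.

def df(k):
    return 2.0 * (k - 3.0)  # analytic derivative of the loss

k = 0.0   # initial knob setting
lr = 0.1  # learning rate
for _ in range(100):
    k -= lr * df(k)  # step in the direction opposite to the gradient

print(round(k, 4))  # converges to the minimizer k = 3
```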
Why is backpropagation fundamentally different from how biological brains learn?
-Although backpropagation has been enormously successful for training neural networks, it differs fundamentally from learning in biological brains. Brains learn through synaptic plasticity, a distributed and parallel process, whereas backpropagation is a gradient-based, top-down, iterative optimization procedure.
Who is credited with inventing backpropagation?
-There is no single clear inventor, since related concepts can be traced back to Leibniz in the 17th century. However, the first modern formulation of the algorithm is generally attributed to Seppo Linnainmaa's 1970 master's thesis, although he did not explicitly reference neural networks.
What is the role of the loss function in machine learning?
-The loss function provides a quantitative measure of a model's prediction error. By minimizing it, we can adjust the model's parameters so that its predictions come closer to the real data, improving the model's performance.
How should we understand gradients and gradient descent?
-The gradient is a vector pointing in the direction in which a function increases fastest, with a magnitude giving the rate of increase. Gradient descent is an optimization algorithm that iteratively adjusts parameters in the direction opposite to the gradient (the direction of steepest descent) to minimize an objective function.
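The two ideas can be sketched together in a few lines (an illustrative example with a made-up two-parameter loss, not the video's code): the gradient is the vector of partial derivatives, and gradient descent repeatedly steps against it:

```python
# Minimize L(k1, k2) = (k1 - 1)^2 + (k2 + 2)^2.
# The gradient (dL/dk1, dL/dk2) points uphill; we step the other way.

def grad(k1, k2):
    return 2 * (k1 - 1), 2 * (k2 + 2)  # vector of partial derivatives

k1, k2 = 0.0, 0.0
lr = 0.1
for _ in range(200):
    g1, g2 = grad(k1, k2)
    k1 -= lr * g1  # nudge each parameter against its own partial derivative
    k2 -= lr * g2

print(round(k1, 3), round(k2, 3))  # approaches the minimum at (1, -2)
```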
Why is the chain rule central to machine learning?
-The chain rule lets us compute derivatives of composed functions. Machine learning models are typically compositions of many simple functions, and the chain rule allows us to efficiently compute the derivatives of these compositions with respect to each parameter, which is the basis for gradient descent and model training.
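A quick numerical sanity check of the chain rule (hypothetical functions chosen for illustration): the analytic derivative of a composition matches a finite-difference estimate:

```python
import math

# For f(g(x)) with f(u) = exp(u) and g(x) = x^2, the chain rule gives
# d/dx f(g(x)) = f'(g(x)) * g'(x) = exp(x^2) * 2x.

def composed(x):
    return math.exp(x ** 2)

def chain_rule_derivative(x):
    return math.exp(x ** 2) * 2 * x

# cross-check against a central finite-difference estimate
x, h = 0.5, 1e-6
numeric = (composed(x + h) - composed(x - h)) / (2 * h)
analytic = chain_rule_derivative(x)
print(abs(numeric - analytic) < 1e-4)  # True: the two estimates agree
```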
What is the role of activation functions in neural networks?
-Activation functions introduce nonlinearity, enabling the network to learn and model more complex functions. Without them, a neural network of any depth would remain a linear model, incapable of handling complex nonlinear problems.
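This collapse is easy to demonstrate (a toy sketch with made-up 2x2 weight matrices, no ML library needed): two stacked linear layers with no activation in between compute exactly the same map as one combined linear layer:

```python
# Plain-Python matrix multiply for tiny matrices
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

W1 = [[1.0, 2.0], [3.0, 4.0]]
W2 = [[0.5, -1.0], [2.0, 0.0]]
x = [[1.0], [1.0]]

# "two-layer network" without activations: W2 @ (W1 @ x)
two_layer = matmul(W2, matmul(W1, x))
# equivalent single layer: (W2 @ W1) @ x
one_layer = matmul(matmul(W2, W1), x)

print(two_layer == one_layer)  # True: stacking adds no expressive power
```

Inserting a nonlinearity such as ReLU between the two layers breaks this equivalence, which is precisely what lets deeper networks represent functions a single linear layer cannot.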
What is the vanishing gradient problem, and how does it affect training?
-The vanishing gradient problem occurs in deep networks when gradients shrink progressively during backpropagation, until they have almost no effect on parameter updates. Training then stalls: the parameters stop changing and the model can no longer learn.
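A back-of-the-envelope sketch of the effect (illustrative numbers, not a trained network): the sigmoid's derivative is at most 0.25, so a gradient backpropagated through many sigmoid layers shrinks geometrically with depth:

```python
import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)  # peaks at 0.25 when z = 0

grad = 1.0
for _ in range(20):              # 20 layers deep
    grad *= sigmoid_prime(0.0)   # multiply by 0.25, the best case

print(grad)  # 0.25**20, about 9.1e-13: updates to early layers vanish
```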
How should we understand mean squared error (MSE) as a loss function?
-Mean squared error (MSE) is a common loss function measuring the discrepancy between predicted and actual values. It takes each data point's prediction error, squares it, and averages the results. The smaller the MSE, the closer the model's predictions are to the real data and the better the model performs.
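This definition translates directly into code (toy numbers chosen for illustration):

```python
# MSE: average of squared differences between predictions and targets
def mse(predicted, actual):
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

y_hat = [2.5, 0.0, 2.0, 8.0]   # hypothetical model predictions
y     = [3.0, -0.5, 2.0, 7.0]  # actual observed values

print(mse(y_hat, y))  # (0.25 + 0.25 + 0 + 1) / 4 = 0.375
```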
Why can neural networks approximate any function?
-This follows from the universal approximation theorem, which states that a feedforward network with enough neurons can, in theory, approximate any continuous function to arbitrary precision. This gives neural networks great power across a wide range of complex problems.
Outlines
🤖 Backpropagation in Machine Learning
This segment introduces backpropagation, the algorithm at the heart of the training procedures of nearly all machine learning systems, across different architectures and training data. Although artificial neural networks learn in a fundamentally different way from biological brains, backpropagation is the cornerstone of the field. The video is the first of a two-part series: this part explores backpropagation in artificial systems, explaining how it works and how it could be developed from scratch; the next part will cover synaptic plasticity in biological brains, whether backpropagation is biologically relevant, and what other algorithms the brain might use.
📈 Loss Functions and Model Training
This segment discusses how a model is trained via a loss function, a numerical measure of how well the model fits the data. Finding the best-fitting curve means minimizing the loss. Using a concrete example, it explains how to fit data points with a polynomial and how adjusting the polynomial's coefficients reduces the loss. It also introduces a machine (the "Curve Fitter 6000") built to help find the loss-minimizing parameter configuration.
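The "Curve Fitter 6000" and its blind knob-twiddling can be sketched as follows (a hypothetical reconstruction, with made-up data points roughly following y = x²; the machine and its knobs are the video's metaphor, not real hardware):

```python
import random

# fixed data points the machine is initialized with
data = [(-1.0, 1.0), (0.0, 0.0), (1.0, 1.0), (2.0, 4.0)]

def curve(ks, x):
    # degree-5 polynomial: k0 + k1*x + k2*x^2 + ... + k5*x^5
    return sum(k * x ** i for i, k in enumerate(ks))

def loss(ks):
    # squared vertical distance from each point to the curve
    return sum((curve(ks, x) - y) ** 2 for x, y in data)

# random perturbation method: nudge one knob at a time,
# keep the nudge only if the printed loss goes down
random.seed(0)
ks = [0.0] * 6           # six knobs, k0 through k5
for _ in range(20000):
    i = random.randrange(6)
    step = random.choice([-0.01, 0.01])
    before = loss(ks)
    ks[i] += step
    if loss(ks) > before:  # the nudge made things worse: revert it
        ks[i] -= step

print(round(loss(ks), 3))  # far below the initial loss of 18, just slowly
```

This works, but every nudge requires a full re-evaluation of the loss, which is exactly the inefficiency that derivatives (and ultimately backpropagation) remove.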
🔍 Derivatives and Optimization
This segment digs into the concept of the derivative and how it is used for optimization. The derivative is the slope of a function at a point, its local rate of change. By computing derivatives, we learn how to adjust parameters to reduce the loss. A simplified example shows how derivatives locate the loss-minimizing parameter value, and the segment introduces gradient descent for finding the minimum of a loss function in higher-dimensional spaces.
🧠 Backpropagation vs. Learning in the Brain
This segment contrasts backpropagation with the learning mechanisms of biological brains. Although backpropagation is extremely effective in machine learning, it does not match how the brain learns. The next video will explore synaptic plasticity in biological brains, discuss whether backpropagation is biologically plausible, and consider what alternative learning algorithms the brain might use.
🔧 Building the Computational Graph and Computing Gradients
This segment describes in detail how to build a computational graph and compute gradients, the core of backpropagation. A computational graph represents all the operations in a model, each node a simple mathematical operation. By decomposing a complex function into simple ones and applying the chain rule, we can compute the partial derivative of the loss with respect to every parameter; these partial derivatives tell us how to adjust the parameters to minimize the loss. A concrete example shows how to propagate backward from the output toward the inputs, computing each node's gradient and using these gradients for parameter updates.
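The backward step described here can be sketched as a miniature computational graph (an illustrative reconstruction in the spirit of the description, not the video's actual code): each node stores the local derivatives with respect to its inputs, and gradients flow from the output back through the graph via the chain rule, with branches adding up:

```python
class Node:
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # nodes feeding into this one
        self.local_grads = local_grads  # d(output)/d(parent) for each parent
        self.grad = 0.0                 # dL/d(this node), filled in backward

    def __add__(self, other):
        # sum rule: a nudge to either input passes straight through
        return Node(self.value + other.value, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # product rule: d(ab)/da = b and d(ab)/db = a
        return Node(self.value * other.value, (self, other),
                    (other.value, self.value))

def backward(output):
    # visit nodes in reverse topological order so each node's gradient
    # is complete before it is passed on to its parents
    order, seen = [], set()
    def topo(n):
        if id(n) not in seen:
            seen.add(id(n))
            for p in n.parents:
                topo(p)
            order.append(n)
    topo(output)
    output.grad = 1.0  # dL/dL = 1 by definition
    for node in reversed(order):
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += node.grad * local  # chain rule; branches add up

a, b = Node(2.0), Node(3.0)
loss = a * b + a  # a feeds two branches, so its gradients accumulate
backward(loss)
print(a.grad, b.grad)  # dL/da = b + 1 = 4.0, dL/db = a = 2.0
```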
🌟 The Training Loop
The final segment summarizes the training loop of a machine learning model: repeated forward and backward passes that optimize the parameters. The forward pass computes the loss; the backward pass computes the gradients and updates the parameters. The loop repeats until a loss-minimizing parameter configuration is found. It closes by asking how the brain learns, previewing the next part on the connection between biological learning and machine learning.
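The forward-backward-nudge loop can be condensed into a few lines (a toy sketch fitting a line y = w*x + b to made-up data generated from y = 2x + 1, with the gradients of the squared-error loss written out by hand via the chain rule):

```python
# dL/dw = sum 2*(w*x + b - y)*x,  dL/db = sum 2*(w*x + b - y)
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w, b, lr = 0.0, 0.0, 0.01
for epoch in range(2000):
    # forward pass: compute the loss (this is what you monitor)
    loss = sum((w * x + b - y) ** 2 for x, y in data)
    # backward pass: accumulate the gradients
    dw = sum(2 * (w * x + b - y) * x for x, y in data)
    db = sum(2 * (w * x + b - y) for x, y in data)
    # nudge each parameter against its gradient
    w -= lr * dw
    b -= lr * db

print(round(w, 3), round(b, 3))  # approaches w = 2, b = 1
```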
Keywords
💡Machine Learning
💡Backpropagation
💡Loss Function
💡Gradient Descent
💡Neural Network
💡Parameters
💡Computational Graph
💡Chain Rule
💡Activation Function
💡Learning Rate
Highlights
Nearly all machine learning systems have one thing in common: the backpropagation algorithm.
Backpropagation underlies the training of neural networks that solve different problems, use different architectures, and are trained on different data.
What enables artificial networks to learn is also what makes them fundamentally different from, and incompatible with, the biological brain.
The modern form of backpropagation first appeared in Seppo Linnainmaa's 1970 master's thesis.
In 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published a landmark paper on backpropagation.
Backpropagation enabled multi-layer perceptrons to solve problems successfully and to learn meaningful representations in their hidden neuron layers.
The basic idea of training a neural network is minimizing a loss function.
The loss function is a number that quantifies how well the model fits the data.
Gradient descent lets us efficiently find the parameter configuration that minimizes the loss function.
The gradient vector points in the direction of steepest ascent, so we adjust parameters in the opposite direction to minimize the loss.
Backpropagation computes derivatives of complex functions via the chain rule.
The chain rule powers the entire field of machine learning, letting us decompose complex functions into simple differentiable operations.
In a computational graph, backpropagation finds each parameter's influence on the loss.
Neural networks are built from layers of differentiable operations, which is what allows backpropagation to optimize their parameters.
Although artificial networks can solve complex problems, their learning mechanism may be entirely different from the brain's.
The next part will explore synaptic plasticity in biological brains and its relationship to backpropagation.
Optimization problems in machine learning can be solved efficiently with gradient descent and backpropagation.
Backpropagation applies not only to neural networks but to any model architecture that can be decomposed into differentiable operations.
With backpropagation, we can build neural networks large enough to approximate any function, and thereby tackle a wide variety of problems.
Transcripts
what do nearly all machine Learning
Systems have in common from GPT and
Midjourney to AlphaFold and various models
of the brain despite being designed to
solve different problems having
completely different architectures and
being trained on different data there is
something that unites all of them a
single algorithm that runs under the
hood of the training procedures in all
of those cases this algorithm called
back propagation is the foundation of
the entire field of machine learning
although its details are often
overlooked surprisingly what enables
artificial networks to learn is also
what makes them fundamentally different
from the brain and incompatible with
Biology this video is the first in a
two-part Series today we will explore
the concept of back propagation in
artificial systems and develop an intuitive
understanding of what it is why it works
and how you could have developed it from
scratch
yourself in the next video we will focus
on synaptic plasticity enabling learning
in biological brains and discuss whether
back propagation is biologically
relevant and if not what kind of
algorithms the brain may be using
instead if you're interested stay
tuned despite its transformative
impact it's hard to say who invented
back propagation in the first place as
certain concepts can be traced back to
Leibniz in the 17th century however it is
believed that the first modern
formulation of the algorithm still in
use today was published by Seppo Linnainmaa in
his master's thesis in 1970 although he
did not reference any neural networks
explicitly another significant Milestone
occurred in 1986 when David Rumelhart
Geoffrey Hinton and Ronald Williams
published a paper titled learning
representations by back propagating
errors they applied the back propagation
algorithm to multi-layer perceptrons a
type of a neural network and
demonstrated for the first time that
training with back propagation enables
the network to successfully solve
problems and develop meaningful
representations at the hidden neuron
level capturing important regularities
in the task as the field progressed
researchers scaled up these models
significantly and introduced various
architectures but the fundamental
principles of training remained largely
unchanged to gain a comprehensive
understanding of what exactly it means
to train a network let's try to build
the concept of back propagation from the
ground up consider the following problem
suppose you have collected a set of
points XY on the plane and you want to
describe their relationship to achieve
this you need to fit a curve y of X that
best represents the data since there are
infinitely many possible functions we
need to make some assumptions for
instance let's assume we want to find a
smooth approximation of the data using a
polynomial of degree 5 that means that the
resulting curve we're looking for will
be a combination of a constant term a
polynomial of degree zero a straight line a
parabola and so on up to a power of five
each weighted by specific coefficients
in other words the equation for the
curve is as follows where each K is some
arbitrary real number our job then
becomes finding the configuration of k0
through K5 which leads to the best
fitting curve to make the problem
totally unambiguous we need to agree on
what the best curve even means while
you can just visually inspect the data
points and estimate whether a given
curve captures the pattern or not this
approach is highly subjective and
impractical when dealing with large data
sets instead we need an objective
measurement a numerical value that
quantifies the quality of a fit one
popular method is to measure the square
distance between data points and the
fitted curve a high value suggests that
the data points are significantly far
from the curve indicating a poor
approximation conversely low values
indicate a better fit as the curve
closely aligns with the data points this
measurement is commonly referred to as a
loss and the objective is to minimize it
now notice that for a fixed data this
distance the value of the loss depends
only on the defining characteristics of
the curve in our case the coefficients from
k0 through
K5 this means that it is effectively a
function of parameters so people usually
refer to it as a loss function it's
important not to confuse two different
functions we are implicitly dealing with
here the first one is the function y of
X which has one input number and one
output number and defines the curve
itself it has this polynomial form given
by K's there are infinitely many such
functions and we would like to find the
best one to achieve this we introduce a
loss function which instead has six
inputs numbers k0 through K5 and for
each configuration it constructs the
corresponding curve y calculates the
distance between observed data points
and the curve and outputs a single
number the particular value of the loss
our job then becomes finding the
configuration of KS that yields a
minimum loss or minimizing the loss
function with respect to the
coefficients then plugging these optimal
k's into the general equation for the
Curve will give us the best curve
described in the data all right great
but how do we find this magic
configuration of k's that minimizes the
loss well we might need some help let's
build a machine called Curve fitter 6000
designed to simplify manual calculations
it is equipped with six adjustable knobs
for k0 through K5 which we can freely
turn to begin we initialize the machine
with our data points and then for each
setting of The Knobs it will evaluate
the curve y ofx compute the distance
from it to the data points and print out
the value of the loss function now we
can begin twisting the knobs in order to
find the minimum loss for example let's
start with some initial setting and
slightly nudge knob number one to the
right the resultant curve changed as
well and we can see that the value of
the loss function slightly decreased
great it means we are on the right track
let's turn knob number one in the same
direction once again uh-oh this time the
fit gets worse and the loss function
increases apparently that last nudge was
a bit too much so let's revert the knob
to the previous position and try knob
two and we can keep doing this
iteratively many many times nudging each
individual knob one at a time to see
whether the resulting curve is a better
fit this is a so-called random
perturbation method since we are
essentially wandering in the dark not
knowing in advance how each adjustment
will affect the loss function this would
certainly work but it's not very
efficient is there a way we can be more
intelligent about the knob adjustments
in the most General case when the
machine is a complete Black Box nothing
better than a random perturbation is
guaranteed to exist however a great deal
of computations including what's carried
out under the hood of our curve fitter
have a special property to them
something called differentiability that
allows us to compute the optimal knob
setting much more efficiently we will
dive deeper into what differentiability
means in just a minute but for now let's
quickly see the big picture overview of
where we are going our goal would be to
upgrade the machine so that it would
have a tiny screen next to each knob and
for any configuration those screens
should say which direction you need to
nudge each knob in order to decrease the
loss function and by how much
think about it for a second we are
essentially asking the machine to
predict the future and estimate the
effect of the noob adjustment on the
loss function without actually
performing that adjustment calculating
the loss and then reverting the knob
back like we did previously wouldn't
this glance into the future violate some
sort of principle after all we are
jumping to the result of the computation
without performing in it sounds like
cheating right well it turns out that
this idea lies on a very simple
mathematical foundation so let's spend
the next few minutes building it up from
scratch all right let's consider a
simpler case first where we freeze five
out of six knobs for example suppose
someone tells you that the rest of them
are already in the optimal position so
all you need to do is to find the best
value for one remaining knob
essentially the machine now has only one
variable parameter K1 that we can tweak
and so the loss function is also a
simpler function which accepts one
number the knob setting and outputs
another number the loss value as a
function of one variable it can be
conveniently visualized as a graph in a
two-dimensional plane which captures the
relationship between the input and the
output for example it may have this
shape right here and our goal goal is to
find this value of K1 which corresponds
to the minimum of the loss function but
we don't have access to the true
underlying shape all we can do is to set
the knob at a chosen position and kind
of query the machine for the value of
the loss in other words we can only
sample individual points along the
function we're trying to minimize and we
are essentially blind to how the
function behaves in between the known
points before we sample them but suppose
we would like to know something more
about the function not just each value
at each point for example whether at
this point the function is going up or
down this information will ultimately
guide our adjustments because if you
know that the function is going down as
you increase the input turning the knob
to the right is a safe bet since you are
guaranteed to decrease the loss with
this manipulation let's put this notion
of going up or down around a point on a
stronger mathematical ground suppose we
have just sampled the point x0 y0 on
this graph what we can do is increase
the input by a small amount Delta X this
new adjusted input will result in a new
value of y which will differ from the
old value by some Delta y this Delta
depends on the magnitude of our
adjustment for example if we take a step
Delta X which is 10 times smaller Delta
y will also be approximately 10 times as
small this is why it makes sense to take
the ratio Delta y over Delta X the
amount of change in the output per unit
change in the input graphically this
ratio corresponds to a slope of a
straight line going through the points x0
y0 and x0 plus delta x y0 plus delta
y now notice that as we take smaller and
smaller steps this straight line will
more and more accurately align with the
graph in the neighborhood of the point x0
y0 let's take a limit of this ratio as
Delta X goes to infinitely small values
then this limiting case value which this
ratio converges to for infinitesimally
small Delta X's is what is called the
derivative of a function and it is
denoted by dy/dx visually the derivative of
a function at some point is the slope of
the line that is tangent to the graph
and thus corresponds to the
instantaneous rate of change or
steepness of that function around that
point but different points along the
graph might have different steepness
values so the derivative of the entire
function is not a single number in fact
the derivative dy/dx is itself a
function of X that takes an arbitrary
value of x and outputs the local
steepness of Y ofx at that point this
definition assigns to every function its
derivative Alter Ego another function
operating on the same input domain which
carries information about the steepness
of the original function there is a bit
of a subtlety strictly speaking the
derivative may not exist if the function
doesn't have a steepness around some
point for example if it has sharp
corners or
discontinuities however for the
remainder of the video we are going to
assume that all functions we are dealing
with are smooth so that the derivative
always
exists this is a reasonable claim
because we can control what sort of
functions go into our models when we
build them and people usually restrict
everything to smooth or differentiable
functions to make all the math work out
nicely all right great now along with
the underlying loss as a function of K1
which is hidden from us we can also
reason about its derivative another
function of K1 which we also don't know
that is equal to the steepness of the
loss function at that
point let's suppose that similarly to
how we can query the loss function by
running our machine and obtaining
individual samples
there is a mechanism for us to sample
the derivative function as
well so for every input value of K1 the
machine will output the value of the
loss and the local steepness of the loss
function around that point notice that
this derivative information is exactly
the sort of look into the future we were
looking for to make smarter knob
adjustments for example let's use it to
efficiently find the optimal value
of K1 what we can do is the following
first start at some random position ask
the machine for a value of the loss and
the derivative of the loss function at
that
position take a tiny step in the
direction opposite of the derivative if
the derivative is negative it means that
the function is going down and so if we
want to arrive at the minimum we need to
move in the direction of increasing
value of K1 repeat this procedure until
you reach the point where the derivative
is zero which essentially corresponds to
the minimum where the tangent line is
flat essentially each adjustment in such
a guided fashion Works kind of like a
ball rolling down the hill along the
graph until it reaches a
valley although in the beginning we
froze five out of six knobs for
Simplicity this process is easily
carried out to higher
dimensions for example suppose now we
are free to tweak two different knobs K1
and
K2 the loss would become a function of
two variables which can be visualized as
a surface but what about the derivative
recall that by definition the derivative
at each point tells us how the output
changes per unit change of the input but
now we have two different inputs should
we nudge only K1 K2 or
both essentially our function will have
two different derivatives that are
usually called partial derivatives
because of this ambiguity which input to
nudge namely when we have two knobs the
derivative of the loss function with
respect to parameter K1 is written like
this it is how much the output changes
per unit change in K1 if you hold K2
constant and conversely this expression
tells you the rate of change of the
output if you hold K1 constant and
slightly nudge K2 geometrically you can
imagine slicing the surface with planes
parallel to the axis intersecting at the
point of Interest K1 K2 so that each of
the two cross-sections is like a
one-dimensional graph of the loss as a
function of one variable while the other
one is kept constant then the slope of a
tangent line at each cross-section will
give you a corresponding partial
derivative of the loss at that point
while thinking about partial derivatives
as two separate surfaces one for each
variable is a perfectly valid way people
usually plug the two different values
into a vector called a gradient Vector
essentially this is a mapping from two
input values to another two numbers
where the first signifies how much the
output changes per tiny change in the
first input and similarly for the second
input geometrically this Vector points
in the direction of steepest Ascent so
if you want to minimize a function like
in the case for our loss we need to take
steps in the direction opposite to
this
gradient this iterative procedure of
nudging the parameters in the direction
opposite of the gradient Vector is
called gradient descent which you have
probably heard of this is analogous to a
ball rolling down the hill for the
two-dimensional case and the partial
derivatives essentially tell you which
direction is downhill going Beyond two
Dimensions is impossible to visualize
directly but the math stays exactly the
same for instance if we are now free to
tweak all the six knobs the loss
function is a hyper surface in Six
Dimensions and the gradient Vector now
has six numbers packed into it but it
still points in the direction of
steepest Ascent so if we iteratively
take small steps in the direction
opposite to it we are going to roll the
ball down the hill in Six Dimensions and
eventually reach the minimum of the loss
function great let's back up a bit
remember how we were looking for ways to
add screens next to each knob that would
give us the direction of optimal
adjustment well it is essentially
nothing more but the components of the
gradient Vector if at a particular
configuration the partial derivative of
the loss with respect to K1 is positive
it means that increasing K1 will lead to
increased loss so we need to decrease
the value of the knob by turning it to
the left and similarly for all other
parameters this is how the derivatives
serve as these windows into the future
by providing us with information about
local behavior of the function and once
we have a way of accessing the
derivative we can perform gradient
descent and efficiently find the minimum
of the loss function thus solving the
optimization problem however there is an
elephant in a room so far we have
implicitly assumed the derivative
information is given to us or that we
can sample the derivative at a given
point similarly to how we sample the
loss function Itself by running the
calculation of the machine but how do
you actually find the derivative as we
will see further this is the main
purpose of the back propagation
algorithm essentially the way we find
derivatives of arbitrarily complex
functions is the following first there
are a handful of building blocks to
begin with simple functions derivatives
of which are known from calculus these
are the kind of derivative formulas you
often memorize in college for example if
the function is linear it's pretty clear
that its derivative will be a constant
equal to the slope of that line
everywhere
which coincides with its own tangent
line a parabola x² becomes more steep as
you increase X and its derivative is
actually
2x in fact there is a more general
formula for the derivative of x to the power
of n similarly derivatives of the
exponent and logarithm can be written
down
explicitly but these are just individual
examples of simple well-known functions
in order to compute arbitrary
derivatives we need a way to combine
such Atomic building blocks
together there are a few rules how to do
it for instance the derivative of a sum
of two functions is the sum of the
derivatives there is also a formula for
the derivative of a product of two
functions this gives you a way to
compute things like the derivative of
3x^2 minus e to the x but to complete the
picture and to be able to find
derivatives of almost everything we need
one other rule called The Chain rule
which Powers the entire field of machine
learning it tells you how to compute the
derivative of a combination of two
functions when one of them is an input
to another here is a way to reason about
this suppose you take one of those
simpler machines which receives a single
input x that you can vary with a knob
and spits out an output J of X now you
take a second machine of this kind which
performs a different function f of
x what would happen if you connect them
in sequence so that the output of the
first machine is fed into the second one
as an
input notice that such a construction
can be thought of as a single compound
function the input fed
into the second machine is J of X and
so the local rate of change of the
second machine is thus the derivative of
f evaluated at the point J of X now
imagine you nudge the knob X by a tiny
amount Delta that input nudge when it
comes out of the first machine will be
multiplied by the derivative of J since
the derivative is the rate of change in
the output per unit change of the input
so after the first function the output
will increase by Delta multiplied by the
derivative of J this expression is
essentially a tiny nudge in the input to
the second machine whose derivative at
that point is given by this expression
this means that for each Delta increase
in the input we bump the output by this
much hence the derivative when you
divide that by Delta will look like this
you can think about it as a set of three
interconnected Cog Wheels where the
first one represents the input knob X
and the other two wheels are functions J
of X and F of J of X respectively when
you nudge the first wheel it induces a
nudge in the middle wheel and the
amplitude of that change is given by the
derivative of J which in turn causes the
Third Wheel to rotate and the amplitude
of that resulting nudge is given by
chaining the derivatives together all
right great now we have a
straightforward way of obtaining a
derivative of any arbitrarily complex
function as long as it can be decomposed
into building blocks simple functions
with explicit derivative formulas such
as summations multiplications exponents
logarithms Etc but how can it be used to
find the best curve using our curve
fitter the big picture we are aiming for
is the following for each of our
parameter knobs we will write down its
effect on the loss in terms of simple
easily differentiable operations once we
have that sequence of building blocks no
matter how long we should be able to
sequentially apply the chain rule to
each of them in order to find the value
of the derivative of the loss function
with respect to each of the input knobs
and perform iterative gradient descent
to minimize the loss let's see an
example of this first we are going to
create a knob for each number the loss
function can possibly depend on this
obviously includes the parameters but
there is also the data itself
coordinates of points to which we are
fitting the curve in the first place now
during optimization the data points are
set in Stone so changing them in order
to obtain a lower loss would make no
sense however for conceptual purposes
we can think about these values as fixed
knobs set in one position so that we
cannot nudge them once we have all the
existing numbers being fed into the
machine we can start to break down the
loss
calculation Remember by definition it is
the sum of squared vertical distances from
each point to the curve parameterized by
k's so for instance let's take the
first data point X1 y1 multiply the x
coordinate by K1 add that to the squared
value of X1 multiplied by K2 and so on
for other KS including the constant term
k0 this sum of weight and powers of X1
is the value of y predicted by the
current curve F of X1 let's call it y1
hat next we need to take the squared
difference between the actual value and
the predicted value this is how much the
first data point contributes to the
resulting value of the loss
function repeating the same procedure
for all remaining data points and
summing up the resulting squared
distances
gives us the overall total loss that we
are trying to minimize the computation
we just performed finding the value of
the loss for a given configuration of
parameter and data knobs is known as the
forward
step the entire sequence of calculations
can be visualized as this kind of
computational graph where each node is
some simple operation like addition or
multiplication forward step then
corresponds to computations flowing from
left to right but to perform
optimization we also need information
about gradients how each knob influences
the loss now we are going to do what's
known as the backward step and unroll
the sequence of calculations in reverse
order to find derivatives what makes the
backward step possible is the fact that every
node in our compute graph is an easily
differentiable operation think of
individual nodes as these tiny machines
which simply add multiply or take Powers
we know their derivatives and because
their outputs are connected sequentially
we can apply the chain
rule this means that for each node we
can find its gradient the partial
derivative of of the output loss with
respect to that
node let's see how it can be done
consider a region of the compute graph
where two number nodes A and B are being
fed into a machine that performs
addition and its result a plus b is
further processed by the system to
compute the overall output L suppose we
already computed the gradient of a plus
b earlier so that we know how nudging this
sum will affect the output the question
is what are individual gradients of A
and
B well intuitively if you nudge a by
some amount a + b will be nudged by the
same amount so the gradient or the
partial derivative of the loss with
respect to a is the same as the gradient
of the sum and similarly for B this can
be seen more formally by writing down the
chain Rule and noticing that the
derivative of A+ B with respect to a is
just one in other words when you
encounter this situation in the compute
graph then the gradient of the sum just
simply propagates into the gradients of
the nodes that plug into the sum machine
another possible scenario is When A and
B are
multiplied just like before suppose we
know the gradient of their product
because it was computed before in this
case an individual nudge to a will be scaled
by a factor of B so the product will be
nudged B times as much which propagates
into the output so whatever the
derivative of the output with respect to
the product of ab is the output
derivative with respect to a will get
Scaled by a factor of B and vice versa
for the gradient of B once again it can
be seen more formally by examining the
chain rule in other words the
multiplication node in the compute graph
distributes the downstream gradient
across incoming nodes by multiplying it
crosswise by their
values similar rules can be easily
formulated for other building block
calculations such as raising a number to
a power or taking the
logarithm finally when a single node
takes part in multiple branches of the
compute graph gradients from the
corresponding branches are simply added
together indeed suppose you have the
following structure in the graph where
the same node a plugs into two different
operations that contribute to the
overall loss then if you nudge a by
Delta the output will be simultaneously
nudged by this derivative from the
first branch and this derivative from
the second branch so the overall
effect of nudging a will be the sum of the
two gradients all right great now that
we have constructed a computational
graph and established how to process
individual chunks of it we can just
sequentially apply those rules starting
from the output and working our way
backwards for instance the rightmost
node in the graph is the resulting value
of the loss function how does the
incremental change in that node affect
the output well it is the output so its
gradient is by definition equal to one
next the loss function is the sum of
many Delta y's squared we know what to
do with the summation node it just
copies whatever the gradient value is to
the right of it into all incoming nodes
consequently the gradients of all Delta
y squared will also be equal to one each
of those nodes is this squared value of
the corresponding Delta Y and we know
how to differentiate this squaring
operation the derivative of the loss
function with respect to Delta y1 will
be 2 * the Delta y1 which is just the
number we found during the forward
calculation and we can keep doing this
propagation of sequential derivative
calculation backwards along our compute
graph until we reach the leftmost nodes
which are the data and parameter knobs
the derivatives of the loss with
respect to the input data don't really
matter but the derivatives with respect
to the parameters is exactly what we
want once these parameter gradients are
found we can perform one iteration of
gradient descent namely we're going to
slightly tweak the knobs in the
directions opposite to the gradient the
exact magnitude of each adjustment being
the negative product of the gradient and
some small number called The Learning
rate for example
0.01 note that after the adjustment is
performed the configuration of the
machine and the resulting loss are
different and so the old gradient values
we found no longer hold so we need to
run the forward and backward
calculations once again to obtain
updated gradients and the new decreased
loss performing this Loop of forward
pass backward pass nudge repeat is the
essence of training every modern machine
Learning System and exactly the same
algorithm is used today in even the most
complicated models as long as the
problem you're trying to solve with a
given model architecture can be
decomposed into individual operations
that are differentiable you can
sequentially apply the chain rule many
times to arrive at the optimal setting
of The parameters for instance a feed
forward neural network is essentially a
bunch of multiplications and summations
with a few nonlinear activation
functions sprinkled between the layers
each of those atomic computations is
differentiable so you can construct the
compute graph and run the backward path
on it to find how each parameter like
connection weights between neurons
influence the loss function and because
neural networks given enough neurons can
in theory approximate any function
imaginable we can create a large enough
sequence of these building block
mathematical machines to solve problems
such as classifying images and even
generating new text this seems like a
very elegant and efficient solution
after all if you want to solve the
optimization problem derivatives tell
you exactly which adjustments are
necessary but how similar is this to
what the brain actually does when we
learn to walk speak and read is the
brain also minimizing some sort of loss
function does it calculate derivatives
or could it be doing something totally
different in the next video we are going
to dive into the world of synaptic
plasticity and talk about how biological
neural networks learn in keeping with
the topic of biological learning I'd
like to take a moment to give a shout
out to Shortform a longtime partner of
this channel Shortform is a platform
which
stay tuned for more interesting topics
coming up goodbye and thank you for the
interest in the
brain