The Most Important Algorithm in Machine Learning

Artem Kirsanov
31 Mar 2024 · 40:08

Summary

TLDR: This video takes a deep dive into backpropagation, the core algorithm of machine learning. By explaining in detail how it works, how it developed historically, and how to build it from scratch, the video underscores the central role backpropagation plays in solving a wide range of problems. It also raises the question of how backpropagation differs from the learning mechanisms of biological brains, and previews the next part, which will explore synaptic plasticity in biological brains and what those differences might suggest for machine learning algorithms.

Takeaways

  • 🧠 Backpropagation is the core algorithm of machine learning; it is what allows artificial neural networks to learn from data to solve problems.
  • 📈 The basic concept behind backpropagation is gradient descent: adjusting parameters to minimize a loss function.
  • 🔄 The backpropagation process involves building a computational graph and using the chain rule to compute each node's effect on the loss.
  • 📚 The history of backpropagation can be traced back to the 17th century, but the first modern formulation was published by Seppo Linnainmaa in 1970.
  • 🌟 A 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams brought backpropagation into widespread use.
  • 🧬 Although artificial neural networks differ from biological brains in structure and training data, backpropagation offers a valuable reference point for thinking about how brains learn.
  • 🔍 At its core, backpropagation efficiently computes derivatives of complex functions by building a computational graph and applying the chain rule.
  • 🛠️ With backpropagation, every parameter of a neural network can be optimized to improve the model's performance on a given task.
  • 📊 The loss function is a key concept in machine learning: it measures the discrepancy between a model's predictions and the actual values.
  • 🎯 Backpropagation lets us use gradient descent to find the minimum of the loss function and thereby optimize the model's parameters.
  • 🤔 The video also raises the question of whether the brain uses a mechanism similar to backpropagation or an entirely different algorithm.

Q & A

  • What role does backpropagation play in machine learning?

    - Backpropagation is the foundational algorithm of machine learning: it is what allows artificial neural networks to learn from training data. The algorithm computes the gradient of the loss function with respect to each parameter and uses it to guide parameter updates, minimizing the loss and improving the model's predictions.

  • What is the basic principle behind backpropagation?

    - The basic principle is gradient descent. Backpropagation computes the partial derivatives (the gradient) of the loss function with respect to the network's parameters, and the parameters are then updated in the direction opposite to the gradient so as to minimize the loss. The computation unfolds via the chain rule, propagating backward from the output layer toward the input layer.

  • Why is backpropagation fundamentally different from how biological brains learn?

    - Although backpropagation has been enormously successful for training neural networks, it differs fundamentally from how biological brains learn. Brains learn through synaptic plasticity, a distributed and parallel process, whereas backpropagation is a gradient-based, top-down, iterative optimization procedure.

  • Who invented backpropagation?

    - There is no single undisputed inventor, since related concepts can be traced back to the 17th century. The first modern formulation of the algorithm, however, is credited to Seppo Linnainmaa's 1970 master's thesis, although he did not explicitly reference neural networks.

  • What is the role of the loss function in machine learning?

    - The loss function provides a quantitative measure of a model's prediction error. By minimizing the loss, the model's parameters are adjusted so that its predictions come closer to the real data, improving the model's performance.

  • How should we understand gradients and gradient descent?

    - The gradient is a vector pointing in the direction in which the function increases fastest, with a magnitude equal to that rate of increase. Gradient descent is an optimization algorithm that iteratively adjusts parameters in the direction opposite to the gradient (the direction of steepest descent) in order to minimize the objective function.
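The update rule described above can be sketched in a few lines. This is my own minimal illustration (not code from the video), using a simple two-parameter bowl whose gradient is known in closed form:

```python
# Gradient descent on f(x, y) = (x - 3)**2 + (y + 1)**2,
# whose exact gradient is (2*(x - 3), 2*(y + 1)).
def grad_f(x, y):
    return 2 * (x - 3), 2 * (y + 1)

x, y = 0.0, 0.0          # arbitrary starting point
lr = 0.1                 # learning rate (step size)
for _ in range(200):
    gx, gy = grad_f(x, y)
    x -= lr * gx         # step against the gradient
    y -= lr * gy

# x, y converge toward the minimum at (3, -1)
```

Each step moves opposite the gradient, so the loss can only shrink for a small enough step size.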

  • Why is the chain rule central to machine learning?

    - The chain rule lets us compute derivatives of compositions of functions. Machine learning models are typically built by composing many simple functions, and the chain rule makes it possible to compute the derivatives of these compositions with respect to each parameter efficiently, which is the basis for gradient descent and model training.
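The chain rule is easy to verify numerically. In this sketch of mine (a hypothetical example, not from the video), the composition h(x) = sin(x²) has derivative h'(x) = cos(x²) · 2x, which we can check against a finite-difference estimate:

```python
import math

# h(x) = sin(x**2): outer function f(u) = sin(u), inner function g(x) = x**2.
def h(x):
    return math.sin(x ** 2)

def h_prime(x):
    # Chain rule: h'(x) = f'(g(x)) * g'(x) = cos(x**2) * 2*x
    return math.cos(x ** 2) * 2 * x

x0 = 1.3
eps = 1e-6
# Central finite difference: an independent numeric estimate of the slope.
numeric = (h(x0 + eps) - h(x0 - eps)) / (2 * eps)
# numeric and h_prime(x0) agree to high precision
```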

  • What is the role of activation functions in neural networks?

    - Activation functions introduce nonlinearity, enabling the network to learn and model more complex functions. Without them, a neural network, no matter how many layers it has, remains a linear model and cannot handle complex nonlinear problems.
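The "stacked linear layers stay linear" claim can be checked directly. This is a hypothetical example of mine with arbitrary weights (not the video's code): two linear maps compose into a single linear map, while inserting a tanh between them breaks the collapse:

```python
import math

def linear(w, b, x):
    return w * x + b

# Two linear layers compose into another linear map:
# w2*(w1*x + b1) + b2 == (w2*w1)*x + (w2*b1 + b2)
def two_linear(x):
    return linear(3.0, -1.0, linear(2.0, 0.5, x))

def collapsed(x):
    return 6.0 * x + 0.5   # 3*2 = 6 and 3*0.5 - 1 = 0.5

# A nonlinearity (here tanh) between the layers prevents the collapse,
# letting deeper stacks represent functions no single linear map can.
def two_layer_tanh(x):
    return linear(3.0, -1.0, math.tanh(linear(2.0, 0.5, x)))
```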

  • What is the vanishing gradient problem, and how does it affect training?

    - The vanishing gradient problem occurs in deep networks when gradients shrink progressively as they are propagated backward, until updates to the parameters become negligible. Training then stalls: the parameters stop changing and the model can no longer learn.
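Why gradients shrink with depth can be made concrete. In this toy sketch (my own illustration, under the simplifying assumption of unit weights and best-case pre-activations), the chain rule multiplies one sigmoid-derivative factor per layer, and each factor is at most 0.25:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)            # maximum value 0.25, attained at z = 0

# Gradient factor accumulated through a chain of n sigmoid units
# (pre-activations all at the best case z = 0, weights of 1):
def chain_gradient(n):
    g = 1.0
    for _ in range(n):
        g *= sigmoid_prime(0.0)   # multiply by 0.25 per layer
    return g

# chain_gradient(10) == 0.25**10, already below 1e-6: updates to
# early layers become vanishingly small as depth grows.
```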

  • How should we understand mean squared error (MSE) as a loss function?

    - Mean squared error (MSE) is a common loss function that measures the discrepancy between predictions and actual values: it squares each data point's prediction error and averages the results. The smaller the MSE, the closer the model's predictions are to the data and the better the model performs.
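The definition translates directly into code. A minimal sketch (my own, not from the video):

```python
def mse(predictions, targets):
    # Mean of squared per-point errors.
    assert len(predictions) == len(targets)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)

# Example: predictions off by 1 and by 3 on two points give (1 + 9) / 2 = 5.
```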

  • Why can neural networks approximate any function?

    - This follows from the universal approximation theorem, which states that a feedforward network with enough neurons can, in theory, approximate any continuous function to arbitrary precision. This is what gives neural networks their power across a wide range of complex problems.

Outlines

00:00

🤖 Backpropagation in machine learning

This section introduces backpropagation, the algorithm underlying the training procedures of nearly all machine learning models, across different architectures and training data. Although artificial neural networks learn in a fundamentally different way from biological brains, backpropagation is the cornerstone of machine learning. The video is split into two parts: the first explores backpropagation in artificial systems, explaining how it works and how to develop it from scratch; the second will cover synaptic plasticity in biological brains, whether backpropagation is biologically relevant, and what other algorithms the brain might use.

05:01

📈 Loss functions and model training

This section discusses how a model is trained via a loss function, a numerical measure of how well the model fits the data. Finding the best-fitting curve means minimizing the loss. A concrete example shows how to fit data points with a polynomial and how adjusting the polynomial's coefficients minimizes the loss. It also introduces a machine, the Curve Fitter 6000, built to help find the parameter configuration that minimizes the loss.
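The Curve Fitter setup can be sketched in a few lines. This is my own toy version (hypothetical data and coefficients, not the video's code): a degree-5 polynomial evaluated from its coefficients, and a loss that sums squared vertical distances to the data:

```python
# A minimal "Curve Fitter 6000": evaluate a degree-5 polynomial given
# coefficients k[0]..k[5] and report the squared-distance loss on data.
def curve(k, x):
    # y(x) = k0 + k1*x + k2*x**2 + ... + k5*x**5
    return sum(k[i] * x ** i for i in range(6))

def loss(k, data):
    # Sum of squared vertical distances from data points to the curve.
    return sum((curve(k, x) - y) ** 2 for x, y in data)

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 7.0)]   # toy data points
k = [1.0, 2.0, 0.0, 0.0, 0.0, 0.0]            # y = 1 + 2x: fits the first two exactly
# loss(k, data) = 0 + 0 + (5 - 7)**2 = 4
```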

10:02

🔍 Derivatives and optimization

This section digs into the concept of the derivative and how it enables optimization. The derivative is the slope of a function at a point, its local rate of change. Computing the derivative tells us how to adjust a parameter to reduce the loss. A simplified example shows how to use the derivative to find the parameter value that minimizes the loss, and how gradient descent finds the minimum of a loss function in higher-dimensional spaces.
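The one-knob search described here can be simulated directly. In this sketch of mine (not the video's code), we may only query the hidden loss for values, so we estimate the local slope with finite differences and step downhill:

```python
def L(k):
    # The "hidden" loss the machine computes; we may only query its values.
    return (k - 2) ** 2 + 1

def sample_derivative(k, eps=1e-6):
    # Estimate the local slope by querying the loss at two nearby points.
    return (L(k + eps) - L(k - eps)) / (2 * eps)

k = -5.0                               # arbitrary initial knob position
lr = 0.2                               # step size
for _ in range(100):
    k -= lr * sample_derivative(k)     # step opposite the slope

# k settles near 2, where the tangent line is flat (derivative ~ 0)
```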

15:04

🧠 Backpropagation versus learning in the brain

This section contrasts backpropagation with the learning mechanisms of biological brains. Although backpropagation is highly effective in machine learning, it is not how the brain learns. The next part of the video will look at synaptic plasticity in biological brains, discuss whether backpropagation is biologically plausible, and consider what other learning algorithms the brain might use.

20:07

🔧 Building the computational graph and computing gradients

This section details how to build a computational graph and compute gradients, the heart of backpropagation. A computational graph is a structure representing every operation in the model, with each node standing for a simple mathematical operation. By decomposing a complex function into a composition of simple ones and applying the chain rule, we can compute the partial derivative of the loss with respect to each parameter; these partial derivatives tell us how to adjust the parameters to minimize the loss. A concrete example walks backward from the output toward the inputs, computing the gradient at each node and using those gradients to update the parameters.
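A computational graph with a reverse pass fits in a few dozen lines. The following toy `Value` class is my own sketch in the spirit of minimal autodiff libraries (not the video's code): it supports addition and multiplication, records parents as graph edges, and applies the chain rule in reverse topological order:

```python
class Value:
    """A node in a computational graph holding a scalar and its gradient."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None   # filled in by the op that creates the node

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def backward():
            self.grad += out.grad        # d(a+b)/da = 1
            other.grad += out.grad       # d(a+b)/db = 1
        out._backward = backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = backward
        return out

    def backprop(self):
        # Topologically order the graph, then apply the chain rule in reverse.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

# Example: loss = a*b + a. Backprop fills in d(loss)/da and d(loss)/db.
a, b = Value(2.0), Value(3.0)
loss = a * b + a
loss.backprop()
# a.grad = b + 1 = 4.0, b.grad = a = 2.0
```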

25:08

🌟 The training loop

The final section summarizes the loop by which machine learning models are trained: parameters are optimized through repeated forward and backward passes. The forward pass computes the loss; the backward pass computes the gradients and updates the parameters. The cycle repeats until a loss-minimizing parameter configuration is found. The section closes with questions about how the brain learns, previewing the next part on the connection between biological learning and machine learning.
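The forward/backward cycle can be shown in miniature. In this sketch of mine (hypothetical data, not the video's code), we fit y = w·x + b by repeatedly computing the loss forward and the analytic gradients backward:

```python
# The training loop in miniature: fit y = w*x + b to data by alternating
# a forward pass (compute loss) and a backward pass (compute gradients,
# update parameters).
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]   # generated by y = 2x + 1
w, b, lr = 0.0, 0.0, 0.05

for _ in range(2000):
    grad_w = grad_b = 0.0
    for x, y in data:
        pred = w * x + b          # forward pass
        err = pred - y
        grad_w += 2 * err * x     # backward pass: d(err**2)/dw
        grad_b += 2 * err         # backward pass: d(err**2)/db
    w -= lr * grad_w / len(data)  # gradient-descent update
    b -= lr * grad_b / len(data)

# w and b approach 2 and 1, the values that generated the data
```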

Keywords

💡Machine learning

Machine learning is a technique that lets computers learn from data to make decisions or predictions. In the video it describes what systems such as GPT and AlphaFold have in common: all of them rely on backpropagation as the basis of learning.

💡Backpropagation

Backpropagation is an algorithm for training artificial neural networks. It computes the gradient of the loss function with respect to the network's parameters and updates the parameters to minimize the loss. It is one of the core algorithms of machine learning.

💡Loss function

A loss function measures the discrepancy between a model's predictions and the actual data; the goal of machine learning is to adjust parameters to minimize it. It reflects the model's performance.

💡Gradient descent

Gradient descent is an optimization algorithm that minimizes a loss function by iteratively adjusting parameters. It relies on computing the gradient of the loss with respect to the parameters and updating the parameters in the direction opposite to the gradient.

💡Neural network

A neural network is a computational model loosely inspired by the workings of the human brain, built from layers of neurons and capable of handling complex data patterns and sequences. Neural networks are widely used in machine learning and artificial intelligence.

💡Parameters

In a machine learning model, parameters are the adjustable variables that determine how the model maps inputs to outputs. During training, the parameters are adjusted to optimize the model's performance.

💡Computational graph

A computational graph is a graphical representation of the data flow and operations in a machine learning model. Each operation is a node in the graph, and the data flowing between operations forms the edges.

💡Chain rule

The chain rule is a fundamental concept in calculus for computing the derivative of composite functions. In machine learning it is used to compute the derivatives of complex functions composed of many simple ones.

💡Activation function

An activation function is a nonlinear function in a neural network. It introduces nonlinearity, allowing the network to learn and model more complex functions.

💡Learning rate

The learning rate is a hyperparameter of gradient descent that controls the step size of each parameter update. A well-chosen learning rate helps the model converge quickly to a good solution without oscillating around it or overshooting it.
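The effect of the learning rate is easy to demonstrate on a one-dimensional loss. In this sketch of mine (a hypothetical example, not from the video), each gradient step on L(k) = k² multiplies k by (1 − 2·lr), so small rates creep, moderate rates converge fast, and rates above 1 diverge:

```python
# Effect of the learning rate on gradient descent for L(k) = k**2,
# whose derivative is 2*k. Each update is k <- k - lr*2*k = (1 - 2*lr)*k.
def descend(lr, steps=50, k0=1.0):
    k = k0
    for _ in range(steps):
        k -= lr * 2 * k
    return k

small = descend(0.01)    # creeps toward 0 slowly
good = descend(0.4)      # converges quickly
too_big = descend(1.1)   # |1 - 2*lr| > 1: each step overshoots, so k diverges
```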

Highlights

Nearly all machine learning systems have one thing in common: the backpropagation algorithm.

Backpropagation underlies neural networks that solve different problems, use different architectures, and are trained on different data.

What enables artificial networks to learn, backpropagation, is also what makes them fundamentally different from biological brains.

The first modern formulation of backpropagation appeared in Seppo Linnainmaa's 1970 master's thesis.

In 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published a landmark paper on backpropagation.

Backpropagation enabled multilayer perceptrons to solve problems successfully and to learn meaningful representations in their hidden neuron layers.

The basic idea of training a neural network is minimizing a loss function.

The loss function is a single number that quantifies how well a model fits the data.

Gradient descent lets us efficiently find the parameter configuration that minimizes the loss function.

The gradient vector points in the direction of fastest increase; to minimize the loss, we adjust parameters in the opposite direction.

Backpropagation uses the chain rule to compute derivatives of complex functions.

The chain rule is central to machine learning: it lets us decompose complex functions into simple differentiable operations.

In a computational graph, backpropagation finds each parameter's effect on the loss.

Neural networks are composed of layers of differentiable operations, which is what allows their parameters to be optimized with backpropagation.

Although artificial networks can solve complex problems, their learning mechanism may be entirely different from the brain's.

The next part will explore synaptic plasticity in biological brains and its relationship to backpropagation.

Optimization problems in machine learning can be solved effectively with gradient descent and backpropagation.

Backpropagation applies not only to neural networks but to any model architecture that can be decomposed into differentiable operations.

With backpropagation, we can build neural networks large enough to approximate any function and thus tackle a wide range of problems.

Transcripts

play00:00

what do nearly all machine Learning

play00:02

Systems have in common from GPT and M

play00:05

journey to Alpha fold and various models

play00:07

of the brain despite being designed to

play00:10

solve different problems having

play00:12

completely different architectures and

play00:15

being trained on different data there is

play00:17

something that unites all of them a

play00:20

single algorithm that runs under the

play00:22

hood of the training procedures in all

play00:25

of those cases this algorithm called

play00:28

back propagation is the foundation of

play00:30

the entire field of machine learning

play00:33

although its details are often

play00:35

overlooked surprisingly what enables

play00:38

artificial networks to learn is also

play00:41

what makes them fundamentally different

play00:43

from the brain and incompatible with

play00:45

Biology this video is the first in a

play00:48

two-part Series today we will explore

play00:51

the concept of back propagation in

play00:53

arcial systems and develop an intuitive

play00:56

understanding of what it is why it works

play00:59

and how you could have developed it from

play01:01

scratch

play01:02

yourself in the next video we will focus

play01:05

on synaptic plasticity enabling learning

play01:08

in biological brains and discuss whether

play01:11

back propagation is biologically

play01:13

relevant and if not what kind of

play01:15

algorithms the brain may be using

play01:17

instead if you're interested stay

play01:27

tuned despite its transformative imp

play01:29

impact it's hard to say who invented

play01:32

back propagation in the first place as

play01:34

certain Concepts can be traced back to

play01:37

liins in 17th century however it is

play01:40

believed that the first modern

play01:41

formulation of the algorithm still in

play01:44

use today was published by sepo linar in

play01:47

his master's thesis in 1970 although he

play01:50

did not reference any neural networks

play01:53

explicitly another significant Milestone

play01:56

occurred in 1986 when David rumelhart

play01:59

Joffrey Hinton and Ronald Williams

play02:02

published a paper titled learning

play02:04

representations by back propagating

play02:06

errors they applied the back propagation

play02:09

algorithm to multi-layer perceptrons a

play02:12

type of a neural network and

play02:14

demonstrated for the first time that

play02:16

training with back propagation enables

play02:19

the network to successfully solve

play02:21

problems and develop meaningful

play02:23

representations at the hidden neuron

play02:24

level capturing important regularities

play02:27

in the task as the field progressed

play02:30

researchers scaled up these models

play02:33

significantly and introduced various

play02:35

architectures but the fundamental

play02:37

principles of training remained largely

play02:40

unchanged to gain a comprehensive

play02:43

understanding of what exactly it means

play02:44

to train a network let's try to build

play02:47

the concept of back propagation from the

play02:50

ground up consider the following problem

play02:53

suppose you have collected a set of

play02:55

points XY on the plane and you want to

play02:58

describe their relation ship to achieve

play03:01

this you need to fit a curve y of X that

play03:04

best represents the data since there are

play03:08

infinitely many possible functions we

play03:11

need to make some assumptions for

play03:13

instance let's assume we want to find a

play03:15

smooth approximation of the data using a

play03:18

polom of degree 5 that means that the

play03:22

resulting curve we're looking for will

play03:24

be a combination of a constant term a

play03:27

polinomial of degree Z a straight line a

play03:31

parabola and so on up to a power of five

play03:34

each weighted by specific coefficients

play03:37

in other words the equation for the

play03:39

curve is as follows where each K is some

play03:42

arbitrary real number our job then

play03:46

becomes finding the configuration of k0

play03:49

through K5 which leads to the best

play03:51

fitting curve to make the problem

play03:54

totally unambiguous we need to agree on

play03:56

what the best curve even means while you

play03:59

you can just visually inspect the data

play04:01

points and estimate whether a given

play04:04

curve captures the pattern or not this

play04:06

approach is highly subjective and

play04:08

impractical when dealing with large data

play04:11

sets instead we need an objective

play04:13

measurement a numerical value that

play04:16

quantifies the quality of a fit one

play04:19

popular method is to measure the square

play04:22

distance between data points and the

play04:24

fitted curve a high value suggests that

play04:27

the data points are significantly far

play04:30

from the curve indicating a poor

play04:33

approximation conversely low values

play04:36

indicate a better fit as the curve

play04:39

closely aligns with the data points this

play04:42

measurement is commonly referred to as a

play04:44

loss and the objective is to minimize it

play04:48

now notice that for a fixed data this

play04:51

distance the value of the loss depends

play04:55

only on the defining characteristics of

play04:57

the curve in our case the Coe ients from

play05:00

k0 through

play05:02

K5 this means that it is effectively a

play05:05

function of parameters so people usually

play05:08

refer to it as a loss function it's

play05:11

important not to confuse two different

play05:14

functions we are implicitly dealing with

play05:16

here the first one is the function y of

play05:19

X which has one input number and one

play05:22

output number and defines the curve

play05:24

itself it has this polinomial form given

play05:28

by K's there are infinitely many such

play05:31

functions and we would like to find the

play05:33

best one to achieve this we introduce a

play05:36

loss function which instead has six

play05:39

inputs numbers k0 through K5 and for

play05:43

each configuration it constructs the

play05:46

corresponding curve y calculates the

play05:48

distance between observed data points

play05:51

and the curve and outputs a single

play05:54

number the particular value of the loss

play05:58

our job then becomes finding the

play06:00

configuration of KS that yields a

play06:02

minimum loss or minimizing the loss

play06:05

function with respect to the

play06:07

coefficients then plugging these optimal

play06:10

cases into the general equation for the

play06:12

Curve will give us the best curve

play06:15

described in the data all right great

play06:18

but how do we find this magic

play06:20

configuration of case that minimizes the

play06:22

loss well we might need some help let's

play06:26

build a machine called Curve fitter 6000

play06:29

designed to simplify manual calculations

play06:33

it is equipped with six adjustable knobs

play06:36

for k0 through K5 which we can freely

play06:39

turn to begin we initialize the machine

play06:43

with our data points and then for each

play06:46

setting of The Knobs it will evaluate

play06:48

the curve y ofx compute the distance

play06:51

from it to the data points and print out

play06:55

the value of the loss function now we

play06:57

can begin twisting the knobs in order to

play07:00

find the minimum loss for example let's

play07:03

start with some initial setting and

play07:06

slightly noge Noob number one to the

play07:08

right the resultant curve changed as

play07:11

well and we can see that the value of

play07:13

the loss function slightly decreased

play07:16

great it means we are on the right track

play07:19

let's turn knob number one in the same

play07:21

direction once again uh-oh this time the

play07:24

fit gets worse and the loss function

play07:26

increases apparently that last noge was

play07:29

a bit too much so let's revert the knob

play07:32

to the previous position and try knob

play07:34

two and we can keep doing this

play07:36

iteratively many many times nudging each

play07:40

individual knob one at a time to see

play07:43

whether the resulting curve is a better

play07:45

fit this is a so-called random

play07:48

perturbation method since we are

play07:50

essentially wandering in the dark not

play07:53

knowing in advance how each adjustment

play07:56

will affect the loss function this would

play07:58

certainly work but it's not very

play08:00

efficient is there a way we can be more

play08:03

intelligent about the knob adjustments

play08:06

in the most General case when the

play08:08

machine is a complete Black Box nothing

play08:11

better than a random perturbation is

play08:13

guaranteed to exist however a great deal

play08:16

of computations including what's carried

play08:19

out under the hood of our curve fitter

play08:22

have a special property to them

play08:24

something called differentiability that

play08:26

allows us to compute the optimal knob

play08:28

setting much more efficiently we will

play08:31

dive deeper into what differentiability

play08:33

means in just a minute but for now let's

play08:36

quickly see the big picture overview of

play08:39

where we are going our goal would be to

play08:42

upgrade the machine so that it would

play08:45

have a tiny screen next to each knob and

play08:48

for any configuration those screens

play08:51

should say which direction you need to

play08:54

nudge each knob in order to decrease the

play08:56

loss function and by how much

play09:00

think about it for a second we are

play09:02

essentially asking the machine to

play09:04

predict the future and estimate the

play09:07

effect of the noob adjustment on the

play09:09

loss function without actually

play09:12

performing that adjustment calculating

play09:14

the loss and then reverting the knob

play09:16

back like we did previously wouldn't

play09:19

this glance into the future violate some

play09:22

sort of principle after all we are

play09:25

jumping to the result of the computation

play09:28

without performing in it sounds like

play09:30

cheating right well it turns out that

play09:34

this idea lies on a very simple

play09:36

mathematical foundation so let's spend

play09:38

the next few minutes building it up from

play09:42

scratch all right let's consider a

play09:45

simpler case first where we freeze five

play09:48

out of six knobs for example suppose

play09:50

someone tells you that the rest of them

play09:52

are already in the optimal position so

play09:55

all you need to do is to find the best

play09:57

value for one remaining kn

play10:00

essentially the machine now has only one

play10:02

variable parameter K1 that we can tweak

play10:06

and so the loss function is also a

play10:08

simpler function which accepts one

play10:10

number the knob setting and outputs

play10:13

another number the loss value as a

play10:15

function of one variable it can be

play10:18

conveniently visualized as a graph in a

play10:20

two-dimensional plane which captures the

play10:23

relationship between the input and the

play10:24

output for example it may have this

play10:27

shape right here and our goal goal is to

play10:30

find this value of K1 which corresponds

play10:33

to the minimum of the loss function but

play10:35

we don't have access to the true

play10:37

underlying shape all we can do is to set

play10:40

the knob at a chosen position and kind

play10:43

of query the machine for the value of

play10:45

the loss in other words we can only

play10:48

sample individual points along the

play10:50

function we're trying to minimize and we

play10:53

are essentially blind to how the

play10:55

function behaves in between the known

play10:57

points before we sample them but suppose

play11:00

we would like to know something more

play11:02

about the function not just each value

play11:04

at each point for example whether at

play11:07

this point the function is going up or

play11:10

down this information will ultimately

play11:13

guide our adjustments because if you

play11:16

know that the function is going down as

play11:18

you increase the input turning the knob

play11:20

to the right is a safe bad since you are

play11:23

guaranteed to decrease the loss with

play11:26

this manipulation let's put this notion

play11:28

of going up or down around a point on a

play11:32

stronger mathematical ground suppose we

play11:34

have just sampled the point x KN y KN on

play11:37

this graph what we can do is increase

play11:40

the input by a small amount Delta X this

play11:44

new adjusted input will result in a new

play11:47

value of y which will differ from the

play11:49

old value by some Delta y this Delta

play11:53

depends on the magnitude of our

play11:55

adjustment for example if we take a step

play11:58

Delta X which is 10 times smaller Delta

play12:01

y will also be approximately 10 times as

play12:05

small this is why it makes sense to take

play12:08

the ratio Delta y over Delta X the

play12:12

amount of change in the output per unit

play12:15

change in the input graphically this

play12:18

ratio corresponds to a slope of a

play12:20

straight line going through the points X

play12:23

not Y and X Plus Delta X Y KN plus Delta

play12:27

Y no notice that as we take smaller and

play12:31

smaller steps this straight line will

play12:34

more and more accurately align with the

play12:36

graph in the neighborhood of the point x

play12:39

y KN let's take a limit of this ratio as

play12:43

Delta X goes to infinitely small values

play12:47

then this limiting case value which this

play12:50

ratio converges to for infinitesimally

play12:53

small Delta X's is what is called the

play12:56

derivative of a function and it is dened

play12:59

Ed by dy/ DX visually the derivative of

play13:03

a function at some point is the slope of

play13:07

the line that is tangent to the graph

play13:10

and thus corresponds to the

play13:11

instantaneous rate of change or

play13:13

steepness of that function around that

play13:16

point but different points along the

play13:18

graph might have different stiffness

play13:21

values so the derivative of the entire

play13:24

function is not a single number in fact

play13:27

the derivative Dy by DX X is itself a

play13:30

function of X that takes an arbitrary

play13:34

value of x and outputs the local

play13:37

steepness of Y ofx at that point this

play13:40

definition assigns to every function its

play13:43

derivative Alter Ego another function

play13:46

operating on the same input domain which

play13:49

carries information about the steepness

play13:52

of the original function there is a bit

play13:54

of a subtlety strictly speaking the

play13:57

derivative may not exist if the function

play13:59

doesn't have a steepness around some

play14:02

point for example if it has sharp

play14:04

corners or

play14:06

discontinuities however for the

play14:08

remainder of the video we are going to

play14:10

assume that all functions we are dealing

play14:12

with are smooth so that the derivative

play14:15

always

play14:16

exists this is a reasonable claim

play14:19

because we can control what sort of

play14:21

functions go into our models when we

play14:24

build them and people usually restrict

play14:26

everything to smooth or differential

play14:29

functions to make all the math work out

play14:31

nicely all right great now along with

play14:34

the underlying loss as a function of K1

play14:37

which is hidden from us we can also

play14:40

reason about its derivative another

play14:43

function of K1 which we also don't know

play14:46

that is equal to the steepness of the

play14:48

loss function at that

play14:50

point let's suppose that similarly to

play14:53

how we can query the loss function by

play14:55

running our machine and obtaining

play14:57

individual samples

play14:59

there is a mechanism for us to sample

play15:02

the derivative function as

play15:04

well so for every input value of K1 the

play15:08

machine will output the value of the

play15:11

loss and the local steepness of the loss

play15:14

function around that point notice that

play15:17

this derivative information is exactly

play15:19

the sort of look into the future we were

play15:22

looking for to make smarter knob

play15:24

adjustments for example let's use it to

play15:27

efficiently find the the optimal value

play15:29

of K1 what we can do is the following

play15:34

first start at some random position ask

play15:37

the machine for a value of the loss and

play15:40

the derivative of the loss function at

play15:42

that

play15:43

position take a tiny step in the

play15:46

direction opposite of the derivative if

play15:49

the derivative is negative it means that

play15:51

the function is going down and so if we

play15:54

want to arrive at the minimum we need to

play15:57

move in the direction of increas in

play15:59

value of K1 repeat this procedure until

play16:03

you reach the point where the derivative

play16:05

is zero which essentially corresponds to

play16:08

the minimum where the tangent line is

play16:10

flat essentially each adjustment in such

play16:13

a guided fashion Works kind of like a

play16:16

ball rolling down the hill along the

play16:18

graph until it reaches a

play16:22

valley although in the beginning we

play16:24

froze five out of six knobs for

play16:27

Simplicity this process is easily

play16:30

carried out to higher

play16:32

dimensions for example suppose now we

play16:34

are free to tweak two different knobs K1

play16:37

and

play16:38

K2 the loss would become a function of

play16:42

two variables which can be visualized as

play16:45

a surface but what about the derivative

play16:48

recall that by definition the derivative

play16:51

at each point tells us how the output

play16:54

changes per unit change of the input but

play16:58

now we have two different inputs should

play17:00

we nudge only K1 K2 or

play17:04

both essentially our function will have

play17:08

two different derivatives that are

play17:11

usually called partial derivatives

play17:13

because of this ambiguity which input to

play17:16

notch namely when we have two knobs the

play17:19

derivative of the loss function with

play17:21

respect to parameter K1 is written like

play17:25

this it is how much the output changes

play17:29

per unit change in K1 if you hold K2

play17:33

constant and conversely this expression

play17:36

tells you the rate of change of the

play17:38

output if you hold K1 constant and

play17:42

slightly noge K2 geometrically you can

play17:45

imagine slicing the surface with planes

play17:49

parallel to the axis intersecting at the

play17:51

point of Interest K1 K2 so that each of

play17:55

the two cross-sections is like a

play17:58

one-dimensional graph of the loss as a

play18:01

function of one variable while the other

play18:03

one is kept constant then the slope of a

play18:06

tangent line at each cross-section will

play18:09

give you a corresponding partial

play18:11

derivative of the loss at that point

play18:14

while thinking about partial derivatives

play18:16

as two separate surfaces one for each

play18:19

variable is a perfectly valid way people

play18:23

usually plug the two different values

play18:26

into a vector called a gradient Vector

play18:30

essentially this is a mapping from two

play18:32

input values to another two numbers

play18:36

where the first signifies how much the

play18:38

output changes per tiny change in the

play18:41

first input and similarly for the second

play18:45

input geometrically this Vector points

play18:48

in the direction of steepest Ascent so

play18:51

if you want to minimize a function like

play18:55

in the case for our loss we need to take

play18:58

steps in in the direction opposite to

play19:00

this

play19:01

gradient this iterative procedure of

play19:04

noding the parameters in the direction

play19:07

opposite of the gradient Vector is

play19:10

called gradient descent which you have

play19:12

probably heard of this is analogous to a

play19:15

ball rolling down the hill for the

play19:16

two-dimensional case and the partial

play19:19

derivatives essentially tell you which

play19:22

direction is downhill going Beyond two

play19:25

Dimensions is impossible to visualize

play19:28

directly but the math stays exactly the

play19:31

same for instance if we are now free to

play19:34

tweak all the six knobs the loss

play19:37

function is a hyper surface in Six

play19:40

Dimensions and the gradient Vector now

play19:43

has six numbers packed into it but it

play19:46

still points in the direction of

play19:48

steepest Ascent so if we iteratively

play19:51

take small steps in the direction

play19:54

opposite to it we are going to roll the

play19:57

ball down the hill in Six Dimensions and

play20:00

eventually reach the minimum of the loss

play20:03

function great let's back up a bit

play20:06

remember how we were looking for ways to

play20:08

add screens next to each knob that would

play20:11

give us the direction of optimal

play20:14

adjustment well it is essentially

play20:16

nothing more but the components of the

play20:19

gradient Vector if at a particular

play20:21

configuration the partial derivative of

play20:24

the loss with respect to K1 is positive

play20:27

it means that increasing K1 will lead to

play20:31

increased loss so we need to decrease

play20:34

the value of the knob by turning it to

play20:36

the left and similarly for all other

play20:40

parameters this is how the derivatives

play20:43

serve as these windows into the future

play20:46

by providing us with information about

play20:49

local behavior of the function and once

play20:52

we have a way of accessing the

play20:54

derivative we can perform gradient

play20:56

descent and efficiently find the minimum

play20:59

of the loss function thus solving the

play21:02

optimization problem however there is an

play21:05

elephant in a room so far we have

play21:08

implicitly assumed the derivative

play21:10

information is given to us or that we

play21:13

can sample the derivative at a given

play21:15

point similarly to how we sample the

play21:18

loss function Itself by running the

play21:20

calculation of the machine but how do

play21:22

you actually find the derivative as we

play21:25

will see further this is the main

play21:27

purpose of the back propagation

play21:29

algorithm essentially the way we find

play21:32

derivatives of arbitrarily complex

play21:34

functions is the following first there

play21:36

are a handful of building blocks to

play21:39

begin with simple functions derivatives

play21:41

of which are known from calculus these

play21:45

are the kind of derivative formulas you

play21:47

often memorize in college for example if

play21:50

the function is linear it's pretty clear

play21:53

that its derivative will be a constant

play21:55

equal to the slope of that line

play21:58

everywhere

play21:59

which coincides with its own tangent

play22:01

line a parabola x² becomes more steep as

play22:06

you increase X and its derivative is

play22:09

actually

play22:10

2x in fact there is a more general

play22:13

formula for the derivative of x to the^

play22:16

of n similarly derivatives of the

play22:19

exponent and logarithm can be written

play22:21

down

play22:22

explicitly but these are just individual

play22:25

examples of simple well-known functions

play22:28

in order to compute arbitrary

play22:31

derivatives we need a way to combine

play22:34

such Atomic building blocks

play22:36

together there are a few rules how to do

play22:39

it for instance the derivative of a sum

play22:43

of two functions is the sum of the

play22:45

derivatives there is also a formula for

play22:48

the derivative of a product of two

play22:50

functions this gives you a way to

play22:53

compute things like the derivative of

play22:55

3x^2 - e ^ of x but to complete the

play22:59

picture and to be able to find

play23:02

derivatives of almost everything we need

play23:05

one other rule called The Chain rule

play23:08

which Powers the entire field of machine

play23:11

learning it tells you how to compute the

play23:14

derivative of a combination of two

play23:16

functions when one of them is an input

play23:19

to another here is a way to reason about

play23:22

this suppose you take one of those

play23:25

simpler machines which receives a single

play23:28

input x that you can vary with a knob

play23:31

and spits out an output J of X now you

play23:36

take a second machine of this kind which

play23:39

performs a different function f of

play23:42

x what would happen if you connect them

play23:45

in sequence so that the output of the

play23:48

first machine is fed into the second one

play23:51

as an

play23:52

input notice that such a construction

play23:56

can be thought of as a single function

play24:58

into the second function is J of X and

play25:01

so the local rate of change of the

play25:04

second machine is thus the derivative of

play25:07

f evaluated at the point J of X now

play25:11

imagine you nudge the knob X by a tiny

play25:14

amount Delta that input nudge when it

play25:17

comes out of the first machine will be

play25:20

multiplied by the derivative of J since

play25:23

the derivative is the rate of change in

play25:25

the output per unit change of the input

play25:28

so after the first function the output

play25:30

will increase by Delta multiplied by the

play25:34

derivative of J this expression is

play25:37

essentially a tiny nudge in the input to

play25:40

the second machine whose derivative at

play25:42

that point is given by this expression

play25:46

this means that for each Delta increase

play25:49

in the input we bump the output by this

play25:53

much hence the derivative when you

play25:56

divide that by Delta will look like this

You can think about it as a set of three interconnected cogwheels, where the first one represents the input knob x, and the other two wheels are the functions J(x) and f(J(x)) respectively. When you nudge the first wheel, it induces a nudge in the middle wheel, and the amplitude of that change is given by the derivative of J, which in turn causes the third wheel to rotate, the amplitude of that resulting nudge being given by chaining the derivatives together.

All right, great: now we have a straightforward way of obtaining the derivative of any arbitrarily complex function, as long as it can be decomposed into building blocks, simple functions with explicit derivative formulas, such as summations, multiplications, exponents, logarithms, etc. But how can it be used to

find the best curve using our curve fitter? The big picture we are aiming for is the following: for each of our parameter knobs, we will write down its effect on the loss in terms of simple, easily differentiable operations. Once we have that sequence of building blocks, no matter how long, we should be able to sequentially apply the chain rule to each of them in order to find the value of the derivative of the loss function with respect to each of the input knobs, and perform iterative gradient descent to minimize the loss.

Let's see an example of this. First, we are going to create a knob for each number the loss function can possibly depend on. This obviously includes the parameters, but there is also the data itself: the coordinates of the points to which we are fitting the curve in the first place. Now, during optimization the data points are set in stone, so changing them in order to obtain a lower loss would make no sense. However, for conceptual purposes, we can think about these values as fixed knobs, set in one position so that we cannot turn them. Once we have all the existing numbers being fed into the machine, we can start to break down the loss

calculation. Remember, by definition it is the sum of squared vertical distances from each point to the curve parameterized by the k's. So, for instance, let's take the first data point (x₁, y₁): multiply the x coordinate by k₁, add that to the squared value of x₁ multiplied by k₂, and so on for the other k's, including the constant term k₀. This sum of weighted powers of x₁ is the value of y predicted by the current curve, f(x₁); let's call it ŷ₁. Next, we need to take the squared difference between the actual value and the predicted value. This is how much the first data point contributes to the resulting value of the loss function. Repeating the same procedure for all remaining data points and summing up the resulting squared distances gives us the overall total loss that we are trying to minimize. The computation we just performed, finding the value of the loss for a given configuration of parameter and data knobs, is known as the forward

step. The entire sequence of calculations can be visualized as this kind of computational graph, where each node is some simple operation, like addition or multiplication. The forward step then corresponds to computations flowing from left to right. But to perform optimization, we also need information about gradients: how each knob influences the loss. Now we are going to do what's known as the backward step, and unroll the sequence of calculations in reverse order to find the derivatives. What makes the backward step possible is the fact that every node in our compute graph is an easily differentiable operation. Think of individual nodes as tiny machines which simply add, multiply, or take powers: we know their derivatives, and because their outputs are connected sequentially, we can apply the chain rule. This means that for each node we can find its gradient, the partial derivative of the output loss with respect to that node. Let's see how it can be done.

Consider a region of the compute graph where two number nodes, a and b, are being fed into a machine that performs addition, and its result, a + b, is further processed by the system to compute the overall output, L. Suppose we already computed the gradient of a + b earlier, so that we know how nudging this sum will affect the output. The question is: what are the individual gradients of a and b? Well, intuitively, if you nudge a by some amount, a + b will be nudged by the same amount, so the gradient, the partial derivative of the loss with respect to a, is the same as the gradient of the sum, and similarly for b. This can be seen more formally by writing down the chain rule and noticing that the derivative of a + b with respect to a is just one. In other words, when you encounter this situation in the compute graph, the gradient of the sum simply propagates into the gradients of the nodes that plug into the sum machine.

Another possible scenario is when a and b are multiplied. Just like before, suppose we know the gradient of their product, because it was computed earlier. In this case, an individual nudge to a will be scaled by a factor of b, so the product will be nudged b times as much, which propagates into the output. So whatever the derivative of the output with respect to the product a·b is, the output derivative with respect to a will get scaled by a factor of b, and vice versa for the gradient of b. Once again, this can be seen more formally by examining the chain rule. In other words, the multiplication node in the compute graph distributes the downstream gradient across the incoming nodes by multiplying it crossways by their

values. Similar rules can easily be formulated for the other building-block calculations, such as raising a number to a power or taking a logarithm. Finally, when a single node takes part in multiple branches of the compute graph, the gradients from the corresponding branches are simply added together. Indeed, suppose you have the following structure in the graph, where the same node a plugs into two different operations that contribute to the overall loss. Then, if you nudge a by Δ, the output will be simultaneously nudged by the derivative from the first branch and the derivative from the second branch, so the overall effect of nudging a will be the sum of the two gradients. All right, great: now that

we have constructed a computational graph and established how to process individual chunks of it, we can just sequentially apply those rules, starting from the output and working our way backwards. For instance, the rightmost node in the graph is the resulting value of the loss function. How does an incremental change in that node affect the output? Well, it is the output, so its gradient is by definition equal to one. Next, the loss function is the sum of many Δy² terms, and we know what to do with the summation node: it just copies whatever gradient value sits to the right of it into all the incoming nodes. Consequently, the gradients of all the Δy² terms will also be equal to one. Each of those nodes is the squared value of the corresponding Δy, and we know how to differentiate the squaring operation: the derivative of the loss function with respect to Δy₁ will be 2 · Δy₁, which is just a number we found during the forward calculation. And we can keep propagating this sequential derivative calculation backwards along our compute graph, until we reach the leftmost nodes, which are the data and parameter knobs.
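The backward walk just described can be sketched as a tiny scalar autodiff engine. This is a simplified illustration in the spirit of the video, not code from it: each Value node stores its number, its gradient, and the local derivative rules of the operation that produced it. Sums copy the downstream gradient, products cross-multiply it, and fan-out accumulates by addition.

```python
class Value:
    """One knob/node in the compute graph."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data                 # result of the forward step
        self.grad = 0.0                  # dLoss/d(this node), filled by backward()
        self._parents = parents
        self._local_grads = local_grads  # d(this node)/d(each parent)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # Sum rule: the local derivative with respect to each input is 1.
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # Product rule: each input's gradient is scaled by the other factor.
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self, upstream=1.0):
        # Fan-out rule: gradients arriving along different branches add up ...
        self.grad += upstream
        # ... and the chain rule pushes the gradient on to the parents.
        for parent, local in zip(self._parents, self._local_grads):
            parent.backward(upstream * local)

# One term of the curve-fitting loss: loss = (k1*x + k0 - y)^2 for a single point.
k0, k1 = Value(0.5), Value(-1.0)
x, y = 2.0, 3.0
delta_y = k1 * x + k0 + (-y)
loss = delta_y * delta_y

loss.backward()  # backward step: the rightmost gradient starts at 1
print(loss.data, k0.grad, k1.grad)  # → 20.25 -9.0 -18.0
```

The printed gradients match the hand-derived values 2·Δy·1 and 2·Δy·x, and note that the squaring node is handled automatically by the fan-out rule, since delta_y feeds into the product twice.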

The derivatives of the loss with respect to the input data don't really matter, but the derivatives with respect to the parameters are exactly what we want. Once these parameter gradients are found, we can perform one iteration of gradient descent: namely, we're going to slightly tweak the knobs in the directions opposite to the gradient, the exact magnitude of each adjustment being the negative product of the gradient and some small number called the learning rate, for example 0.01. Note that after the adjustment is performed, the configuration of the machine and the resulting loss are different, and so the old gradient values we found no longer hold; we need to run the forward and backward calculations once again to obtain updated gradients and a new, decreased loss.

Performing this loop of forward pass, backward pass, nudge, repeat is the essence of training every modern machine learning system, and exactly the same algorithm is used today in even the most complicated models. As long as the problem you're trying to solve with a given model architecture can be decomposed into individual operations that are differentiable, you can sequentially apply the chain rule many times to arrive at the optimal setting of the parameters. For instance, a feed-forward neural network is essentially a bunch of multiplications and summations, with a few nonlinear activation functions sprinkled between the layers.
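As a concrete sketch of that loop, here is a tiny one-hidden-layer network trained by hand-written backpropagation. This is a toy illustration under assumptions of my own (the sin-curve data, layer sizes, and learning rate are arbitrary choices, not from the video): forward pass, backward pass via the chain rule, nudge every knob against its gradient, repeat.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: learn y = sin(x) on a handful of points.
X = np.linspace(-3, 3, 32).reshape(-1, 1)
Y = np.sin(X)

# One hidden layer of 16 tanh units: multiplications, summations, a nonlinearity.
W1, b1 = rng.normal(0, 0.5, (1, 16)), np.zeros(16)
W2, b2 = rng.normal(0, 0.5, (16, 1)), np.zeros(1)
lr = 0.1  # the learning rate

for step in range(3000):
    # Forward pass: compute predictions and the mean squared loss.
    H = np.tanh(X @ W1 + b1)
    Y_hat = H @ W2 + b2
    delta = Y_hat - Y
    loss = (delta ** 2).mean()
    if step == 0:
        loss0 = loss  # remember the starting loss to see the improvement

    # Backward pass: chain rule applied layer by layer, right to left.
    d_yhat = 2 * delta / len(X)     # dLoss/dY_hat
    dW2 = H.T @ d_yhat              # product rule at the output layer
    db2 = d_yhat.sum(0)
    dH = d_yhat @ W2.T              # gradient flowing back into the hidden layer
    dZ = dH * (1 - H ** 2)          # through tanh: d tanh(z)/dz = 1 - tanh(z)^2
    dW1 = X.T @ dZ
    db1 = dZ.sum(0)

    # Nudge every knob in the direction opposite to its gradient, then repeat.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(loss0, loss)  # the loss shrinks as the loop runs
```

Every line of the backward pass is one application of the local rules from the compute-graph discussion; swapping in a deep-learning framework would only automate this bookkeeping, not change the algorithm.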

Each of those atomic computations is differentiable, so you can construct the compute graph and run the backward pass on it to find how each parameter, like the connection weights between neurons, influences the loss function. And because neural networks, given enough neurons, can in theory approximate any function imaginable, we can create a large enough sequence of these building-block mathematical machines to solve problems such as classifying images, and even generating new text.

This seems like a very elegant and efficient solution: after all, if you want to solve an optimization problem, derivatives tell you exactly which adjustments are necessary. But how similar is this to what the brain actually does when we learn to walk, speak, and read? Is the brain also minimizing some sort of loss function? Does it calculate derivatives,

or could it be doing something totally different? In the next video, we are going to dive into the world of synaptic plasticity and talk about how biological neural networks learn.

In keeping with the topic of biological learning, I'd like to take a moment to give a shout-out to Shortform, a longtime partner of this channel. Shortform is a platform which [...]

Stay tuned for more interesting topics coming up. Goodbye, and thank you for the interest in the brain.
