Heroes of Deep Learning: Andrew Ng interviews Geoffrey Hinton
Summary
TLDR In this in-depth interview, deep learning pioneer Geoffrey Hinton shares his personal story and his contributions to artificial intelligence and machine learning. Hinton's curiosity about how the brain stores memories began in high school; he moved through physiology, physics, philosophy and psychology before committing to AI research at the University of Edinburgh. He held on to his belief in neural networks even in the face of opposition and career setbacks. Hinton recounts the history of developing the backpropagation algorithm with David Rumelhart and Ron Williams, and how they overcame obstacles to publish the Nature paper on word vectors and learned semantic features. He also discusses his work on deep belief networks, variational methods and Boltzmann machines, and his views on the future of deep learning, including his current research on capsule networks, a new neural network architecture intended to improve generalization. He closes with advice for people who want to enter deep learning, stressing the importance of intuition and of never stopping programming.
Takeaways
- 🧠 Contributions to deep learning: Geoff Hinton is hailed as a godfather of deep learning and has made enormous contributions to the field.
- 📚 Personal story: Hinton's academic interest began in high school, when curiosity about how the brain stores memories led him toward artificial intelligence and machine learning.
- 🔄 Academic turns: Hinton studied physiology and physics at Cambridge, switched to philosophy, and finally chose psychology, but found psychological theories inadequate for explaining how the brain works, so he turned to AI.
- 🤝 Collaboration and conflict: at Edinburgh he argued with others over neural networks versus symbolic AI, but he stuck to his belief in neural networks.
- 📈 Revival of neural networks: in the 1980s Hinton, David Rumelhart and Ron Williams developed the backpropagation algorithm; although others had invented it earlier, their work drove the community's acceptance of it.
- 🏆 Key achievements: Hinton is especially proud of the work on Boltzmann machines with Terry Sejnowski, and of the development of restricted Boltzmann machines and deep belief networks.
- 🔧 Technical innovation: Hinton's work on the ReLU (rectified linear unit) showed it is almost exactly equivalent to a stack of logistic units, a finding that helped ReLU catch on.
- 🧐 Brains and learning algorithms: Hinton argues that if backpropagation is an excellent learning algorithm, evolution could well have implemented some form of it, even if not exactly the same.
- 📉 Multiple time scales: Hinton discusses his ideas for handling multiple time scales in deep learning, including the "fast weights" concept he first presented in 1973.
- ⚙️ Capsule networks: Hinton is pushing the concept of capsules, a new deep learning network structure designed to handle multi-dimensional entities and improve generalization.
- 📚 Research advice: Hinton advises new researchers to read enough of the literature to form intuitions, then to trust and follow those intuitions even when they run against the mainstream.
Q & A
How did Geoff Hinton become interested in artificial intelligence and neural networks in high school?
- In high school, a classmate introduced Hinton to holograms and to the theory that the brain might store memories in a hologram-like, distributed way. That sparked his curiosity about how the brain stores memories and, in turn, his interest in AI and neural networks.
Which subjects did Hinton study first at Cambridge?
- He began with physiology and physics, and was the only undergraduate at the time taking both.
How did Hinton move from psychology to artificial intelligence?
- He was initially drawn to psychology but found its theories too simple to explain how the brain works. He then tried philosophy, but felt it lacked ways of telling when a claim was false. He finally decided to switch to AI and went to the University of Edinburgh to study it.
How did the research environment in California differ from Britain?
- In Britain neural networks were seen as old-fashioned, whereas in California people such as Don Norman and David Rumelhart were very open to neural network ideas, which allowed Hinton to explore them much more freely.
How did Hinton and David Rumelhart develop the backpropagation algorithm?
- Hinton, Rumelhart and Ron Williams developed backpropagation together. Other researchers turned out to have invented it independently, but their work helped the community accept the algorithm widely.
Why was the backpropagation algorithm widely accepted after the 1986 paper?
- In the 1986 paper, Hinton and colleagues showed that backpropagation could learn representations of words, and that the semantic features of a word could be read off those representations. The paper was accepted by Nature and marked a turning point in the acceptance of backpropagation (a toy sketch of this setup follows).
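For readers who want to see the shape of that 1986 experiment, here is a minimal, hypothetical sketch in Python (PyTorch): the names, vector sizes and training loop are illustrative assumptions rather than the original paper's configuration, but it shows a network predicting the third word of a family-tree triple and learning person "feature vectors" along the way.

```python
# Hypothetical toy reconstruction of the family-tree setup Hinton describes.
# People, relations, dimensions and hyperparameters are all assumptions.
import torch
import torch.nn as nn

people = ["Mary", "Victoria", "Arthur", "James"]            # toy vocabulary
relations = ["mother", "father", "son"]                     # toy relations
triples = [("Mary", "mother", "Victoria"),                  # (person1, relation, person2)
           ("Arthur", "father", "James")]

p_idx = {p: i for i, p in enumerate(people)}
r_idx = {r: i for i, r in enumerate(relations)}

class TripleNet(nn.Module):
    def __init__(self, dim=6):
        super().__init__()
        self.person_emb = nn.Embedding(len(people), dim)    # learned feature vectors for people
        self.rel_emb = nn.Embedding(len(relations), dim)
        self.hidden = nn.Linear(2 * dim, 12)
        self.out = nn.Linear(12, len(people))                # predict the third word

    def forward(self, p1, rel):
        x = torch.cat([self.person_emb(p1), self.rel_emb(rel)], dim=-1)
        return self.out(torch.relu(self.hidden(x)))

model = TripleNet()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for _ in range(200):                                         # train by backprop on the triples
    p1 = torch.tensor([p_idx[t[0]] for t in triples])
    rel = torch.tensor([r_idx[t[1]] for t in triples])
    target = torch.tensor([p_idx[t[2]] for t in triples])
    opt.zero_grad()
    loss_fn(model(p1, rel), target).backward()
    opt.step()

# After training, the rows of person_emb.weight are the learned word features
print(model.person_emb.weight.detach())
```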
Which piece of his work on neural networks and deep learning does Hinton find most beautiful?
- The work on Boltzmann machines with Terry Sejnowski. They found a simple learning algorithm for large, densely connected networks, in which each synapse only needs to know about the behaviour of the two neurons it directly connects.
What practical successes have restricted Boltzmann machines (RBMs) had?
- Restricted Boltzmann machines were one ingredient of the winning entry in the Netflix competition, and from around 2007 the work on RBMs and deep belief networks played an important role in the resurgence of neural networks and deep learning (a small sketch of RBM training follows).
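As a rough illustration of the "one iteration" shortcut that turned Boltzmann machines into practical restricted Boltzmann machines, here is a hedged NumPy sketch of a single contrastive-divergence (CD-1) update; the sizes, data and learning rate are assumptions for illustration only, not Hinton's original settings.

```python
# A minimal sketch of CD-1 training for a binary restricted Boltzmann machine.
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 3, 0.1
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b_v = np.zeros(n_visible)             # visible biases
b_h = np.zeros(n_hidden)              # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0):
    """One contrastive-divergence step: a single up-down-up pass."""
    global W, b_v, b_h
    # positive phase: hidden probabilities given the data
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(n_hidden) < p_h0).astype(float)
    # negative phase: one reconstruction step (the "one iteration" shortcut)
    p_v1 = sigmoid(h0 @ W.T + b_v)
    p_h1 = sigmoid(p_v1 @ W + b_h)
    # each weight update only involves the two units that the weight connects
    W += lr * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))
    b_v += lr * (v0 - p_v1)
    b_h += lr * (p_h0 - p_h1)

v = np.array([1, 0, 1, 1, 0, 0], dtype=float)   # one toy binary example
for _ in range(100):
    cd1_update(v)
print(np.round(sigmoid(v @ W + b_h), 2))        # hidden features inferred from v
```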
What work did Hinton do with Radford Neal on variational methods?
- Their work showed that you do not need to perform a perfect expectation-maximization (EM) step; an approximate E-step can still improve the algorithm substantially. In 1993, with Van Camp, Hinton also published what is arguably the first variational Bayes paper, showing how to approximate the true posterior with a Gaussian and carry this out in a neural network (the bound behind this is sketched below).
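The idea behind the approximate E-step can be written as a standard variational bound. The notation below is a generic textbook form, stated as an assumption about the setup rather than the exact formulation of the Neal–Hinton or Hinton–van Camp papers.

```latex
% Free-energy view of EM (generic form, assumed notation)
\log p(x \mid \theta)
  \;\ge\; \mathcal{F}(q,\theta)
  \;=\; \mathbb{E}_{q(h)}\!\left[\log p(x, h \mid \theta)\right]
        \;-\; \mathbb{E}_{q(h)}\!\left[\log q(h)\right].
% Exact EM sets q(h) = p(h | x, \theta) in the E-step.
% An approximate E-step only needs to increase \mathcal{F} with respect to q
% (for example a Gaussian q, as in variational Bayes), and the M-step still
% increases \mathcal{F} with respect to \theta, so the bound never decreases.
```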
What is Hinton's view of the ReLU (rectified linear unit) activation function?
- Hinton showed that a ReLU is almost exactly equivalent to a stack of logistic units, which helped ReLU catch on. He also mentions a talk at Google in which he showed that, using ReLUs with identity-matrix initialization, networks with 300 hidden layers can be trained very efficiently (a numerical check of the equivalence is sketched below).
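To make the equivalence concrete, here is a small numerical sketch; the number of copies and the bias offsets are illustrative choices, not the paper's derivation. Summing many logistic units whose biases are shifted by 0.5, 1.5, 2.5, ... approximates softplus, which in turn tracks max(0, x).

```python
# Hedged check: a "stack" of shifted logistic units behaves like a ReLU.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stacked_logistics(x, n_copies=100):
    # many copies of the same logistic unit, each with its bias shifted by (i - 0.5)
    offsets = np.arange(1, n_copies + 1) - 0.5
    return sigmoid(x[:, None] - offsets[None, :]).sum(axis=1)

x = np.linspace(-5, 10, 7)
print(np.maximum(0.0, x))            # ReLU
print(np.log1p(np.exp(x)))           # softplus, the smooth version
print(stacked_logistics(x))          # close to softplus, hence close to ReLU for larger x
```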
How does Hinton view the idea of capsules, and what role could they play in deep learning?
- Capsules are a new kind of representation: each capsule represents one instance of a feature together with its many properties. Through "routing by agreement", capsules decide how parts should be combined into wholes, which could help neural networks generalize better, handle changes in viewpoint, and perform segmentation (a toy routing sketch follows).
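The following toy sketch is a hypothetical illustration of routing by agreement in the spirit of Hinton's mouth-and-nose example; the poses, transforms and agreement threshold are made-up numbers, not a capsule-network implementation.

```python
# Toy "routing by agreement": two part capsules vote for the pose of a face.
import numpy as np

def predict_face_pose(part_pose, part_to_face_transform):
    # each capsule votes for the parent's pose via its own (here hand-picked) transform
    return part_to_face_transform @ part_pose

mouth_pose = np.array([2.0, 5.0, 1.0])        # e.g. (x, y, scale) of the mouth
nose_pose = np.array([2.0, 6.0, 1.0])         # nose slightly above the mouth

# hypothetical transforms mapping a part pose to the expected face pose
mouth_to_face = np.array([[1, 0, 0], [0, 1, 2], [0, 0, 1]], dtype=float)
nose_to_face = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 1]], dtype=float)

vote_from_mouth = predict_face_pose(mouth_pose, mouth_to_face)
vote_from_nose = predict_face_pose(nose_pose, nose_to_face)

# agreement in a higher-dimensional pose space is unlikely by chance,
# so close votes are strong evidence that the parts belong to one object
agreement = np.linalg.norm(vote_from_mouth - vote_from_nose)
print(vote_from_mouth, vote_from_nose, agreement)
if agreement < 0.5:                            # made-up threshold
    print("mouth and nose agree: group them into one face")
```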
Outlines
😀 The origins of deep learning and a personal journey
In this interview, Geoff Hinton, hailed as the godfather of deep learning, describes how he became interested in artificial intelligence, machine learning and neural networks. In high school a classmate introduced him to holograms, which sparked his curiosity about how the brain stores memories. At university he studied physiology, physics, philosophy and psychology in turn, but found that none of them adequately explained how the brain works. After a spell as a carpenter he decided to pursue AI and went to Edinburgh to study with a pioneer of neural network research. Hinton stuck to his belief in neural networks and eventually earned a PhD in AI. His work found a receptive audience in California, in particular through his collaboration with David Rumelhart, with whom he developed the celebrated backpropagation algorithm.
🤖 Backpropagation and word embeddings
Geoff discusses the importance of backpropagation, an algorithm that uses the chain rule to obtain derivatives. Although he did not invent it first, his and his colleagues' work helped the community accept it. They trained a model to learn representations of words and used those representations to read off semantic features. The work showed not only how information that would normally live in a graph or tree structure can be turned into feature vectors, but also how those features can be used to derive new, consistent information. Geoff also notes how advances in hardware, such as GPUs and supercomputers, have strongly driven the progress of deep learning.
🧠 Innovations in neural networks and Boltzmann machines
Geoff describes several of his innovations in neural networks, including the Boltzmann machine developed with Terry Sejnowski. They found a simple learning algorithm for large, densely connected networks, of a kind that might plausibly be implemented in the brain. He also covers restricted Boltzmann machines (RBMs) and deep belief networks (DBNs), which played an important role in the resurgence of neural networks and deep learning, as well as variational methods and variational Bayesian methods, which are widely used in statistics.
🔄 Fast weights and recursion in deep learning
Geoff presents his ideas about fast weights and recursion in deep learning. Fast weights adapt rapidly and decay rapidly, so they can hold short-term memory. He showed how such weights make true recursion possible: the neurons and weights used to represent things are reused inside the recursive call, and when the call returns, the fast weights restore the neurons' previous activity states. The idea started in his first year as a graduate student and took roughly 40 years to come to fruition (a minimal fast-weights sketch follows).
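The sketch below is a minimal, assumption-laden illustration of the fast-weights idea: a Hebbian outer-product memory that adapts quickly and decays quickly, so a recent hidden state can be recovered from a noisy cue. The decay rate, learning rate and sizes are illustrative guesses, not the published model.

```python
# Minimal fast-weights sketch: quickly written, quickly decaying memory.
import numpy as np

rng = np.random.default_rng(1)
dim = 8
decay, eta = 0.9, 0.5
A = np.zeros((dim, dim))                 # fast weights: short-term memory

def store(h):
    # fast weights adapt rapidly (Hebbian outer product) and decay rapidly
    global A
    A = decay * A + eta * np.outer(h, h)

def recall(query):
    # reading the fast weights nudges activity back toward stored patterns
    return A @ query

h1 = rng.standard_normal(dim)
store(h1)                                # state saved before a "recursive call"
# ... the neurons get reused for something else during the call ...
noisy_cue = h1 + 0.1 * rng.standard_normal(dim)
recovered = recall(noisy_cue)

# the recalled vector points in roughly the same direction as the stored state
cos = recovered @ h1 / (np.linalg.norm(recovered) * np.linalg.norm(h1) + 1e-9)
print(round(float(cos), 3))
```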
💡 Capsule networks and representing multi-dimensional entities
Geoff introduces his thinking on capsule networks, a new deep learning architecture for representing multi-dimensional entities. Each capsule represents one instance of a feature together with its multiple properties. Through routing by agreement, the network groups and associates parts that vote for consistent wholes, which should improve generalization. He believes capsule networks can learn better from limited data, handle changes in viewpoint better, and be more statistically efficient.
📚 Advice for learning deep learning and future directions
Geoff shares his advice for people who want to enter deep learning. He suggests reading enough of the literature to develop intuitions and then trusting those intuitions. He recommends replicating published papers to understand them deeply, because doing so reveals all the small tricks needed to make the results work. He also encourages people to never stop programming, because it is key to understanding and solving problems. He notes that although universities currently have limited capacity to train people in deep learning, this will change, and that large companies play an important role in training in the meantime. He also discusses the choice between academic research and joining a top company.
🌐 A change in how we use computers and the future of AI
At the end of the interview, Geoff discusses a fundamental change in how we use computers: instead of programming them, we now show them. He sees this as an entirely new way of interacting with computers, one that computer science departments need to adapt to. He expects that in the future far more people will get computers to do things by showing rather than programming. He also mentions the creation of the deep learning specialization and his own deep learning course on Coursera, both of which helped spread deep learning education.
Keywords
💡Deep learning
💡Neural networks
💡Backpropagation
💡Boltzmann machines
💡Deep belief networks
💡Variational methods
💡ReLU activation function
💡dropout
💡Capsule networks
💡Gradient descent
💡Generative adversarial networks
Highlights
Geoff Hinton is hailed as a godfather of deep learning and has made enormous contributions to the field.
Hinton became interested in how the brain stores memories while in high school, which inspired his later explorations in artificial intelligence.
At Cambridge, Hinton first studied physiology and physics, later turned to psychology, and ultimately chose artificial intelligence.
While studying AI at Edinburgh, Hinton persisted with research on neural networks even though they were not accepted by the mainstream at the time.
Around 1982, Hinton worked with David Rumelhart and Ron Williams to develop the backpropagation algorithm.
Backpropagation spread widely thanks to the 1986 paper, which showed the algorithm's ability to learn word vectors.
Hinton argues that if backpropagation is a very effective algorithm, evolution has probably given the brain something similar.
With Terry Sejnowski, Hinton proposed the Boltzmann machine, a simple and effective learning algorithm.
Restricted Boltzmann machines (RBMs) proved very effective in practice, for example as part of the winning entry in the Netflix competition.
Hinton proposed deep belief networks, which combine neural networks with probabilistic graphical models.
With Radford Neal, Hinton improved variational EM, making it more effective for neural networks.
Hinton studied the mathematical properties of the ReLU (rectified linear unit) activation, which helped drive its wide adoption in deep learning.
While at Google, Hinton proposed training deep networks using ReLUs with identity-matrix initialization.
Capsule networks are Hinton's recent research focus, aimed at improving the generalization ability of neural networks.
Through the "routing by agreement" mechanism, capsule networks should handle viewpoint changes and image segmentation better.
Hinton believes that although supervised learning is currently very successful, unsupervised learning will be even more important in the future.
Hinton advises new researchers to read a modest amount of the literature to form intuitions, then to trust and follow those intuitions.
Hinton encourages hands-on programming to understand deep learning algorithms deeply and to keep testing one's own ideas.
Hinton sees deep learning as driving a technological change on nearly the scale of a second industrial revolution.
Hinton stresses that, unlike classical symbolic AI, modern AI holds that a thought can be a vector of neural activity rather than a symbolic expression.
Transcripts
welcome Jeff and thank you for doing
this interview with
deeplearning.ai thank you for inviting me um
I think that at this point you more than
anyone else on this planet has invented
so many of the ideas behind deep
learning and uh a lot of people have
been calling you The Godfather of deep
learning although it wasn't until we're
just chatting a few minutes ago that I
realize you think I'm the first one to
call you that uh which which I'm quite
happy to have done but what I want to
ask is many people know you as a legend
I want to ask about your personal story
behind a legend so how did you get
involved in going way back how did you
get involved in AI and machine learning
and neural
networks so when I was at high school um
I had a classmate who was always better
than me at everything um he was a
brilliant mathematician and he came into
school one day and said did you know the
brain uses hologram
and um I guess that was about
1966 and I said sort of What's a
hologram and he explained that in a
hologram you can chop off half of it and
you still get the whole picture and that
memories in the brain might be
distributed over the whole brain and so
I guess he'd read about Lashley's
experiments where you chop out bits of a
rat's brain and discover it's very hard
to find one bit where it stores one
particular memory
um so that's what first got me
interested in how does the brain store
memories and then when I went to
University I started off studying
physiology and physics I think when I
was at Cambridge I was the only
undergraduate doing physiology and
physics um and then I gave up on that
and tried to do philosophy um because I
thought that might give me more insight
but that seemed to me actually lacking
in ways of distinguishing when they said
something false
and so then I switched to
psychology um and and in Psychology they
had very very simple theories and it
seemed to me it was sort of hopelessly
inadequate for explaining what the brain
was doing so then I took some time off
and became a carpenter um and then I
decided I'd try AI and I went off to
Edinburgh to study AI with Longuet-Higgins
and he had done very nice work
on neural networks and he'd just given up
on neural networks um and been very
impressed by Winograd's thesis so when I
arrived he thought I was kind of doing
this oldfashioned stuff and I ought to
start on symbolic Ai and we had a lot of
fights about that but I just kept on
doing what I believed in oh and then
what um I eventually got a PhD in Ai and
then I couldn't get a job in Britain um
but I saw this very nice advertisement
for um Sloan fellowships in
California and I managed to get one of
those and I went to California and
everything was different there um so in
Britain neural Nets was regarded as kind
of
silly and in California Don Norman and
David
rumelhart um were very open
to uh ideas about neural Nets it was the
first time I'd been somewhere where
thinking about how the brain works and
thinking about how that my relate to
psychology was seen as a very positive
thing and it was a lot of fun
in particular collaborating with David
Rumelhart was great I see right so this was
when you were at UCSD and you and Rumelhart
around what 1982 wound up writing you
know the seminal backprop paper right so
so actually it was more complicated than
that um what happened so in I think around
early
1982 um David Rumelhart and me um and Ron
Williams um between us developed the
backprop algorithm it was mainly David
Rumelhart's idea um we discovered later
that many other people have invented it
um David Parker had invented it um
probably after us but before we
published um Paul Werbos had published it
already quite a few years earlier but
nobody paid it much attention and there
were other people who developed very
similar algorithms it's not clear what's
meant by backprop but using the chain
rule to get derivatives was not a novel
idea why do you think it was your paper
that helped so much the community latch
on to backprop it feels like your paper
marked an inflection in the acceptance
of this algorithm whoever accepted it so
we managed to get a paper into nature in
1986 and I did quite a lot of political
work to get the paper accepted I figured
out that one of the referees was
probably going to be Stuart Sutherland
who was a well-known psychologist in
Britain
and I went and talked to him for a long
time and explained to him exactly what
was going on and he was very impressed
by the fact that we showed that backprop
could learn representations for words
and you could look at those
representations which were little
vectors and you could understand the
meaning of the individual features so we
actually trained it on little triples of
words about family trees like um Mary
has mother Victoria and you'd give it
the first two words and it would have to
predict the last word and after you
trained it you could see all sorts of
features in the representations of the
individual words like the nationality of
the person and their what generation
they were which branch of the family
tree they were in and so on um that was
what made Stuart Sutherland really
impressed with it and I think that was
why the paper got accepted very early
word embeddings and you're already
seeing features learned features of
semantic meanings emerg from their
training algorithm yes so from a
psychologist point of view what was
interesting was it unified two
completely different strands of ideas
about what knowledge was like so there
was the old psychologist view that a
concept is just a big bundle of features
and there's lots of evidence for that
and then there was the AI view of the
time which is a far more structuralist
view which was that a concept is how it
relates to other Concepts and to capture
concept you'd have to do something like
a graph structure or maybe a semantic
net and what this back propagation
example showed was you could give it the
information that would go to a graph
structure or in this case a family tree
and it could convert that information
into features in such a way that it
could then use the features to derive
new consistent information I generalize
but the crucial thing was this to and
fro between the graphical representation
or the the tree structured
representation of the family tree and a
representation of the people as big
feature vectors and the fact that you
could from the graph-like
representation you could get to the
feature vectors and from the feature
vectors you could get more of the graph
like representation so this is
1986 in the early '90s Bengio showed that
you could actually take real data you
could take English text and apply the
same technique there and get embeddings
for real words from English text and
that impressed people a lot I guess
recently we've been talking a lot about
how fast computers like gpus um and
supercomputers is driving deep learning
I didn't realize that back in between
1986 to the early 90s it sounds like
between you and Bengio there was already
the beginnings of this trend yes there
was a huge advance I mean in in 1986 I
was using a Lisp machine which was less
than a tenth of a um megaflop and by
about
1993 or thereabouts people were seeing
like 10 megaflops so it was a factor
of 100 and that's the point at which it
was easy to use because computers were
just getting faster over the past
several decades you've invented so many
pieces of neural networks and deep
learning um I'm actually curious of all
of the things you've invented which are
the ones you're still most excited about
today
so I think the most beautiful one is the
work I did with Terry Sejnowski on Boltzmann
machines wow so we discovered there was
this really really simple learning
algorithm that applied to great big
densely connected Nets where you could
only see a few of the nodes so it would
learn hidden representations and it was
a very simple algorithm and it looked
like the kind of thing you should be
able to get in a brain because each
synapse only needed to know about the
behavior of the two neurons it was
directly connected to
and the information that was
propagated was the same there were two
different phases which we called wake
and sleep but in the two different
phases you're propagating information in
just the same way whereas in something
like back propagation there's a forward
pass and a backward pass and they work
differently they're sending different
kinds of
signals right so I think that's the most
beautiful thing and for many years it
looked like just like a curiosity
because it looked like it was much too
slow but then later on I got rid of a
little bit of the beauty and instead of
letting things settle down just use one
iteration in a in a somewhat simpler net
and that gave restricted Boltzmann machines
which actually worked effectively in
practice so in the Netflix competition
for example um restricted Boltzmann machines
were one of the ingredients of the
winning entry in fact a lot of the um
recent resurgence of neural nets or deep
learning starting about I guess 2007 was
the uh restricted Boltzmann machine and deep
restricted Boltzmann machine work that you and
your lab did yes so that's another of
the pieces of work I'm very happy with
the idea of that you could train a
restricted Boltzmann machine which just had
one layer of hidden features and you
could learn one layer of features and
then you could treat those features as
data and do it again and then you could
treat the new features you'd learned as
data and do it again as many times as
you liked um so that was nice it worked
in practice and then Yee-Whye realized that
the whole thing could be treated as a
single model but it was a weird kind of
model it was a model where at the top
you had a restricted Boltzmann machine but
below that you had a sigmoid belief net
which was something that Radford Neal
had invented many years earlier so it
was a directed model and what we'
managed to come up with by training
these restricted Boltzmann machines was an
efficient way of doing inference in
sigmoid belief nets so around that time
time there were people doing neural Nets
who would use densely connected Nets but
didn't have any good ways of doing
probabilistic inference in them and you
had people doing graphical models um
like Mike Jordan um who could do
inference properly but only in sparsely
connected
Nets and what we managed to show was
there's a way of learning these deep
belief Nets so that there's an
approximate form of inference that's
very fast it just happens in a single
forward pass and that was a very
beautiful result and you could guarantee
that each time you learned an extra
layer of
features there was a bound each time you
learned a new layer you got a new bound
and the new bound was always better than
the old bound y the variational bound
showing that as you add layers the the
the yes yeah I remember that so that was
the second thing that I was really
excited by and I guess the third thing
was the work I did with Radford Neal on
variational methods um um it turns out
people in statistics had done similar
work earlier but we didn't know about
that
um so we managed to make EM work a whole
lot better by showing you didn't need to
do a perfect E-step you could do an
approximate E-step and EM was a big
algorithm in statistics and we showed a
big generalization of it and in
particular in 1993 I guess um with Van
Camp I did a paper that was I think the
first variational Bayes paper where we
showed that you could actually
um do a version of Bayesian learning that
was far more tractable by approximating
the true posterior with a Gaussian oh and
you could do that in neural net and I
was very excited by that I see wow right
yep I think I remember all of these
papers uh the the Neal and Hinton
approximate EM paper right spent many
hours reading over that um and I think
you know some of the algorithms you use today
or some of the algorithms that lots of
people use almost every day are what
things like dropout or um I guess ReLU
activations so came from your
group um yes and no so other people had
thought about rectified linear units and
um we actually did some work with
restricted Boltzmann machines showing that a
ReLU was almost exactly equivalent to a
whole stack of logistic units and that's
one of the things that helped ReLU catch
on I was really curious about that the
ReLU paper had a lot of math showing that
this function can be approximated to
this really complicated formula did you
do that math so your paper would get
Acceptance in academic conference or did
all that math really influence the
development of Max of zero and
X that was one of the cases where
actually the math was important to the
development of the idea so I knew about
rectified linear units obviously and I
knew about logistic units and because of
the work on Boltzmann machines all of the
basic work was done using logistic units
and so the question was could the
learning algorithm work in something
with rectified linear units and by
showing the rectified linear units
were almost exactly equivalent to a
stack of logistic units um we showed
that all the math would go
through I see and it provided
inspiration but today tons of people use
ReLUs and it just works without yeah without
the same without necessarily needing to
understand the same
motivation yeah one thing I noticed
later when I went to Google um I guess
in 2014 I gave a talk at Google about um
using ReLUs and initializing with the
identity matrix because the nice thing
about ReLUs is if you keep replicating
the hidden layers and you initialize
with the identity um it just copies the
pattern in the layer below
and so I was showing that you could
train networks with 300 Hidden layers
and you could train them really
efficiently um if you initialize with
the identity but I didn't pursue that
any further and I really regret not
pursuing that we published one paper
with Quoc Le and Navdeep Jaitly showing you
could initialize recurrent nets like
that but I should have pursued it
further because later on um these
residual networks were really um that
kind of thing over the years I've heard
you you talk a lot about the brain I've
heard you talk about the relationship
between backprop and the brain what are
your current thoughts on
that um I'm actually working on a paper
on that right now um I guess my main
thought is this if it turns out that
backprop is a really good algorithm for
doing learning then for sure Evolution
could have figured out how to implement
it I mean you have cells that can turn
into eyeballs or teeth now if cells can
do that um they can for sure Implement
back propagation and presumably there's
huge selective pressure for it so I
think the neuroscientist's idea that it
doesn't look plausible is just silly
there may be some subtle implementation
of it and I think the brain probably has
something that may not be exactly back
propagation but is quite close to it and
over the years I come up with a number
of ideas about how this might work so in
1987 working with Jay
McClelland um I came up with the
recirculation algorithm where the idea
is um you send information round a
loop and you try to make it so that
things don't change as information goes
around this Loop so the simplest version
would be you have um input units and
hidden units and you send information
from the input to the hidden and then
back to the input and then back to the
hidden and then back to the input and so
on and what you want you want to train
an autoencoder but you want to train it
without having to do back propagation so
you just train it to try and get rid of
all variation in the activities so the
the idea is that the learning rule for a
synapse
is change the weight in proportion to
the presynaptic input and in
proportion to the rate of change of the
postsynaptic input but in recirculation
you're trying to make the postsynaptic
input you're trying to make the old one
be good and the new one be bad so you're
changing it in that direction and we
invented this algorithm before
neuroscientists came up with spike-time
dependent plasticity Spike time
dependent plasticity is actually the
same algorithm but the other way around
where the new thing is good and the old
thing is bad in the learning rule so
you're changing the weight in proportion
to the presynaptic activity times the new
postsynaptic activity minus the old one
um later on I realized in
2007 that if you took a stack of
restricted Boltzmann machines and you
trained it
up um after it was trained you then had
exactly the right conditions for
implementing back propagation by just
trying to reconstruct if you looked at
the Reconstruction error that
reconstruction error would actually tell
you the derivative of the discriminative
performance and I at the first deep
learning workshop at nips in 2007 I gave
a talk about that um that was almost
completely ignored um later on Yoshua
Bengio um took up the idea and that's
actually done quite a lot more work on
that and I've been doing more work on it
myself and I think this idea that if you
have a stack of
autoencoders then you can get
derivatives by sending activity
backwards and looking at reconstruction
errors is a really interesting idea may
well be how the brain does it um one
other topic that I know you thought a
lot about and that uh I hear you're
still working on is how to deal with
multiple time scales in deep learning so
can can you share your thoughts on that
yes so actually that goes back to my
first year as a graduate student the
first talk I ever gave was about using
um what I called Fast weights so weights
that adapt rapidly but Decay rapidly and
therefore can hold short-term memory and
I I showed in a very simple system in
1973 that you could do true recursion
with those weights and what I mean by
true recursion is that
the the neurons that are used for
representing things get reused for
representing things in the recursive
call and the weights that are used for
representing knowledge get reused in the
recursive call and so that leaves the
question of when you pop out of a
recursive call how do you remember what
it was you were in the middle of doing
where's that memory because you use the
neurons for the recursive call and the
answer is you can put that memory into
fast weights and you can recover the
activity states of the neurons from
those fast weights and more recently
working with Jimmy Ba we actually got a
paper in nips about using fast weights
for recursion like that see um so that
was quite a big gap I the first model
was unpublished in
1973 and then Jimmy Ba's model was in
2015 I think or 2016 so it's about 40
years later and I guess one other idea I've
heard you talk about for quite a few
years now over five years I think is
capsules where where are you with
that okay so um I'm back to the state
I'm used to being in which is I have
this idea I really believe in and nobody
else believes it and I submit papers
about it and they all get rejected um
but I really believe in this idea and
I'm just going to keep pushing it so it
hinges
on
um there's a couple of key ideas one is
about how you represent
multi-dimensional
entities
and you can represent multi-dimensional
entities by just a little Vector of
activities as long as you know there's
only one of them so the idea is in each
region of the image you'll assume
there's at most one of a particular kind
of
feature and then you'll use a bunch of
neurons and their activities will
represent the different aspects of that
feature like within that region exactly
what are its X and Y coordinates what
orientation is it at how fast is it
moving what color is it how bright is it
and stuff like that so you can use a
whole bunch of neurons to to represent
different dimensions of the same thing
provided there's only one of them
um that's a very different way of doing
representation from what we're normally
used to in neuron Nets normally in
neural Nets we just have a great big
layer and all the units go off and do
whatever they do but you don't think of
bundling them up into little groups that
represent different coordinates of the
same thing so I think there's I think
there should be this extra structure and
then the other the other idea that goes
with that so so this means in the
distributed representation you partition
the representation to have different
subsets to to represent right rather I
call each of those subsets a capsule and
the idea is a capsule is able to
represent an instance of a feature but
only one um and it represents all the
different properties of that feature so
it's a it's a feature that has lots of
properties as opposed to a normal neuron
in a normal neural net which just has
one scalar property sure I see yep
right and then what you can do if you've
got that is you can do something that
normal neural nets are very bad at which is
you can do um what I call routing by
agreement so let's suppose you want to
do
segmentation and you have something that
might be a mouth and something else that
might be a
nose and you want to know if you should
put them together to make one one thing
so the idea is you'd have a capsule for
a mouth that has the parameters of the
mouth and you have a capsule for a nose
that has the parameters of the nose and
then to decide whether to put them
together or not you get each of them to
vote for what the parameters should be
for a face see now if the mouth and the
nose are in the right spatial
relationship they will
agree so when you get two capsules at
one level voting for the same set of
parameters at the next level up you can
assume they're probably right because
agreement in a high dimensional space is
very
unlikely and that's a very different way
of doing filtering than what we normally
use in neural
Nets
so I think this routing by agreement is
going to be crucial for getting neural
Nets to generalize much better from
limited data I think it'd be very good at
dealing with changes in Viewpoint very
good at doing segmentation and I'm
hoping it'll be much more statistically
efficient than what we currently do in
neural nets which is if you want to deal
with changes in Viewpoint you just give
it a whole bunch of changes in Viewpoint
and and train it on them all I see right
right so rather than relying on only
supervised learning you could learn this
in some different way well I've still
plan to do it with supervised learning
but the mechanics of the forward pass
are very different it's not a pure
forward pass in the sense that there's
little little bits of iteration going on
where you you think you found a mouth and
you think you found a nose and you do a
little bit of iteration to decide
whether they should really go together
to make a face I see and you can do back
props through all that iteration I see
so you can train it all discriminatively
I see
and um we're working on that now at my
group in Toronto so I now have a little
Google team in Toronto part of the brain
team see yep I see and that's what I'm
excited about right now oh I see great
yeah look forward to that paper when
that comes out yeah if it comes
out you know you work in deep learning
for several decades I'm actually really
curious how has your thinking your
understanding of AI you know changed
over these
years so I guess um a lot of my
intellectual history has been around
back propagation and how to use back
propagation how to make use of its power
um so to begin with in the mid 80s we
were using it for discriminative
learning it was working well I then
decided by the early '90s that actually
most human learning was going to be
unsupervised learning and I got much
more interested in unsupervised learning
and that's when I worked on things like
the Wake sleep algorithm um and your
comments at that time really influenced
my thinking as well so when I was
leading Google Brain our first project
spent a lot of work in unsupervised
learning because of your
influence right um and I may have misled
you that is in the long run I think
unsupervised learning is going to be
absolutely crucial yeah but you have to
sort of face
reality um and what's worked over the
last 10 years or so is supervised
learning discriminative training where
you have labels or you're trying to
predict the next thing in a series so
that acts as the label and that's worked
incredibly well
and I still believe that unsupervised
learning is going to be crucial and
things will work incredibly much better
than they do now when we get that
working properly but we haven't yet yeah
y I think Mo many of the senior people
in deep learning including myself remain
very excited about it it's just none of
us really have almost any idea how to do
it yet maybe you do I don't feel like I
do um
variational Auto encoders where you use
the reparameterization trick seemed to
me a really nice idea and generative
adversarial Nets also seem to me to be a
really nice idea I think generative
adversarial Nets are one of the sort of
biggest ideas in deep learning that's
really new I see yeah um I'm hoping I
can make capsules that successful but
right now generative adversarial nets I
think have been a big breakthrough what
happened to sparsity and slow features
which were two of the other principles
for building unsupervised
models
um I was never as big on sparsity as you
were see okay um
but slow features I think is a mistake
um you shouldn't say slow the basic idea
is right but you shouldn't go for
features that don't change you should go
for features that change in predictable
ways I see so here's the sort of basic
princi about how you model anything
um you take your
measurements and you apply nonlinear
transformations to your measurements
until you get to a representation as a
state Vector in which the action is
linear so you don't just pretend it's
linear like you do with Kalman filters
but you actually find a transformation
from the observables to the underlying
variables where linear operations like
Matrix multiplies on the underlying
variables will do the work so for
example if you want to change viewpoints
if you want to produce an image from
another Viewpoint what you should do is
go from the pixels to coordinates and
once you got to the coordinate
representation which is the kind of
thing I'm hoping capsules will find um
you can then do a matrix multiply to
change Viewpoint and then you can map it
back to pixels right that's why you did
all that that's a very very general
principle that's why you did all that
work on face synthesis right we take a face
and compress it to a very low dimensional
vector and show you can FID that and get
back other
faces um I had a student who worked on
that I didn't do much work on that
myself but I see I'm sure you still get
asked all the time if someone wants to
break into deep learning um what should
they do so what advice would you have
I'm sure you given a lot of advice to
people in one-on-one settings but you
know for the global audience of people
watching this video what advice would
you have for them to get into deep learning
okay so my advice is sort of read the
literature but don't read too much of it
um so this is advice I got from my
advisor um which is very unlike what
most people say most people say you
should spend several years reading the
literature and then you should start
working on your own ideas um and that
may be true for some researchers but for
Creative researchers I think what you
want to do is read a little bit of the
literature and notice something that you
think everybody is doing wrong and
contrarian in that sense you look at it
and it just doesn't feel
right and then figure out how to do it
right and then when people tell you
that's no good just keep at it um and I
have a very good principle for helping
people keep at it which is either your
intuitions are good or they're not if
your intuitions are good you should
follow them and you'll eventually be
successful if if your intuitions are not
good it doesn't matter what you
do right inspiring advice so might as
well go for it you might as well trust
your intuitions there's no point not
trusting them see yeah there you know I
usually advise people to not just read
but replicate published papers and maybe
that puts a natural limiter on how many
you could do because replicating results
is pretty time-consuming yeah yes it's
true that when you try and replicate a
published paper you discover all the little
tricks necessary to make it work I
see the other the other advice I have is
never stop programming I see because if
you give a student something to do if
they're a bad student they'll come back
and say it didn't work and the reason it
didn't work it'll be some little
decision they made um that they didn't
realize was crucial and if you give it
to a good student like Yee-Whye for example
you can give him anything and he'll come
back and he'll say it
works I remember doing this once and I
said but wait a minute Yee-Whye um since we
last talked I realized it couldn't
possibly work for the following reason
and you said oh yeah well I realized
that right away so I assumed you didn't
mean
that yeah that's great yeah um let's see
uh any any other advice for people that
want to break into Ai and deep
learning I think that's basically read
enough so you start developing
intuitions and then trust your
intuitions that's see cool yeah and go
go for it see cool and don't be too
worried if everybody else says this
nonsense and I guess there's no way to
know if others are right or wrong when
they say it's nonsense but you just have
to go for it and then find out right but
there is one way there's one thing which
is if you think it's a really good idea
and other people tell you it's complete
nonsense um then you know you're really
on to something so one example of that
is when Radford and I first came up with
variational methods
um I sent mail explaining it to a former
student of mine called Peter Brown who
knew a lot about
em um and he showed it to people who
worked with him called the Della Pietra
brothers they were twins I think yes yes
and he then told me later what they said
and they said to him um either this
guy's drunk or he's just stupid um so
they really really thought it was not
now it could have been partly the way I
explained it because I explained it in
intuitive terms see but when people when
you have what you think is a good idea
and other people think it's complete
rubbish that's the sign of a really good
idea oh I see unless you're
wrong oh and and and research topics you
know new grad students should work on
what capsules and maybe unsupervised
learning any
other one good piece of advice for new
grad students is
see if you can find an
advisor who has beliefs similar to yours
because if you work on stuff that your
advisor feels deeply about you'll get a
lot of good advice and time from your
advisor if you work on stuff your
advisor's not interested in all you'll
get is you'll get some advice but it
won't be nearly so
useful and uh uh last one on advice for
Learners um how do you feel about people
entering a PhD program versus joining
you know a top company or a to research
group in a
corporation yeah it's complicated I
think right now what's happening is
there aren't enough academics trained in
deep learning to educate all the people
we need educated in universities there
just isn't The Faculty bandwidth there
um but I think that's going to be
temporary I think what's happened is
depart most departments are being very
slow to understand the kind of
revolution going on I kind of agree with
you that it's it's not quite a Second
Industrial Revolution but it's something
on nearly that scale and there's a huge
sea change going on basically because
our relationship to computers has
changed instead of programming them we
now show them and they figure it out
that's a completely different way of
using computers and computer science
departments are built around the idea of
programming computers and they don't
understand that sort of
this showing computers is going to be as
big as programming computers and so they
don't understand that half the people in
the department should be people who get
computers to do things by showing them I
see right so my own my own
Department refuses to acknowledge that
um it should have lots and lots of
people doing this it thinks they've got
they got a couple and maybe a few more
but not too
many I and in that situation you have to
rely on the big companies to do quite a
lot of the training so Google is now
training people we call Brain residents
um I suspect the universities will
eventually catch up I see yeah right in
fact uh maybe a lot of students have
figured this out a lot of top PHD
programs you know over half of the
applicants are actually wanting to work
on showing rather than programming yes
yeah cool yeah yeah in fact to give
credit where it's due whereas deep
learning.ai is creating the deep learning
specialization as far as I know the first
deep learning MOOC was actually yours
taught on Coursera back in 2012 as well and
and and somewhat strangely that's when
you first published the RMSprop algorithm
which also took
off right yes well as as you know um
that was because you invited me to do
the MOOC and then when I was very dubious
about doing it you kept pushing me to do
it so it was very good that I did and
although it was a lot of work
yes I yes thank you for doing that I
remember you complaining to me how much
work it was and you staying up late at
night but I think you know many many
learners have benefited from your first
MOOC and I'm still very grateful to you
for it so that's good yeah yeah over the
years I've seen you embroiled in debates
about paradigms for AI uh and whether there
has been a paradigm shift for AI what do
you all can you share your thoughts on
that uh yes happily um so I think in the
early days back in the 50s um people
like von Neumann and Turing didn't believe
in symbolic AI they were far more
inspired by the brain Unfortunately they
both died much too young um and their
voice wasn't heard and in the early days
of AI people were completely convinced
that the representations you needed for
intelligence were symbolic expressions
of some kind sort of cleaned up logic um
where you could do non-monotonic things
and not quite logic but something like
logic and that the essence of
intelligence was
reasoning what's happened now is there's
a completely different view which is
that um what a thought is is just a
great big Vector of neural
activity so contrast that with the
thought being a symbolic expression and
I think the people who thought that
thoughts were symbolic Expressions just
made a huge
mistake what comes in is a string of
words and what comes out is a string of
words and because of that strings of
words are the obvious way to represent
things so they thought what must be in
between was a string of words or
something like a string of words and I
think what's in between is nothing like
a string of words I think the idea that
thoughts must be in some kind of
language is as silly as the idea that
understanding the layout of a spatial
scene must be in pixels pixels come in
in and if we could if we had a dot
matrix printer attached to us then
pixels would come out um but what's in
between isn't
pixels and so I think thoughts are just
these great big vectors and the big
vectors have causal Powers they cause
other big vectors and that's utterly
unlike the standard AI view that
thoughts are symbolic Expressions I see
yep I guess AI is certainly coming round
to this new point of view these days
some of it I see I think a lot of
people in AI still think thoughts have to
be symbolic Expressions thank you very
much for doing this interview it's
fascinating to hear how deep learning
has evolved over the years as well as
how you're still helping drive it into
the future so thank you Jeff well thank
you for giving me this opportunity okay
thank
you