Heroes of Deep Learning: Andrew Ng interviews Geoffrey Hinton

Preserve Knowledge
8 Aug 2017 · 39:45

Summary

TLDR In this in-depth interview, deep learning pioneer Geoffrey Hinton shares his personal story and his contributions to artificial intelligence and machine learning. His curiosity began in high school with the question of how the brain stores memories; after moving through physiology, physics, philosophy and psychology, he committed to AI research at the University of Edinburgh, holding on to his belief in neural networks through opposition and career setbacks. Hinton recounts developing the backpropagation algorithm with David Rumelhart and Ron Williams and getting the Nature paper on learned word representations and semantic features accepted. He also discusses his work on deep belief networks, variational methods and Boltzmann machines, his views on the future of deep learning, and his current research on capsule networks, a new neural network architecture intended to improve generalization. He closes with advice for people entering deep learning, stressing the importance of intuition and of never stopping programming.

Takeaways

  • 🧠 Contributions to deep learning: Geoffrey Hinton, widely called the godfather of deep learning, has made enormous contributions to the field.
  • 📚 Personal story: his academic interest began in high school, when curiosity about how the brain stores memories drew him toward AI and machine learning.
  • 🔄 Academic path: at Cambridge he studied physiology and physics, tried philosophy, then psychology, but found its theories inadequate for explaining how the brain works, so he turned to AI.
  • 🤝 Collaboration and conflict: at Edinburgh he argued with others over neural networks versus symbolic AI, but he kept working on what he believed in.
  • 📈 Revival of neural networks: in the 1980s Hinton, David Rumelhart and Ron Williams developed the backpropagation algorithm; others had invented it earlier, but their work drove the community's acceptance of it.
  • 🏆 Proudest achievements: the Boltzmann machine work with Terry Sejnowski, and the development of restricted Boltzmann machines and deep belief networks.
  • 🔧 Technical innovation: his work showing that a ReLU (Rectified Linear Unit) is almost exactly equivalent to a stack of logistic units helped ReLUs catch on.
  • 🧐 Brains and learning algorithms: Hinton argues that if backpropagation is a really good learning algorithm, evolution could well have implemented some form of it, even if not exactly the same one.
  • 📉 Multiple time scales: he discusses handling multiple time scales in deep learning, including the "fast weights" idea he first proposed in 1973.
  • ⚙️ Capsule networks: Hinton is pushing capsules, a new network structure intended to represent multi-dimensional entities and improve generalization.
  • 📚 Research advice: read just enough of the literature to develop intuitions, then trust and follow those intuitions even when they run against the mainstream.

Q & A

  • How did Geoffrey Hinton first become interested in AI and neural networks in high school?

    - A classmate told him about holograms and the idea that the brain might store memories the way a hologram does, distributed over the whole brain. That sparked his curiosity about how the brain stores memories, which led him toward AI and neural networks.

  • What did Hinton study first at Cambridge?

    - Physiology and physics; he was the only undergraduate at the time taking both.

  • How did Hinton move from psychology to AI?

    - He was initially drawn to psychology but found its theories far too simple to explain how the brain works. He tried philosophy but felt it lacked ways of telling true claims from false ones, so he decided to try AI and went to Edinburgh to study it.

  • How did the research environment in California differ from Britain?

    - In Britain neural networks were regarded as silly, whereas in California people such as Don Norman and David Rumelhart were very open to ideas about neural nets, which let Hinton explore them much more freely.

  • How did Hinton and David Rumelhart develop the backpropagation algorithm?

    - Hinton, Rumelhart and Ron Williams developed it together around 1982. Other researchers had invented it independently, but their work helped the community accept the algorithm.

  • Why did backpropagation become widely accepted after the 1986 paper?

    - The 1986 Nature paper showed that backpropagation could learn representations for words, and that the learned feature vectors captured semantic properties of the words. That paper marked an inflection point in the algorithm's acceptance. (A minimal sketch of this kind of embedding learning follows.)
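The sketch below is only an illustration of the idea described above, not the original 1986 family-trees network: a tiny set of invented (person, relation, person) triples is used to train embeddings plus a softmax layer to predict the third word, so that semantic structure ends up in the learned vectors. All names, sizes and learning rates are made-up placeholder choices.

```python
import numpy as np

people = ["mary", "victoria", "james", "arthur"]
relations = ["mother", "father"]
triples = [("mary", "mother", "victoria"), ("mary", "father", "james"),
           ("arthur", "mother", "victoria"), ("arthur", "father", "james")]

p_idx = {p: i for i, p in enumerate(people)}
r_idx = {r: i for i, r in enumerate(relations)}

rng = np.random.default_rng(0)
dim = 6                                             # size of each learned feature vector
E_person = rng.normal(0, 0.1, (len(people), dim))   # person embeddings
E_rel = rng.normal(0, 0.1, (len(relations), dim))   # relation embeddings
W = rng.normal(0, 0.1, (2 * dim, len(people)))      # softmax output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.5
for epoch in range(500):
    for a, r, b in triples:
        x = np.concatenate([E_person[p_idx[a]], E_rel[r_idx[r]]])
        probs = softmax(W.T @ x)                    # distribution over the third word
        target = np.zeros(len(people))
        target[p_idx[b]] = 1.0
        d_logits = probs - target                   # cross-entropy gradient wrt logits
        dx = W @ d_logits                           # backprop into the input vector
        W -= lr * np.outer(x, d_logits)
        E_person[p_idx[a]] -= lr * dx[:dim]
        E_rel[r_idx[r]] -= lr * dx[dim:]

query = np.concatenate([E_person[p_idx["arthur"]], E_rel[r_idx["mother"]]])
print(people[int(np.argmax(softmax(W.T @ query)))])  # expected to print "victoria"
```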

  • Which of his work in neural networks and deep learning does Hinton consider most beautiful?

    - His work with Terry Sejnowski on Boltzmann machines. They found a very simple learning algorithm for large, densely connected networks in which each synapse only needs to know about the behavior of the two neurons it directly connects.

  • Where have restricted Boltzmann machines (RBMs) succeeded in practice?

    - RBMs were one ingredient of the winning entry in the Netflix competition, and from about 2007 onward the work on RBMs and stacks of them played a major role in the resurgence of neural networks and deep learning. (A contrastive-divergence training sketch follows.)
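As a minimal sketch of how a single-hidden-layer RBM like the ones mentioned above is commonly trained, here is one CD-1 (contrastive divergence with one Gibbs step) update for a binary RBM. The layer sizes, learning rate and random binary data are placeholder assumptions, not values from the interview.

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 4, 0.1
W = rng.normal(0, 0.01, (n_visible, n_hidden))
b_v = np.zeros(n_visible)            # visible biases
b_h = np.zeros(n_hidden)             # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0):
    """One CD-1 step: sample hidden units, reconstruct, then update the weights."""
    global W, b_v, b_h
    p_h0 = sigmoid(v0 @ W + b_h)                     # P(h = 1 | v0)
    h0 = (rng.random(n_hidden) < p_h0).astype(float)
    p_v1 = sigmoid(h0 @ W.T + b_v)                   # reconstruction P(v = 1 | h0)
    v1 = (rng.random(n_visible) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + b_h)
    # Positive-phase statistics minus negative-phase (reconstruction) statistics.
    W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b_v += lr * (v0 - v1)
    b_h += lr * (p_h0 - p_h1)

data = (rng.random((100, n_visible)) < 0.3).astype(float)  # placeholder binary data
for epoch in range(10):
    for v in data:
        cd1_update(v)
```

Stacking follows the recipe Hinton describes in the interview: treat the trained hidden features as data and train another RBM on them, as many times as you like.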

  • What did Hinton and Radford Neal do on variational methods?

    - They showed that you do not need a perfect expectation-maximization (EM) step: an approximate E-step still improves a bound, which greatly generalizes the algorithm. Separately, in 1993 Hinton and Van Camp published what was arguably the first variational Bayes paper, approximating the true posterior with a Gaussian and carrying this out in a neural network. (The bound is written out below.)
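The bound behind the approximate-EM result can be written as the variational free energy; the notation below is the standard textbook form rather than anything quoted in the interview.

```latex
\log p(x \mid \theta)
  \;\ge\; \mathcal{F}(q, \theta)
  = \mathbb{E}_{q(h)}\!\left[ \log p(x, h \mid \theta) \right] + H(q)
  = \log p(x \mid \theta) - \mathrm{KL}\!\left( q(h) \,\|\, p(h \mid x, \theta) \right)
```

A perfect E-step sets q(h) = p(h | x, θ), which makes the KL term zero; an approximate E-step only needs to increase F, and alternating such partial E-steps and M-steps still improves the bound.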

  • What is Hinton's view of the ReLU (Rectified Linear Unit) activation function?

    - He showed that a ReLU is almost exactly equivalent to a stack of logistic units, which helped ReLUs catch on. He also recalls a talk at Google in which he showed that, with ReLUs and identity-matrix initialization, networks with 300 hidden layers can be trained very efficiently. (An illustrative sketch of both points follows.)
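The sketch below is an illustration of both claims rather than the paper's derivation: (1) a finite sum of logistic units with shifted biases tracks a ReLU closely away from zero (the number of copies and the bias offsets are assumed illustrative values), and (2) ReLU layers initialized with the identity matrix simply copy the pattern in the layer below.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# (1) A ReLU compared with a sum of logistic units whose biases are shifted by i - 0.5;
#     using 50 copies and this grid of x values is an arbitrary illustrative choice.
x = np.linspace(-5.0, 5.0, 11)
relu = np.maximum(0.0, x)
stack_of_logistics = sum(sigmoid(x - i + 0.5) for i in range(1, 51))
print(np.round(relu, 3))
print(np.round(stack_of_logistics, 3))   # tracks the ReLU closely except near x = 0

# (2) ReLU layers initialized with the identity matrix copy their input forward,
#     so even a 300-hidden-layer stack passes a non-negative pattern through unchanged.
h = np.array([0.7, 0.0, 1.3, 0.2])
W = np.eye(4)                            # identity-matrix initialization
for _ in range(300):
    h = np.maximum(0.0, W @ h)
print(h)                                 # same values that went in
```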

  • How does Hinton think about capsules, and what role could they play in deep learning?

    - A capsule is a new kind of representation: it stands for one instance of a feature together with all of that feature's properties. Capsules combine features by "routing by agreement", which could help networks generalize from limited data, cope with changes in viewpoint, and do segmentation. (A deliberately simplified routing sketch follows.)
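The following is a deliberately simplified illustration of the agreement idea using Hinton's mouth-and-nose example, not his actual capsule routing algorithm: two hypothetical part capsules each vote for the pose of a face, and the votes are combined only if they agree closely, because agreement in a continuous pose space is unlikely by chance. The pose format, transforms and threshold are all invented for illustration.

```python
import numpy as np

# Each low-level capsule outputs a small pose vector (here: x, y, orientation).
# The part-to-whole transforms below are made up; in a real capsule network
# they would be learned.
def predict_face_pose(part_pose, offset):
    return part_pose + offset            # stand-in for a learned part->whole transform

mouth_pose = np.array([10.0, 20.0, 0.1])
nose_pose = np.array([10.0, 15.0, 0.1])
mouth_vote = predict_face_pose(mouth_pose, np.array([0.0, -5.0, 0.0]))
nose_vote = predict_face_pose(nose_pose, np.array([0.0, 0.0, 0.0]))

# Close agreement between votes is taken as evidence that the parts belong
# to the same higher-level object.
if np.linalg.norm(mouth_vote - nose_vote) < 1.0:
    face_pose = (mouth_vote + nose_vote) / 2.0
    print("parts agree; face capsule activated with pose", face_pose)
else:
    print("votes disagree; no face here")
```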

Outlines

00:00

😀 The origins of deep learning and a personal journey

In this part of the interview Geoff, hailed as the godfather of deep learning, explains how he became interested in AI, machine learning and neural networks. He recalls a high-school classmate introducing him to holograms, which sparked his curiosity about how the brain stores memories. At university he studied physiology, physics, philosophy and psychology in turn, but found that none of them adequately explained how the brain works. After a spell as a carpenter he decided to go into AI and studied at Edinburgh under an early neural network researcher. Geoff stuck to his belief in neural networks and eventually earned a PhD in AI. His work found recognition in California, in particular through the collaboration with David Rumelhart that produced the famous backpropagation algorithm.

05:00

🤖 Backpropagation and word embeddings

Geoff discusses the importance of backpropagation, which uses the chain rule to obtain derivatives. Although he did not originate the algorithm, his and his colleagues' work helped the community accept it. They trained a model to learn representations of words and showed that the learned features captured semantic properties. The work showed how information that would normally go into a graph or tree structure can be converted into feature vectors, and how those features can then be used to derive new, consistent information. Geoff also notes how faster hardware, such as GPUs and supercomputers, has been an important driver of deep learning's progress.

10:00

🧠 Innovations in neural networks and Boltzmann machines

Geoff describes several of his innovations, including the Boltzmann machine developed with Terry Sejnowski: a simple learning algorithm for large, densely connected networks that might plausibly be implemented in the brain. He also covers restricted Boltzmann machines (RBMs) and deep belief networks (DBNs), which played a major role in the resurgence of neural networks and deep learning, as well as variational methods and variational Bayesian methods, which are widely used in statistics.

15:01

🔄 Fast weights and recursion in deep learning

Geoff describes his ideas about fast weights and recursion. Fast weights adapt rapidly and decay rapidly, so they can hold short-term memory. He explains how they make true recursion possible: the neurons and weights used to represent things are reused inside the recursive call, and when the call returns, the fast weights restore the activity states of the neurons. The idea began in his first year as a graduate student and took roughly 40 years to mature into a published model. (A rough sketch of such a fast-weight memory appears below.)
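As a rough sketch of the idea just described, the code below implements a decaying Hebbian-style fast-weight matrix that can pull a recent activity pattern back out after the neurons have been reused. It is only an illustration in the spirit of the description, not the 1973 model or the later NIPS paper; the decay rate, learning rate and sizes are arbitrary assumptions.

```python
import numpy as np

n = 8
A = np.zeros((n, n))                 # fast weights: adapt quickly, decay quickly
decay, eta = 0.9, 0.5

def store(h):
    """Hebbian-style outer-product update of the fast weights."""
    global A
    A = decay * A + eta * np.outer(h, h)

rng = np.random.default_rng(0)
h_before_call = rng.normal(size=n)   # activity pattern to remember
store(h_before_call)

# ... the same neurons would be reused here for the "recursive call" ...

noisy = h_before_call + 0.1 * rng.normal(size=n)
recovered = A @ noisy                # fast weights pull activity back toward the
print(np.corrcoef(h_before_call, recovered)[0, 1])   # stored pattern (corr near 1)
```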

20:01

💡 Capsule networks and representing multi-dimensional entities

Geoff introduces his thinking on capsule networks, a new deep learning architecture for representing multi-dimensional entities. Each capsule represents one instance of a feature together with its many properties. The approach lets the network group and relate parts of different features by agreement, which should improve generalization. He believes capsules can improve learning from limited data, handle changes in viewpoint better, and be more statistically efficient.

25:02

📚 Advice for learners and future directions

Geoff shares his advice for people who want to enter deep learning: read just enough of the literature to develop intuitions, then trust those intuitions. He recommends replicating published papers to understand them deeply, because doing so reveals all the little tricks needed to make the results work, and he urges learners never to stop programming, since that is key to understanding and solving problems. He notes that academia currently lacks the capacity to train everyone in deep learning, that this will change, and that big companies are doing much of the training in the meantime. He also gives his view on choosing between academic research and joining a top company.

30:05

🌐 A change in how we use computers, and the future of AI

At the end of the interview, Geoff discusses a fundamental change in how we use computers: from programming them to showing them. He sees this as an entirely new way of interacting with computers and argues that computer science departments need to adapt to it. He expects far more people in the future to get computers to do things by showing rather than programming. The conversation also touches on the creation of the deep learning specialization and the deep learning course Hinton taught on Coursera, which helped spread deep learning education.

Keywords

💡Deep learning

Deep learning is a branch of machine learning based on neural network algorithms that learn to recognize complex patterns, loosely inspired by how the brain processes information. In the video, Geoffrey Hinton, one of the field's pioneers, shares his insights and experience; deep learning has produced major advances in image recognition, speech recognition and natural language processing.

💡Neural networks

Neural networks are the foundation of deep learning: systems of interconnected units loosely modeled on neurons in the brain. The video covers Hinton's research on neural networks, including his exploration of how to make them better models of brain function.

💡Backpropagation

Backpropagation is the algorithm used to train neural networks: it computes the gradient of a loss function with respect to every parameter and uses those gradients to update the weights. In the video Hinton discusses its importance and recounts developing it with David Rumelhart and Ron Williams. (A minimal worked example appears below.)
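A minimal sketch of backpropagation for a one-hidden-layer network, applying the chain rule layer by layer to a squared-error loss. Sizes, data and the learning rate are placeholder assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 3))                 # placeholder inputs
y = rng.normal(size=(16, 1))                 # placeholder targets
W1 = rng.normal(size=(3, 5)) * 0.1
W2 = rng.normal(size=(5, 1)) * 0.1

for step in range(200):
    # Forward pass.
    h = np.maximum(0.0, x @ W1)              # ReLU hidden layer
    pred = h @ W2
    loss = np.mean((pred - y) ** 2)
    # Backward pass: chain rule from the loss back to each weight matrix.
    d_pred = 2 * (pred - y) / len(x)
    d_W2 = h.T @ d_pred
    d_h = d_pred @ W2.T
    d_W1 = x.T @ (d_h * (h > 0))             # ReLU gate on the hidden gradient
    W1 -= 0.1 * d_W1
    W2 -= 0.1 * d_W2
    if step % 50 == 0:
        print(step, float(loss))             # loss should decrease
```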

💡Boltzmann machines

A Boltzmann machine is a stochastic neural network that learns a probability distribution over data by settling toward thermal equilibrium, in analogy with a physical system. Hinton describes his work on Boltzmann machines with Terry Sejnowski and the insights it brought to deep learning.

💡Deep belief networks

A deep belief network is a deep model built by stacking restricted Boltzmann machines and is used for feature learning. In the video Hinton discusses how these networks are constructed, their role in learning features layer by layer, and how they allow fast approximate inference.

💡Variational methods

Variational methods are optimization techniques used in statistics and machine learning to estimate the parameters of complex probabilistic models. Hinton describes his work with Radford Neal on variational methods and how they generalized the EM algorithm so it could be used more effectively with neural networks.

💡ReLU activation function

The ReLU (Rectified Linear Unit) is an activation function widely used in neural networks, favored for its computational simplicity and training efficiency. The video covers Hinton's work showing that a ReLU is almost exactly equivalent to a stack of logistic units.

💡Dropout

Dropout is a technique for preventing overfitting during neural network training: units are randomly switched off while training. Hinton mentions dropout and how it helps networks generalize better. (A short sketch appears below.)
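A sketch of the random "switching off" described above, using the common inverted-dropout variant so no rescaling is needed at test time. The keep probability and activations are arbitrary example values.

```python
import numpy as np

def dropout(h, keep_prob=0.5, training=True, rng=np.random.default_rng(0)):
    """Inverted dropout: zero units at random during training, rescale the rest."""
    if not training:
        return h                      # at test time the layer is left unchanged
    mask = (rng.random(h.shape) < keep_prob) / keep_prob
    return h * mask

h = np.array([0.3, 1.2, 0.0, 0.8, 2.1])
print(dropout(h))                     # roughly half the units are zeroed
```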

💡Capsule networks

Capsule networks are a neural network architecture proposed by Hinton to improve how networks learn spatial relationships. In the video he explains the idea and how "routing by agreement" lets capsules better understand and segment the objects in an image.

💡Gradient descent

Gradient descent is the optimization algorithm that minimizes a loss function by repeatedly adjusting the network's parameters in the direction that reduces the loss. It is one of the core concepts behind backpropagation and neural network training. (A one-parameter example follows.)
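A one-parameter gradient-descent sketch minimizing a simple quadratic loss; the loss function and learning rate are illustrative choices only.

```python
w = 5.0                       # initial parameter
lr = 0.1                      # learning rate
for step in range(100):
    grad = 2.0 * (w - 3.0)    # derivative of the loss (w - 3)**2
    w -= lr * grad            # step against the gradient
print(w)                      # converges toward the minimizer w = 3
```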

💡Generative adversarial networks

A generative adversarial network (GAN) consists of two networks: a generator that produces data and a discriminator that judges whether the data is real. The competition between them pushes the generator to produce more realistic data. Hinton mentions GANs in the video and calls them one of the really new, big ideas in deep learning.

Highlights

Geoffrey Hinton is widely called the godfather of deep learning and has made enormous contributions to the field.

His interest in how the brain stores memories began in high school and inspired his later work in AI.

At Cambridge he first studied physiology and physics, later moved to psychology, and finally chose AI.

While studying AI at Edinburgh he kept working on neural networks even though the mainstream did not accept them.

Around 1982 Hinton, working with David Rumelhart and Ron Williams, developed the backpropagation algorithm.

Backpropagation spread thanks to the 1986 paper, which demonstrated the algorithm's ability to learn word representations.

Hinton argues that if backpropagation is a really good algorithm, evolution has very likely built something similar into the brain.

With Terry Sejnowski, Hinton proposed the Boltzmann machine, a simple and elegant learning algorithm.

Restricted Boltzmann machines (RBMs) proved very effective in practice, for example as part of the winning Netflix Prize entry.

Hinton proposed deep belief networks, which combine neural networks with probabilistic graphical models.

With Radford Neal, Hinton generalized the EM algorithm using variational methods, making it more effective for neural networks.

Hinton worked out the mathematical relationship between ReLUs (Rectified Linear Units) and stacks of logistic units, which helped drive their wide adoption in deep learning.

While at Google, Hinton showed that very deep networks could be trained using ReLUs with identity-matrix initialization.

Capsule networks are the current focus of Hinton's research, aimed at improving how neural networks generalize.

Through "routing by agreement", capsule networks should handle viewpoint changes and image segmentation better.

Hinton believes that although supervised learning has been very successful, unsupervised learning will matter even more in the future.

He advises new researchers to read a modest amount of the literature to form intuitions, then trust and follow those intuitions.

He encourages learners to keep programming in order to understand deep learning algorithms, and to keep testing and validating their own ideas.

Hinton sees deep learning as driving a technological change on something close to the scale of a second industrial revolution.

Hinton stresses that, unlike classical symbolic AI, the modern view holds that a thought is a big vector of neural activity rather than a symbolic expression.

Transcripts

play00:01

welcome Jeff and thank you for doing

play00:03

this interview with deep

play00:06

learning.ai thank you for inviting me um

play00:09

I think that at this point you more than

play00:11

anyone else on this planet has invented

play00:13

so many of the ideas behind deep

play00:15

learning and uh a lot of people have

play00:17

been calling you The Godfather of deep

play00:20

learning although it wasn't until we're

play00:22

just chatting a few minutes ago that I

play00:24

realize you think I'm the first one to

play00:25

call you that uh which which I'm quite

play00:28

happy to have done but what I want to

play00:30

ask is many people know you as a legend

play00:34

I want to ask about your personal story

play00:36

behind a legend so how did you get

play00:39

involved in going way back how did you

play00:41

get involved in Ai and machine learning

play00:43

and neural

play00:44

networks so when I was at high school um

play00:48

I had a classmate who was always better

play00:50

than me at everything um he was a

play00:52

brilliant mathematician and he came into

play00:56

school one day and said did you know the

play00:58

brain uses hologram

play01:01

and um I guess that was about

play01:04

1966 and I said sort of What's a

play01:06

hologram and he explained that in a

play01:08

hologram you can chop off half of it and

play01:10

you still get the whole picture and that

play01:13

memories in the brain might be

play01:14

distributed over the whole brain and so

play01:16

I guess he'd read about lashley's

play01:18

experiments where you chop out bits of a

play01:20

rat's brain and discover it's very hard

play01:22

to find one bit where it stores one

play01:24

particular memory

play01:26

um so that's what first got me

play01:29

interested in how does the brain store

play01:32

memories and then when I went to

play01:34

University I started off studying

play01:36

physiology and physics I think when I

play01:39

was at Cambridge I was the only

play01:41

undergraduate doing physiology and

play01:43

physics um and then I gave up on that

play01:48

and tried to do philosophy um because I

play01:50

thought that might give me more insight

play01:52

but that seemed to me actually lacking

play01:56

in ways of distinguishing when they said

play01:58

something false

play02:00

and so then I switched to

play02:02

psychology um and and in Psychology they

play02:06

had very very simple theories and it

play02:08

seemed to me it was sort of hopelessly

play02:10

inadequate for explaining what the brain

play02:11

was doing so then I took some time off

play02:13

and became a carpenter um and then I

play02:16

decided I'd try Ai and I went off to

play02:19

Edinburgh to study AI with Longuet

play02:21

Higgins and he had done very nice work

play02:23

on neural networks and he' just given up

play02:26

on neural networks um and been very

play02:28

impressed by Winograd's thesis so when I

play02:31

arrived he thought I was kind of doing

play02:33

this oldfashioned stuff and I ought to

play02:35

start on symbolic Ai and we had a lot of

play02:38

fights about that but I just kept on

play02:39

doing what I believed in oh and then

play02:42

what um I eventually got a PhD in Ai and

play02:47

then I couldn't get a job in Britain um

play02:51

but I saw this very nice advertisement

play02:53

for um Sloan fellowships in

play02:56

California and I managed to get one of

play02:58

those and I went to California and

play03:00

everything was different there um so in

play03:05

Britain neural Nets was regarded as kind

play03:08

of

play03:09

silly and in California Don Norman and

play03:13

David

play03:14

rumelhart um were very open

play03:17

to uh ideas about neural Nets it was the

play03:20

first time I'd been somewhere where

play03:22

thinking about how the brain works and

play03:24

thinking about how that my relate to

play03:25

psychology was seen as a very positive

play03:28

thing and it was a lot of fun

play03:30

in particular collaborating with David

play03:31

Rumelhart was great I see right so this was

play03:34

when you were at UCSD and you and Rumelhart

play03:37

around what 1982 wound up writing you

play03:39

know the seminal backrop paper right so

play03:42

so actually it was more complicated than

play03:45

that um what Happ so in I think around

play03:50

early

play03:51

1982 um David Rumelhart and me um and Ron

play03:57

Williams um but between us developed the

play04:01

backprop algorithm it was mainly David

play04:03

Rumelhart's idea um we discovered later

play04:06

that many other people have invented it

play04:09

um David Parker had invented it um

play04:12

probably after us but before we

play04:14

published um Paul wbos had published it

play04:17

already quite a few years earlier but

play04:19

nobody paid it much attention and there

play04:22

were other people who developed very

play04:24

similar algorithms it's not clear what's

play04:26

meant by backprop but using the chain

play04:28

rule to get derivatives was not a novel

play04:31

idea why do you think it was your paper

play04:35

that helped so much the community latch

play04:38

on to backprop it feels like your paper

play04:40

marked an inflection in the acceptance

play04:42

of this algorithm whoever accepted it so

play04:46

we managed to get a paper into nature in

play04:49

1986 and I did quite a lot of political

play04:51

work to get the paper accepted I figured

play04:54

out that one of the referees was

play04:55

probably going to be Stuart Sutherland

play04:57

who was a well-known psychologist in

play04:59

Britain

play05:00

and I went and talked to him for a long

play05:01

time and explained to him exactly what

play05:03

was going on and he was very impressed

play05:06

by the fact that we showed that backprop

play05:08

could learn representations for words

play05:12

and you could look at those

play05:13

representations which were little

play05:14

vectors and you could understand the

play05:16

meaning of the individual features so we

play05:19

actually trained it on little triples of

play05:22

words about family trees like um Mary

play05:27

has mother Victoria and you'd give it

play05:31

the first two words and it would have to

play05:32

predict the last word and after you

play05:35

trained it you could see all sorts of

play05:37

features in the representations of the

play05:39

individual words like the nationality of

play05:41

the person and their what generation

play05:44

they were which branch of the family

play05:45

tree they were in and so on um that was

play05:48

what made Stuart Sutherland really

play05:50

impressed with it and I think that was

play05:51

why the paper got accepted very early

play05:53

word embeddings and you're already

play05:55

seeing features learned features of

play05:57

semantic meanings emerg from their

play05:59

training algorithm yes so from a

play06:02

psychologist point of view what was

play06:05

interesting was it unified two

play06:07

completely different strands of ideas

play06:10

about what knowledge was like so there

play06:13

was the old psychologist view that a

play06:15

concept is just a big bundle of features

play06:18

and there's lots of evidence for that

play06:19

and then there was the AI view of the

play06:21

time which is a far more structuralist

play06:24

view which was that a concept is how it

play06:27

relates to other Concepts and to capture

play06:30

concept you'd have to do something like

play06:31

a graph structure or maybe a semantic

play06:33

net and what this back propagation

play06:38

example showed was you could give it the

play06:40

information that would go to a graph

play06:42

structure or in this case a family tree

play06:45

and it could convert that information

play06:47

into features in such a way that it

play06:49

could then use the features to derive

play06:53

new consistent information I generalize

play06:56

but the crucial thing was this to and

play06:58

fro between the graphical representation

play07:01

or the the tree structured

play07:03

representation of the family tree and a

play07:06

representation of the people as big

play07:08

feature vectors and the fact that you

play07:10

could from the graph likee

play07:12

representation you could get to the

play07:13

feature vectors and from the feature

play07:15

vectors you could get more of the graph

play07:17

likee representation so this is

play07:19

1986 in the early '90s Bengio showed that

play07:24

you could actually take real data you

play07:25

could take English text and apply the

play07:28

same technique there and get embeddings

play07:31

for real words from English text and

play07:35

that impressed people a lot I guess

play07:37

recently we've been talking a lot about

play07:39

how fast computers like gpus um and

play07:42

supercomputers is driving deep learning

play07:44

I didn't realize that back in between

play07:47

1986 to the early 90s it sounds like

play07:49

between you and Bengio there was already

play07:51

the beginnings of this trend yes there

play07:54

was a huge advance I mean in in 1986 I

play07:58

was using a Lisp machine which was less

play08:00

than a tenth of a um megaflop and by

play08:05

about

play08:06

1993 or thereabouts people were seeing

play08:09

like 10 megaflops so it was a factor

play08:11

of 100 and that's the point at which it

play08:14

was easy to use because computers were

play08:15

just getting faster over the past

play08:17

several decades you've invented so many

play08:20

pieces of neuron networks and deep

play08:22

learning um I'm actually curious of all

play08:24

of the things you've invented which are

play08:26

the ones you're still most excited about

play08:28

today

play08:30

so I think the most beautiful one is the

play08:32

work I did with Terry Sejnowski on Boltzmann

play08:34

machines wow so we discovered there was

play08:36

this really really simple learning

play08:39

algorithm that applied to great big

play08:42

densely connected Nets where you could

play08:44

only see a few of the nodes so it would

play08:47

learn hidden representations and it was

play08:49

a very simple algorithm and it looked

play08:51

like the kind of thing you should be

play08:52

able to get in a brain because each

play08:54

synapse only needed to know about the

play08:56

behavior of the two neurons it was

play08:58

directly connected to

play09:00

and the information that was

play09:02

propagated was the same there were two

play09:05

different phases which we called wake

play09:07

and sleep but in the two different

play09:09

phases you're propagating information in

play09:11

just the same way whereas in something

play09:13

like back propagation there's a forward

play09:15

pass and a backward pass and they work

play09:17

differently they're sending different

play09:18

kinds of

play09:19

signals right so I think that's the most

play09:22

beautiful thing and for many years it

play09:25

looked like just like a curiosity

play09:26

because it looked like it was much too

play09:28

slow but then later on I got rid of a

play09:31

little bit of the beauty and instead of

play09:33

letting things settle down just use one

play09:34

iteration in a in a somewhat simpler net

play09:37

and that gave restricted Boltzmann machines

play09:39

which actually worked effectively in

play09:42

practice so in the Netflix competition

play09:44

for example um restricted Boltzmann machines

play09:46

were one of the ingredients of the

play09:47

winning entry in fact a lot of the um

play09:50

recent resurgence of neural nets or deep

play09:52

learning starting about I guess 2007 was

play09:56

the uh restricted Boltzmann machine and deep

play09:58

restricted Boltzmann machine work that you and

play10:00

your lab did yes so that's another of

play10:03

the pieces of work I'm very happy with

play10:05

the idea of that you could train a

play10:07

restricted Boltzmann machine which just had

play10:09

one layer of hidden features and you

play10:12

could learn one layer of features and

play10:14

then you could treat those features as

play10:15

data and do it again and then you could

play10:18

treat the new features you'd learned as

play10:20

data and do it again as many times as

play10:21

you liked um so that was nice it worked

play10:24

in practice and then Yee Whye Teh realized that

play10:29

the whole thing could be treated as a

play10:31

single model but it was a weird kind of

play10:34

model it was a model where at the top

play10:36

you had a restricted bolts machine but

play10:38

below that you had a sigmoid belief net

play10:42

which was something that Radford Neal

play10:44

had invented many years earlier so it

play10:46

was a directed model and what we'

play10:48

managed to come up with by training

play10:50

these restricted Boltzmann machines was an

play10:53

efficient way of doing inference and

play10:54

sigmoid belief Nets so around that time

play10:59

time there were people doing neural Nets

play11:02

who would use densely connected Nets but

play11:04

didn't have any good ways of doing

play11:06

probabilistic inference in them and you

play11:09

had people doing graphical models um

play11:11

like Mike Jordan um who could do

play11:15

inference properly but only in sparsely

play11:18

connected

play11:19

Nets and what we managed to show was

play11:22

there's a way of learning these deep

play11:24

belief Nets so that there's an

play11:26

approximate form of inference that's

play11:28

very fast it just happens in a single

play11:30

forward pass and that was a very

play11:33

beautiful result and you could guarantee

play11:35

that each time you learned an extra

play11:37

layer of

play11:38

features there was a bound each time you

play11:40

learned a new layer you got a new bound

play11:43

and the new bound was always better than

play11:44

the old bound y the variational bound

play11:46

showing that as you add layers the the

play11:48

the yes yeah I remember that so that was

play11:50

the second thing that I was really

play11:51

excited by and I guess the third thing

play11:53

was the work I did with Radford Neal on

play11:57

variational methods um um it turns out

play12:00

people in statistics had done similar

play12:03

work earlier but we didn't know about

play12:05

that

play12:06

um so we managed to make em work a whole

play12:11

lot better by showing you didn't need to

play12:12

do a perfect E-step you could do an

play12:14

approximate E-step and EM was a big

play12:16

algorithm in statistics and we showed a

play12:19

big generalization of it and in

play12:21

particular in 1993 I guess um with Van

play12:25

Camp I did a paper that was I think the

play12:27

first variational Bayes paper where we

play12:29

showed that you could actually

play12:32

um do a version of Bayesian learning that

play12:35

was far more tractable by approximating

play12:38

the true posterior with a Gaussian oh and

play12:41

you could do that in neural net and I

play12:43

was very excited by that I see wow right

play12:46

yep I think I remember all of these

play12:49

papers uh the the Neal and Hinton

play12:51

approximate em paper right spend many

play12:54

hours reading over that um and I think

play12:57

you know some of the algorithms you use today

play12:59

or some of the algorithms that lots of

play13:00

people use almost every day are what

play13:02

things like dropout or um I guess ReLU

play13:05

activations so came from your

play13:08

group um yes and no so other people had

play13:12

thought about rectified linear units and

play13:16

um we actually did some work with

play13:18

restricted bolts machine showing that a

play13:20

relu was almost exactly equivalent to a

play13:23

whole stack of logistic units and that's

play13:26

one of the things that helped ReLU catch

play13:28

on I was really curious about that the

play13:30

ReLU paper had a lot of math showing that

play13:34

this function can be approximated to

play13:36

this really complicated formula did you

play13:38

do that math so your paper would get

play13:40

Acceptance in academic conference or did

play13:42

all that math really influence the

play13:44

development of Max of zero and

play13:48

X that was one of the cases where

play13:51

actually the math was important to the

play13:54

development of the idea so I knew about

play13:56

rectified linear units obviously and I

play13:58

knew about logistic units and because of

play14:00

the work on boltzman machines all of the

play14:02

basic work was done using logistic units

play14:05

and so the question was could the

play14:08

learning algorithm work in something

play14:10

with rectified linear units and by

play14:12

showing the rectified linear units

play14:15

were almost exactly equivalent to a

play14:17

stack of logistic units um we showed

play14:21

that all the math would go

play14:23

through I see and it provided

play14:25

inspiration but today tons of people use

play14:28

r and it just works without yeah without

play14:31

the same without necessarily needing to

play14:33

understand the same

play14:35

motivation yeah one thing I noticed

play14:37

later when I went to Google um I guess

play14:40

in 2014 I gave a talk at Google about um

play14:45

using relus and in initializing with the

play14:48

identity Matrix because the nice thing

play14:50

about relus is if you keep replicating

play14:52

the hidden layers and you initialize

play14:54

with the identity um it just copies the

play14:57

pattern in the layer below

play14:59

and so I was showing that you could

play15:00

train networks with 300 Hidden layers

play15:03

and you could train them really

play15:04

efficiently um if you initialize with

play15:06

the identity but I didn't pursue that

play15:08

any further and I really regret not

play15:10

pursuing that we published one paper

play15:12

with Quoc Le showing you could

play15:14

initialize Rec and Navdeep Jaitly showing you

play15:16

could initialize recurrent Nets like

play15:18

that but I should have pursued it

play15:20

further because later on um these

play15:23

residual networks were really um that

play15:26

kind of thing over the years I've heard

play15:28

you you talk a lot about the brain I've

play15:29

heard you talk about the relationship

play15:31

between back Prof and the Brain what are

play15:33

your current thoughts on

play15:35

that um I'm actually working on a paper

play15:38

on that right now um I guess my main

play15:42

thought is this if it turns out that

play15:45

backprop is a really good algorithm for

play15:48

doing learning then for sure Evolution

play15:52

could have figured out how to implement

play15:54

it I mean you have cells that can turn

play15:58

into eyeballs or teeth now if cells can

play16:01

do that um they can for sure Implement

play16:03

back propagation and presumably there's

play16:06

huge selective pressure for it so I

play16:09

think the neuroscientist's idea that it

play16:11

doesn't look plausible is just silly

play16:13

there may be some subtle implementation

play16:15

of it and I think the brain probably has

play16:17

something that may not be exactly back

play16:19

propagation but is quite close to it and

play16:21

over the years I come up with a number

play16:23

of ideas about how this might work so in

play16:27

1987 working with Jay

play16:29

McClelland um I came up with the

play16:32

recirculation algorithm where the idea

play16:35

is um you send information round a

play16:39

loop and you try to make it so that

play16:42

things don't change as information goes

play16:44

around this Loop so the simplest version

play16:47

would be you have um input units and

play16:50

hidden units and you send information

play16:53

from the input to the hidden and then

play16:54

back to the input and then back to the

play16:56

hidden and then back to the input and so

play16:57

on and what you want you want to train

play17:01

an autoencoder but you want to train it

play17:02

without having to do back propagation so

play17:05

you just train it to try and get rid of

play17:08

all variation in the activities so the

play17:11

the idea is that the learning rule for a

play17:15

synapse

play17:16

is change the weight in proportion to

play17:18

the Press synaptic input and in

play17:21

proportion to the rate of change of the

play17:23

post synaptic input but in recirculation

play17:25

you're trying to make the post synaptic

play17:27

input you're trying to make the old one

play17:29

be good and the new one be bad so you're

play17:31

changing it in that direction and we

play17:34

invented this algorithm before

play17:36

neuroscientists come up with Spike time

play17:38

dependent plasticity Spike time

play17:40

dependent plasticity is actually the

play17:42

same algorithm but the other way around

play17:44

where the new thing is good and the old

play17:46

thing is bad in the learning rule so

play17:49

you're changing the weight in proportion

play17:50

to the presynaptic activity times the new

play17:55

postsynaptic activity minus the old one

play17:59

um later on I realized in

play18:02

2007 that if you took a stack of

play18:05

restricted Boltzmann machines and you

play18:07

trained it

play18:09

up um after it was trained you then had

play18:13

exactly the right conditions for

play18:15

implementing back propagation by just

play18:18

trying to reconstruct if you looked at

play18:20

the Reconstruction error that

play18:22

reconstruction error would actually tell

play18:24

you the derivative of the discriminative

play18:28

performance and I at the first deep

play18:31

learning workshop at nips in 2007 I gave

play18:34

a talk about that um that was almost

play18:37

completely ignored um later on Yoshua

play18:42

Bengio um took up the idea and that's

play18:44

actually done quite a lot more work on

play18:46

that and I've been doing more work on it

play18:48

myself and I think this idea that if you

play18:51

have a stack of

play18:53

autoencoders then you can get

play18:56

derivatives by sending activity

play18:59

backwards and looking at reconstruction

play19:00

errors is a really interesting idea may

play19:03

well be how the brain does it um one

play19:06

other topic that I know you thought a

play19:08

lot about and that uh I hear you're

play19:10

still working on is how to deal with

play19:12

multiple time scales in deep learning so

play19:15

can can you share your thoughts on that

play19:17

yes so actually that goes back to my

play19:19

first year as a graduate student the

play19:22

first talk I ever gave was about using

play19:24

um what I called Fast weights so weights

play19:27

that adapt rapidly but Decay rapidly and

play19:30

therefore can hold short-term memory and

play19:33

I I showed in a very simple system in

play19:36

1973 that you could do true recursion

play19:38

with those weights and what I mean by

play19:40

true recursion is that

play19:43

the the neurons that are used for

play19:46

representing things get reused for

play19:49

representing things in the recursive

play19:52

call and the weights that are used for

play19:54

representing knowledge get reused in the

play19:56

recursive call and so that leaves the

play19:59

question of when you pop out of a

play20:00

recursive call how do you remember what

play20:03

it was you in the middle of doing

play20:04

where's that memory because you use the

play20:06

neurons for the recursive call and the

play20:09

answer is you can put that memory into

play20:11

fast weights and you can recover the

play20:13

activity states of the neurons from

play20:15

those fast weights and more recently

play20:18

working with Jimmy Ba we actually got a

play20:20

paper in nips about using fast weights

play20:22

for recursion like that see um so that

play20:25

was quite a big gap I the first model

play20:29

was unpublished in

play20:31

1973 and then Jimmy Ba's model was in

play20:35

2015 I think or 2016 so it's about 40

play20:39

years later and I guess one other idea I

play20:42

appr you talk about for quite a few

play20:45

years now over five years I think is

play20:48

capsules where where are you with

play20:51

that okay so um I'm back to the state

play20:55

I'm used to being in which is I have

play20:58

this idea I really believe in and nobody

play21:00

else believes it and I submit papers

play21:03

about it and they all get rejected um

play21:06

but I really believe in this idea and

play21:08

I'm just going to keep pushing it so it

play21:11

hinges

play21:12

on

play21:14

um there's a couple of key ideas one is

play21:17

about how you represent

play21:18

multi-dimensional

play21:20

entities

play21:21

and you can represent multi-dimensional

play21:25

entities by just a little Vector of

play21:27

activities as long as you know there's

play21:29

only one of them so the idea is in each

play21:31

region of the image you'll assume

play21:34

there's at most one of a particular kind

play21:36

of

play21:37

feature and then you'll use a bunch of

play21:40

neurons and their activities will

play21:43

represent the different aspects of that

play21:46

feature like within that region exactly

play21:49

what are it X and Y coordinates what

play21:50

orientation is it at how fast is it

play21:52

moving what color is it how bright is it

play21:54

and stuff like that so you can use a

play21:56

whole bunch of neurons to to represent

play21:58

different dimensions of the same thing

play22:01

provided there's only one of them

play22:04

um that's a very different way of doing

play22:08

representation from what we're normally

play22:10

used to in neuron Nets normally in

play22:11

neural Nets we just have a great big

play22:12

layer and all the units go off and do

play22:14

whatever they do but you don't think of

play22:16

bundling them up into little groups that

play22:18

represent different coordinates of the

play22:20

same thing so I think there's I think

play22:23

there should be this extra structure and

play22:25

then the other the other idea that goes

play22:27

with that so so this means in the

play22:29

distributed representation you partition

play22:31

the representation to have different

play22:33

subsets to to represent right rather I

play22:37

call each of those subsets a capsule and

play22:39

the idea is a capsule is able to

play22:41

represent an instance of a feature but

play22:44

only one um and it represents all the

play22:47

different properties of that feature so

play22:50

it's a it's a feature that has lots of

play22:51

properties as opposed to a normal neuron

play22:54

in a normal neural net which is just has

play22:56

one scalar property sure I see yep

play22:59

right and then what you can do if you've

play23:01

got that is you can do something that

play23:04

normal neuronet are very bad at which is

play23:07

you can do um what I call routing by

play23:11

agreement so let's suppose you want to

play23:14

do

play23:14

segmentation and you have something that

play23:17

might be a mouth and something else that

play23:18

might be a

play23:20

nose and you want to know if you should

play23:22

put them together to make one one thing

play23:25

so the idea is you'd have a capsule for

play23:27

a mouth that has the parameters of the

play23:29

mouth and you have a capsule for a nose

play23:31

that has the parameters of the nose and

play23:33

then to decide whether to put them

play23:35

together or not you get each of them to

play23:38

vote for what the parameters should be

play23:41

for a face see now if the mouth and the

play23:44

nose are in the right spatial

play23:45

relationship they will

play23:46

agree so when you get two capsules at

play23:49

one level voting for the same set of

play23:52

parameters at the next level up you can

play23:54

assume they're probably right because

play23:56

agreement in a high dimensional space is

play23:57

very

play23:59

unlikely and that's a very different way

play24:02

of doing filtering than what we normally

play24:05

use in neural

play24:08

Nets

play24:10

so I think this routing by agreement is

play24:13

going to be crucial for getting neural

play24:15

Nets to generalize much better from

play24:18

limited data I think it' be very good at

play24:20

dealing with changes in Viewpoint very

play24:22

good at doing segmentation and I'm

play24:24

hoping it'll be much more statistically

play24:26

efficient than what we currently do in

play24:28

your nets which is if you want to deal

play24:30

with changes in Viewpoint you just give

play24:32

it a whole bunch of changes in Viewpoint

play24:33

and and train it on them all I see right

play24:36

right so rather than feif for only

play24:39

supervised learning you could learn this

play24:40

in some different way well I've still

play24:44

plan to do it with supervised learning

play24:47

but the mechanics of the forward pass

play24:49

are very different it's not a pure

play24:51

forward pass in the sense that there's

play24:53

little little bits of iteration going on

play24:56

where you you think you find a mouth and

play24:58

you think you found a nose and you do a

play25:00

little bit of iteration to decide

play25:02

whether they should really go together

play25:03

to make a face I see and you can do back

play25:07

props through all that iteration I see

play25:09

so you can train it all discriminatively

play25:11

I see

play25:13

and um we're working on that now at my

play25:16

group in Toronto so I now have a little

play25:18

Google team in Toronto part of the brain

play25:20

team see yep I see and that's what I'm

play25:23

excited about right now oh I see great

play25:25

yeah look forward to that paper when

play25:27

that comes out yeah if it comes

play25:32

out you know you work in deep learning

play25:35

for several decades I'm actually really

play25:36

curious how has your thinking your

play25:39

understanding of AI you know changed

play25:41

over these

play25:42

years so I guess um a lot of my

play25:46

intellectual history has been around

play25:48

back propagation and how to use back

play25:51

propagation how to make use of its power

play25:55

um so to begin with in the mid 80s we

play26:00

were using it for discriminative

play26:01

learning it was working well I then

play26:03

decided by the early '90s that actually

play26:08

most human learning was going to be

play26:09

unsupervised learning and I got much

play26:12

more interested in unsupervised learning

play26:14

and that's when I worked on things like

play26:15

the Wake sleep algorithm um and your

play26:18

comments at that time really influenced

play26:20

my thinking as well so when I was

play26:23

leading Google Brain our first project

play26:25

spent a lot of work in unsupervised

play26:26

learning because of

play26:28

inluence right um and I may have misled

play26:32

you that is in the long run I think UNS

play26:35

supervised learning is going to be

play26:36

absolutely crucial yeah but you have to

play26:39

sort of face

play26:40

reality um and what's worked over the

play26:44

last 10 years or so is supervised

play26:46

learning discriminative training where

play26:49

you have labels or you're trying to

play26:51

predict the next thing in a series so

play26:53

that access the label and that's worked

play26:55

incredibly well

play26:59

and I still believe that unsupervised

play27:02

learning is going to be crucial and

play27:05

things will work incredibly much better

play27:07

than they do now when we get that

play27:09

working properly but we haven't yet yeah

play27:12

y I think Mo many of the senior people

play27:15

in deep learning including myself remain

play27:17

very excited about it it's just none of

play27:20

us really have almost any idea how to do

play27:23

it yet maybe you do I don't feel like I

play27:26

do um

play27:28

variational Auto encoders where you use

play27:30

the reparameterization trick seemed to

play27:31

me a really nice idea and generative

play27:35

adversarial Nets also seem to me to be a

play27:37

really nice idea I think generative

play27:39

adversarial Nets are one of the sort of

play27:42

biggest ideas in deep learning that's

play27:44

really new I see yeah um I'm hoping I

play27:48

can make capsules that successful but

play27:49

right now genitor adversarial Nets I

play27:52

think have been a big breakthrough what

play27:55

happened to sparsity and slow features

play27:58

which were two of the other principles

play27:59

for building unsupervised

play28:02

models

play28:03

um I was never as big on sparsity as you

play28:06

were see okay um

play28:10

but slow features I think is a mistake

play28:14

um you shouldn't say slow the basic idea

play28:17

is right but you shouldn't go for

play28:19

features that don't change you should go

play28:21

for features that change in predictable

play28:23

ways I see so here's the sort of basic

play28:27

princi about how you model anything

play28:30

um you take your

play28:34

measurements and you apply nonlinear

play28:36

transformations to your measurements

play28:38

until you get to a representation as a

play28:41

state Vector in which the action is

play28:44

linear so you don't just pretend it's

play28:47

linear like you do with Kalman filters

play28:49

but you actually find a transformation

play28:52

from the observables to the underlying

play28:54

variables where linear operations like

play28:56

Matrix multiplies on the underlying

play28:58

variables will do the work so for

play29:00

example if you want to change viewpoints

play29:03

if you want to produce an image from

play29:04

another Viewpoint what you should do is

play29:07

go from the pixels to coordinates and

play29:11

once you got to the coordinate

play29:12

representation which is the kind of

play29:14

thing I'm hoping capsules will find um

play29:17

you can then do a matrix multiply to

play29:19

change Viewpoint and then you can map it

play29:21

back to pixels right that's why you did

play29:23

all that that's a very very general

play29:24

principle that's why you did all that

play29:26

work on face synthesis right we take a face

play29:28

and compress it a very low dimensional

play29:31

vector and show you can FID that and get

play29:32

back other

play29:34

faces um I had a student who worked on

play29:37

that I didn't do much work on that

play29:38

myself but I see I'm sure you still get

play29:41

out all the time if someone wants to

play29:43

break into deep learning um what should

play29:46

they do so what advice would you have

play29:48

I'm sure you given a lot of advice to

play29:49

people in one-on-one settings but you

play29:51

know for the global audience of people

play29:53

watching this video what advice would

play29:55

you have for them to get into deep

play29:59

okay so my advice is sort of read the

play30:01

literature but don't read too much of it

play30:04

um so this is advice I got from my

play30:07

advisor um which is very unlike what

play30:10

most people say most people say you

play30:11

should spend several years reading the

play30:13

literature and then you should start

play30:15

working on your own ideas um and that

play30:18

may be true for some researchers but for

play30:22

Creative researchers I think what you

play30:24

want to do is read a little bit of the

play30:26

literature and notice something that you

play30:28

think everybody is doing wrong and

play30:31

contrarian in that sense you look at it

play30:33

and it just doesn't feel

play30:35

right and then figure out how to do it

play30:39

right and then when people tell you

play30:41

that's no good just keep at it um and I

play30:46

have a very good principle for helping

play30:48

people keep at it which is either your

play30:50

intuitions are good or they're not if

play30:53

your intuitions are good you should

play30:54

follow them and you'll eventually be

play30:56

successful if if your intuitions are not

play30:58

good it doesn't matter what you

play31:01

do right inspiring advice so might as

play31:04

well go for it you might as well trust

play31:07

your intuitions there's no point not

play31:09

trusting them see yeah there you know I

play31:13

usually advise people to not just read

play31:16

but replicate publish papers and maybe

play31:19

that puts a natural limiter on how many

play31:21

you could do because replicating results

play31:22

is pretty time consuming yeah yes it's

play31:25

true that when you try and replicate a

play31:26

published paper you discover all the little

play31:28

tricks necessary to make it to work I

play31:30

see the other the other advice I have is

play31:33

never stop programming I see because if

play31:36

you give a student something to do if

play31:38

they're a bad student they'll come back

play31:40

and say it didn't work and the reason it

play31:42

didn't work it'll be some little

play31:43

decision they made um that they didn't

play31:46

realize was crucial and if you give it

play31:48

to a good student like YY for example

play31:51

you can give him anything and he'll come

play31:53

back and he'll say it

play31:54

works I remember doing this once and I

play31:57

said but wait a minute y um since we

play32:00

last talk I realized it couldn't

play32:01

possibly work for the following reason

play32:03

and you said oh yeah well I realized

play32:05

that right away so I assumed you didn't

play32:06

mean

play32:08

that yeah that's great yeah um let's see

play32:13

uh any any other advice for people that

play32:16

want to break into Ai and deep

play32:20

learning I think that's basically read

play32:23

enough so you start developing

play32:24

intuitions and then trust your

play32:25

intuitions that's see cool yeah and go

play32:28

go for it see cool and don't be too

play32:31

worried if everybody else says this

play32:33

nonsense and I guess there's no way to

play32:35

know if others are right or wrong when

play32:37

they say it's nonsense but you just have

play32:38

to go for it and then find out right but

play32:42

there is one way there's one thing which

play32:44

is if you think it's a really good idea

play32:47

and other people tell you it's complete

play32:50

nonsense um then you know you're really

play32:52

on to something so one example of that

play32:54

is when Radford and I first came up with

play32:56

variational methods

play32:58

um I sent mail explaining it to a former

play33:02

student of mine called Peter Brown who

play33:04

knew a lot about

play33:05

em um and he showed it to people who

play33:08

worked with him called the Della Pietra

play33:11

Brothers they were twins I think yes yes

play33:14

and he then told me later what they said

play33:19

and they said to him um either this

play33:21

guy's drunk or he's just stupid um so

play33:25

they really really thought it was not

play33:27

now it could have been partly the way I

play33:28

explained it because I explained it in

play33:30

intuitive terms see but when people when

play33:34

you have what you think is a good idea

play33:36

and other people think it's complete

play33:37

rubbish that's the sign of a really good

play33:39

idea oh I see unless you're

play33:42

wrong oh and and and research topics you

play33:45

know new grad students should work on

play33:48

what capsules and maybe unsupervised

play33:50

learning any

play33:53

other one good piece of advice for new

play33:55

grad students is

play33:57

see if you can find an

play33:59

advisor who has beliefs similar to yours

play34:02

because if you work on stuff that your

play34:05

advisor feels deeply about you'll get a

play34:07

lot of good advice and time from your

play34:09

advisor if you work on stuff your

play34:11

advisor's not interested in all you'll

play34:13

get is you'll get some advice but it

play34:15

won't be nearly so

play34:17

useful and uh uh last one on advice for

play34:21

Learners um how do you feel about people

play34:24

entering a PhD program versus joining

play34:26

you know a top company or a to research

play34:29

group in a

play34:32

corporation yeah it's complicated I

play34:34

think right now what's happening is

play34:37

there aren't enough academics trained in

play34:39

deep learning to educate all the people

play34:42

we need educated in universities there

play34:45

just isn't The Faculty bandwidth there

play34:48

um but I think that's going to be

play34:49

temporary I think what's happened is

play34:53

depart most departments are being very

play34:54

slow to understand the kind of

play34:56

revolution going on I kind of agree with

play34:58

you that it's it's not quite a Second

play35:00

Industrial Revolution but it's something

play35:02

on nearly that scale and there's a huge

play35:05

sea change going on basically because

play35:08

our relationship to computers has

play35:11

changed instead of programming them we

play35:13

now show them and they figure it out

play35:17

that's a completely different way of

play35:18

using computers and computer science

play35:20

departments are built around the idea of

play35:22

programming computers and they don't

play35:25

understand that sort of

play35:28

this showing computers is going to be as

play35:30

big as programming computers and so they

play35:32

don't understand that half the people in

play35:34

the department should be people who get

play35:36

computers to do things by showing them I

play35:38

see right so my own my own

play35:41

Department refuses to acknowledge that

play35:45

um it should have lots and lots of

play35:46

people doing this it thinks they've got

play35:48

they got a couple and maybe a few more

play35:50

but not too

play35:52

many I and in that situation you have to

play35:56

rely on the big companies to do quite a

play35:58

lot of the training so Google is now

play36:00

training people we call Brain residents

play36:03

um I suspect the universities will

play36:05

eventually catch up I see yeah right in

play36:08

fact uh maybe a lot of students have

play36:10

figured this out a lot of top PHD

play36:12

programs you know over half the P the

play36:15

applicant are actually wanting to work

play36:17

on showing rather than programming yes

play36:20

yeah cool yeah yeah in fact you to give

play36:23

credit where do you where whereas a deep

play36:25

learning.ai is creating the deep learning

play36:27

specialization far as I know the first

play36:29

deep learning MOOC was actually yours

play36:31

taught on Coursera back in 2012 as well and

play36:36

and and somewhat strangely that's when

play36:37

you first publish the RMS prop algorithm

play36:40

which also took

play36:42

off right yes well as as you know um

play36:47

that was because you invited me to do

play36:48

the MOOC and then when I was very dubious

play36:51

about doing it you kept pushing me to do

play36:53

it so it was very good that I did and

play36:55

although it was a lot of work

play36:57

yes I yes thank you for doing that I

play36:59

remember you complaining to me how much

play37:01

work it was and you staying up late at

play37:02

night but I think you know many many

play37:05

Learners have benefited for your first

play37:07

move and I still very grateful to you

play37:09

for it so that's good yeah yeah over the

play37:12

years I've seen you embroiled in debates

play37:15

about paradigms for AI uh and what it

play37:17

has been a paradigm shift for AI what do

play37:20

you all can you share your thoughts on

play37:22

that uh yes happily um so I think in the

play37:26

early days back in the 50s um people

play37:30

like von Neumann and Turing didn't believe

play37:33

in symbolic AI they were far more

play37:35

inspired by the brain Unfortunately they

play37:38

both died much too young um and their

play37:41

voice wasn't heard and in the early days

play37:44

of AI people were completely convinced

play37:47

that the representations you needed for

play37:49

intelligence were symbolic expressions

play37:52

of some kind sort of cleaned up logic um

play37:56

where you could do non- monotonic things

play37:58

and not quite logic but something like

play38:00

logic and that the essence of

play38:02

intelligence was

play38:03

reasoning what's happened now is there's

play38:05

a completely different view which is

play38:08

that um what a thought is is just a

play38:11

great big Vector of neural

play38:14

activity so contrast that with the

play38:16

thought being a symbolic expression and

play38:19

I think the people who thought that

play38:20

thoughts were symbolic Expressions just

play38:22

made a huge

play38:23

mistake what comes in is a string of

play38:27

words and what comes out is a string of

play38:30

words and because of that strings of

play38:33

words are the obvious way to represent

play38:35

things so they thought what must be in

play38:36

between was a string of words or

play38:38

something like a string of words and I

play38:41

think what's in between is nothing like

play38:43

a string of words I think the idea that

play38:45

thoughts must be in some kind of

play38:47

language is as silly as the idea that

play38:51

understanding the layout of a spatial

play38:53

scene must be in pixels pixels come in

play38:56

in and if we could if we had a dot

play38:59

matrix printer attached to us then

play39:01

pixels would come out um but what's in

play39:03

between isn't

play39:05

pixels and so I think thoughts are just

play39:08

these great big vectors and the big

play39:10

vectors have causal Powers they cause

play39:12

other big vectors and that's utterly

play39:14

unlike the standard AI view that

play39:17

thoughts are symbolic Expressions I see

play39:19

yep I guess AI is certainly coming round

play39:23

to this new point of view these days

play39:24

some of it let see I think a lot of

play39:27

people in a still think thoughts have to

play39:29

be symbolic Expressions thank you very

play39:31

much for doing this interview it's

play39:33

fascinating to hear how deep learning

play39:34

has evolved over the years as well as

play39:36

how you're still helping drive it into

play39:38

the future so thank you Jeff well thank

play39:41

you for giving me this opportunity okay

play39:43

thank

play39:44

you
