Season 2 Ep 22: Geoff Hinton on revolutionizing artificial intelligence... again

Season Two | The Robot Brains Podcast
1 Jun 2022 · 88:20

Summary

TLDR: In this in-depth interview, AI pioneer Geoff Hinton shares his insights on deep learning and neural networks. Professor Hinton discusses the origins and development of deep learning and how it became today's most prominent approach to AI. He recounts his own work in the field, including the image-recognition breakthrough known as the "ImageNet moment," which propelled the entire field forward. He also examines the limitations of current AI, particularly in comparison with how the human brain works: he believes that existing deep learning techniques such as backpropagation may differ greatly from the brain's own processing mechanisms. He proposes possible directions for future research, including unsupervised learning, local objective functions, and new learning algorithms that model how the brain functions. Hinton's work matters not only for understanding how the brain works; it also points the way toward more efficient and more powerful AI systems.

Takeaways

  • 📈 Over the past decade, AI has made breakthrough progress in many fields, including computer vision, speech recognition, machine translation, robotics, medicine, and computational biology, and these advances directly drive the business of trillion-dollar companies and many new startups.
  • 🧠 Deep learning, a subfield of AI, is the foundation of these breakthroughs. Geoffrey Hinton is a pioneer of deep learning; his work has been cited more than half a million times and has profoundly shaped the entire field.
  • 🏆 For his contributions to deep learning, Hinton was awarded the equivalent of the Nobel Prize in computer science, and he still leads research in the field today.
  • 🔬 In 2012, Hinton showed that deep learning was superior to every other approach to image recognition; the result, known as the ImageNet moment, changed the research direction of the whole AI field.
  • 🤖 Hinton believes that although today's AI systems are highly effective in some respects, they differ fundamentally from how the brain works, particularly with regard to backpropagation.
  • 🌟 Hinton proposes that the brain may learn using many local objective functions, an idea that may be closer to the brain's learning mechanism than existing deep learning algorithms.
  • 📚 Hinton and colleagues (the paper was led by Ting Chen) proposed SimCLR, a learning algorithm that improves performance through self-supervised learning, which may resemble mechanisms the brain uses.
  • 🔬 Hinton discusses how the brain learns with its enormous number of parameters, contrasting this with today's neural networks, particularly in energy efficiency and style of computation.
  • 💡 Hinton introduces the concept of "mortal computation": computers that acquire their knowledge by learning rather than by being programmed, and whose knowledge dies with the hardware when it stops working.
  • 🧵 Hinton also explores the computational function of sleep, suggesting that sleep may correspond to the negative phase of learning in neural networks.
  • 👁️‍🗨️ Finally, Hinton discusses how visualization techniques such as t-SNE help us understand how neural networks represent data in high-dimensional space, and how they illuminate the inner workings of machine learning models.

Q & A

  • In which fields has deep learning achieved breakthrough progress?

    - Deep learning has achieved breakthroughs in computer vision, speech recognition, machine translation, robotics, medicine, computational biology, protein folding prediction, and many other fields.

  • Why is deep learning able to drive the business of trillion-dollar companies and so many startups?

    - By providing advanced algorithms and techniques, deep learning lets companies make major gains in image recognition, data processing, and automated decision-making, which drives business innovation and growth.

  • What is Geoffrey Hinton's place in the history of artificial intelligence?

    - Hinton is regarded as one of the most important figures in AI history: a pioneer of deep learning who still leads research in the field today.

  • What are neural networks, and why should we care about them?

    - Neural networks are computational models that imitate how neurons in the brain work; they learn and process information by adjusting the weights between neurons. We care about them because of their remarkable power in pattern recognition, data processing, and automated decision-making.

  • What is Hinton's view on our understanding of how the brain works?

    - Hinton believes our understanding of the brain is still limited, but he expects a major breakthrough within the next five years. He argues that existing AI does not fully imitate the brain's workings, especially where backpropagation is concerned.

  • Why might backpropagation differ from the brain's learning mechanism?

    - Hinton believes the brain may obtain the gradients for adjusting its parameters in a different way, one that may be better suited to abstracting structure from small amounts of data.

  • What roles do self-supervised and unsupervised learning play in deep learning?

    - They are key components of deep learning that let models learn from unlabeled data, reducing dependence on large labeled datasets and extracting more structured information from the data itself.

  • Why might the brain learn with many local objective functions?

    - Hinton proposes that the brain may use many local objective functions so that it can learn from local inconsistencies. This style of learning could be more efficient and flexible than a single end-to-end system.

  • In deep learning, how can features extracted from different image patches be made consistent?

    - By designing objective functions that reward agreement between the features extracted from different patches. Once the domain is familiar, features predicted from context will generally agree with locally extracted ones; when they disagree, the disagreement serves as a learning signal.

  • Why might large neural network models such as GPT-3 not truly understand what they generate?

    - Although large models can generate coherent and seemingly logical content, they may only have captured statistical patterns in their training data without truly understanding its deeper meaning.

  • How should we think about model generalization, especially in the face of adversarial examples?

    - Adversarial examples reveal that models may rely on shallow features such as texture rather than deep semantic understanding. Improving generalization requires studying models' decision processes more deeply, and may require new algorithms or architectures.
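To make the adversarial-example point above concrete, here is a minimal sketch of the fast gradient sign method (FGSM) applied to a toy logistic model. The model, weights, and inputs are illustrative assumptions, not anything discussed in the episode:

```python
import numpy as np

# Toy differentiable "classifier": logistic regression on a 2-D input.
w, b = np.array([2.0, -1.0]), 0.1

def predict(x):
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def fgsm(x, y, eps=0.3):
    """Fast gradient sign method: nudge every input dimension by eps in
    the direction that increases the loss, producing an input that looks
    almost unchanged but can flip the model's decision."""
    p = predict(x)
    grad_x = (p - y) * w            # d(cross-entropy)/dx for this model
    return x + eps * np.sign(grad_x)

x, y = np.array([0.5, 0.2]), 1.0
x_adv = fgsm(x, y)
print(predict(x), predict(x_adv))   # the model's confidence drops after the attack
```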

Outlines

00:00

😀 The origins and impact of deep learning

This section introduces deep learning as the subfield of AI behind a decade of breakthroughs across many domains. It highlights the contributions of Geoffrey Hinton, regarded as one of the most important figures in AI history, whose work has been cited more than half a million times. At the 2012 ImageNet competition, Hinton showed deep learning's decisive advantage in image recognition, an achievement that turned the entire AI field toward deep learning.

05:01

🧠 Exploring how the brain works

Hinton discusses how the brain works, in particular how neurons respond to incoming signals by adjusting weights. He poses the question of how the brain adjusts those weights and argues that answering it is the key to understanding the brain. He is optimistic that the mechanism will be cracked within the next few years, though he suspects that backpropagation, the algorithm behind today's deep learning, is quite different from what the brain actually does.

10:01

🔄 The efficiency and limits of backpropagation

Hinton corrects the claim that he invented backpropagation: what he and his colleagues actually did was show that it could learn interesting representations, such as word embeddings. He believes backpropagation is more efficient than the learning process in the brain, but perhaps not as good at abstracting a lot of structure from little data. He proposes unsupervised objective functions as the key form of learning and suggests the brain may use many local objective functions.

15:02

🤖 Self-supervised learning and its resemblance to the brain

Hinton discusses self-supervised learning, in particular extracting a representation from a small region of an image and comparing it with a contextual prediction based on the representations of other regions. He believes this may be closer to how the brain works and that local disagreement can serve as a learning signal. He also mentions the SimCLR paper and his own earlier paper on self-supervised learning through agreement between representations of different regions of the same image.

20:02

🧬 Learning algorithms and the brain

Hinton discusses current methods such as end-to-end learning with backpropagation, and asks how learning methods could be improved to extract more information from less data. He proposes using many small local objective functions to increase the bandwidth of learning, discusses how the brain might use such local objectives, and notes that the brain probably does not apply the same function at every locality the way current neural networks do.

25:03

🌱 Transferring and sharing knowledge

Hinton proposes transferring knowledge through mutual teaching between local regions. This differs from weight sharing but offers a mechanism that is more flexible and more faithful to biological neural networks; though less efficient than weight sharing, he argues it may be how the brain shares knowledge.

30:07

🚀 From academia to industry

Hinton recounts his move from academia to industry, including an unhappy experience at the University of Toronto, his Coursera course, and the subsequent auction of his company as another way to earn money. He describes how the auction ran and how they decided to work at Google, because he liked the research environment and the team there.

35:08

🏆 Academic background and career turns

Hinton shares his academic background, including his studies at Cambridge and a later period working as a carpenter. He discusses his love of carpentry, how he came to recognize his limitations as a carpenter, which drove him back to academia, his PhD on neural networks at Edinburgh, and his early research on neural networks.

40:09

🤔 The future of deep learning and comparison with the brain

Hinton reflects on his views about deep learning, including his remark that "deep learning will be able to do everything." He explains what he means by deep learning, how he sees the brain using local objective functions, and discusses how the current design of computers differs from the way the brain works.

45:10

🧵 A future of grown rather than manufactured computers

Hinton offers a vision of computer design that gives up the immortality of programs in exchange for low-energy computation and cheap fabrication. He discusses the idea of "growing" computers with nanotechnology: computers that acquire all their knowledge through learning and lose that knowledge when they die.

50:13

🌟 The scale and understanding of neural networks

Hinton discusses large neural networks, such as large language models, and their achievements in language and image recognition. He raises the question of whether these models really understand the information they process, comparing their abilities with the visual systems of insects.

55:14

💤 The computational function of sleep

Hinton explores the computational function of sleep, proposing that sleep may correspond to the negative phase of learning. During sleep the brain may carry out a kind of "unlearning" or "un-optimizing" process that helps consolidate memories and avoid overfitting.

1:00:16

📉 When the student surpasses the teacher

Hinton describes how a neural network (the student) can outperform its training data (the teacher) even when labels are wrong. With large amounts of noisily labeled data, a network can learn to produce outputs that are more accurate than the labels it was trained on.

1:05:18

📈 Inventing t-SNE and visualizing high-dimensional data

Hinton recounts how he invented t-SNE (t-distributed stochastic neighbor embedding), a technique for visualizing high-dimensional data. He explains how t-SNE improved on the earlier SNE algorithm by replacing its Gaussians with a heavier-tailed distribution (which behaves like a mixture of Gaussians at different scales), letting the map show coarse structure and fine detail at the same time.


Keywords

💡Deep learning

Deep learning is a machine learning technique that recognizes patterns by simulating the neural networks of the brain; it is widely used in image recognition, speech recognition, natural language processing, and other areas. In the video, deep learning is the core topic of discussion, as one of the key technologies behind AI breakthroughs.

💡Neural networks

Neural networks are the foundation of deep learning: collections of interconnected neurons that imitate how the brain processes information. The video notes that understanding how neural networks work is essential to uncovering how the brain works.

💡Backpropagation

Backpropagation is the algorithm used in deep learning to train neural networks; it adjusts the weights in the network to minimize error. The video notes that backpropagation may differ from the brain's learning mechanism, suggesting that new learning algorithms may be needed in the future.
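For readers who want to see the mechanics, here is a minimal backpropagation loop for a tiny two-layer network on XOR; a toy sketch in plain numpy, not the formulation from the original paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny two-layer network trained by backpropagation on XOR.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.], [1.], [1.], [0.]])
W1, W2 = rng.normal(size=(2, 8)), rng.normal(size=(8, 1))

for _ in range(2000):
    h = np.tanh(X @ W1)                     # forward pass
    out = h @ W2
    err = out - y                           # d(MSE)/d(out)
    # backward pass: the chain rule propagates the error to each weight
    gW2 = h.T @ err
    gW1 = X.T @ ((err @ W2.T) * (1 - h ** 2))
    W2 -= 0.05 * gW2                        # gradient descent step
    W1 -= 0.05 * gW1

print((np.tanh(X @ W1) @ W2).round(2).ravel())   # approximately [0, 1, 1, 0]
```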

💡Self-supervised learning

Self-supervised learning is an unsupervised form of machine learning that lets a model learn from unlabeled data. The video discusses its importance and how it can be achieved by comparing the representations of different image patches.

💡Gradient descent

Gradient descent is an optimization algorithm that minimizes a loss function by iteratively adjusting parameters toward the function's minimum. In deep learning, gradient descent is used to train neural networks by adjusting the network weights.

💡ImageNet moment

The ImageNet moment refers to deep learning's decisive success in the 2012 ImageNet competition, far surpassing other methods and marking the arrival of the deep learning era. The video notes the profound impact this moment had on the entire AI field.

💡Local objective functions

A local objective function is an optimization criterion defined over a small part of the data, in contrast to a global objective function. The video suggests that the brain may learn with many local objective functions, unlike the end-to-end strategy used in current deep learning.

💡Energy function

In machine learning, an energy function represents the energy state of a system and is often used to guide a model's learning. The video notes that adjusting neuron weights to reshape the energy function can help a neural network learn.

💡dropout

Dropout is a regularization technique that prevents overfitting by randomly switching off some of a network's neurons during training. The video mentions dropout as one of the techniques that helped the deep learning model succeed in the ImageNet competition.
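A minimal sketch of the technique (inverted dropout, the common modern variant; shapes and rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5, train=True):
    """Inverted dropout: randomly zero a fraction p of activations during
    training and rescale the survivors, so nothing changes at test time."""
    if not train:
        return h
    mask = (rng.random(h.shape) >= p) / (1.0 - p)
    return h * mask

h = np.ones((2, 6))
print(dropout(h))               # roughly half the units silenced, rest scaled x2
print(dropout(h, train=False))  # unchanged at test time
```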

💡Sparse coding

Sparse coding is a way of representing data in which most elements are zero or close to zero. In deep learning, sparsity helps extract more meaningful features; the video mentions sparse coding as a technique that can improve neural network performance.

💡Boltzmann machines

A Boltzmann machine is a stochastic neural network that can learn the probability distribution of its input data. The video notes the importance of Boltzmann machines in early deep learning research and how they inspired later deep learning models.

Highlights

Over the past 10 years, AI has made breakthrough progress in computer vision, speech recognition, machine translation, robotics, medicine, computational biology, and many other fields.

Deep learning, a subfield of AI, is the foundation of these breakthroughs.

Geoffrey Hinton is regarded as one of the most important figures in the history of AI, with a profound influence on the origins and development of deep learning.

Hinton's work has been cited more than half a million times, meaning a vast body of research papers builds on his work.

In 2012, Hinton showed that deep learning beat every other approach to image recognition; the result, known as the ImageNet moment, triggered a shift across the AI field.

Hinton believes that although existing AI is largely built on backpropagation, the brain probably uses a different mechanism to adjust the weights of its neural networks.

Hinton proposes unsupervised objective functions, emphasizing learning a good model of the world by looking at it, so that actions can be based on the model rather than on raw data.

Hinton believes the brain may use many local objective functions rather than a single end-to-end system.

Hinton proposes a candidate objective function based on representations of local image regions: compare locally extracted features against predictions made from context.

Hinton discusses the importance of self-supervised learning and its similarity to how the brain learns.

Hinton points out that the brain's neurons communicate with spikes, which differs markedly from how current artificial neural networks work.

Hinton argues that to model the brain better, future systems may need to give up the "immortality" of programs in favor of "mortal computers" that acquire their knowledge through learning.

Hinton describes the "student surpassing the teacher" phenomenon: even on datasets with wrong labels, a deep learning model can learn to be more accurate than its training data.

Hinton discusses the computational function of sleep, proposing that it may relate to negative-phase learning or memory consolidation.

Hinton introduces t-SNE (t-distributed stochastic neighbor embedding), a technique for visualizing high-dimensional data that reveals the similarity between data points.
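For readers who want to try the technique, here is a short usage sketch with scikit-learn's implementation (assuming scikit-learn is installed; the parameters are illustrative):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()    # 64-dimensional images of handwritten digits
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(digits.data)
print(emb.shape)          # (1797, 2): one 2-D point per image

# Nearby points in the 2-D map correspond to images that were close in
# the original 64-dimensional space, which is what makes the plot useful.
```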

Hinton stresses the importance of understanding what a model has learned as you build it, and discusses the use of visualization techniques in understanding the learning process.

Transcripts

[00:00] [Music]

Host [00:11]: Over the past 10 years, AI has experienced breakthrough after breakthrough after breakthrough: in computer vision, in speech recognition, in machine translation, in robotics, in medicine, in computational biology, in protein folding prediction, and the list goes on and on and on. And the breakthroughs aren't showing any signs of stopping. Not to mention, these AI breakthroughs are directly driving the business of trillion-dollar companies and many, many new startups.

[00:42] Underneath all of these breakthroughs is one single subfield of AI: deep learning. So when and where did deep learning originate, and when did it become the most prominent AI approach? Today's guest has everything to do with this.

[01:01] Today's guest is arguably the single most important person in AI history, and he continues to lead the charge today. Awarded the equivalent of the Nobel Prize for computer science, today's guest has had his work cited over half a million times; that means there are half a million, and counting, other research papers out there that build on top of his work. Today's guest has worked on deep learning for about half a century, and most of the time in relative obscurity. But that all changed in 2012, when he showed deep learning is better at image recognition than any other approach to computer vision, and by a very large margin. That result, that moment, known as the ImageNet moment, changed the whole AI field: pretty much everyone dropped what they had been doing and switched to deep learning.

[02:00] Former students of today's guest include Volodymyr Mnih, who put DeepMind on the map with their first major result on learning to play Atari games, and our season one finale guest Ilya Sutskever, co-founder and research director of OpenAI. In fact, every single guest on our podcast has built on top of the work done by today's guest. I am, of course, talking about no one less than Geoff Hinton. Jeff, welcome to the show. So happy to have you here.

Geoff [02:34]: Well, thank you very much for inviting me.

Host [02:37]: So glad to get to talk with you on the show here. I'd say let's dive right in with maybe the highest-level question I can ask you: what are neural nets, and why should we care?

Geoff [02:51]: Okay. If you already know a lot about neural nets, please forgive the simplifications. Here's how your brain works: it has lots of little processing elements called neurons, and every so often a neuron goes ping. What makes it go ping is that it's hearing pings from other neurons, and each time it hears a ping from another neuron it adds a little weight to some store of input that it's got, and when it's got enough input, it goes ping. So if you want to know how the brain works, all you need to know is how the neurons decide to adjust those weights that they add when a ping arrives. That's all you need to know. There's got to be some procedure used for adjusting those weights, and if we could figure it out, we'd know how the brain works.

Host [03:37]: And that's been your quest for a long time now, figuring out how the brain might work. What's the status? Do we as a field understand how the brain works?

Geoff [03:49]: Okay, I always think we're going to crack it in the next five years, since that's quite a productive thing to think. But I actually do, and I think we're going to crack it in the next five years. I think we're getting closer. I'm fairly confident now that it's not back propagation, so all of existing AI, I think, is built on something that's quite different from what the brain's doing. At a high level it's got to be the same; that is, you have a lot of parameters, these weights between neurons, and you adjust those parameters on the basis of lots of training examples, and that causes wonderful things to happen if you have billions of parameters. The brain's like that, and deep learning is like that. The question is how you get the gradient for adjusting those parameters. What you want is some measure of how well you're doing, and then you want to adjust the parameters so they improve that measure. But my belief currently is that back propagation, which is the way deep learning works at present, is quite different from what the brain's doing. The brain's getting gradients in a different way.

Host [04:58]: Now that's interesting that you're the one saying that, Jeff, because you actually wrote a paper on back propagation for training neural networks, and it's powering everything everybody's doing today. And now here you are saying it's probably time for us to figure out how to change it, closer to what the brain is doing. Or do you think maybe back propagation could be better than what the brain is doing?

Geoff [05:22]: Let me first correct you. Yes, we did write the most cited paper on back propagation: Rumelhart and Williams and me. But back propagation was already known to a number of different authors. What we really did was show that it could learn interesting representations. So it wasn't that we invented back propagation; Rumelhart reinvented back propagation, and we showed that it could learn interesting representations, like for example word embeddings. I think back propagation is probably much more efficient than what we have in the brain. It's squeezing a lot of information into a few connections, where by a few connections I mean only a few billion. The problem the brain has is that connections are very cheap; we've got hundreds of trillions of them. Experience is very expensive. So we are willing to throw lots and lots of parameters at a small amount of experience, whereas the neural nets we're using are basically the other way around: they have lots and lots of experience and they're trying to get the information about what relates the input to the output into the parameters. I think back propagation is much more efficient than what the brain's using at doing that, but maybe not as good at abstracting a lot of structure from not much data.

Host [06:51]: And this begs the question, of course: do you have any hypothesis on approaches that might get better performance in that regard?

Geoff [07:01]: I have a sort of general view, which I've had for a long, long time, which is that we need unsupervised objective functions. I'm talking mainly about perceptual learning, which I think is the sort of key. If you can learn a good model of the world by looking at it, then you can base your actions on that model rather than on the raw data, and that's going to make doing the right things much easier. I'm convinced that the brain is using lots of little local objective functions. So rather than being a kind of end-to-end system trained to optimize one objective function, I think it's using lots of little local ones. As an example, the kind of thing I think will make a good objective function, though it's hard to make it work, is this: if you look at a small patch of an image and try to extract some representation of what you think is there, you can now compare the representation you got from that small patch of the image with a contextual bet that was got by taking the representations of other nearby patches and, based on those, predicting what that patch of the image should have in it. Obviously, once you're very familiar with the domain, those predictions from context and the locally extracted features will generally agree, and you'll be very surprised when they don't, and you can learn an awful lot on one trial if they disagree radically. So that's an example of where I think the brain could learn a lot from this local disagreement. It's hard to get that to work, but I'm convinced something like that is going to be the objective function. And if you think of a big image and lots of little local patches in the image, that means you get lots and lots of feedback, in terms of the agreement of what was extracted locally and what was predicted contextually, all over the image and at many different levels of representation. So we can get much, much richer feedback from these agreements with contextual predictions. Making all that work is difficult, but I think it's going to be along those lines.
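A toy sketch of the kind of local objective Hinton describes: compare a representation extracted from one patch with a prediction made from the context of the other patches, and treat disagreement as the learning signal. Everything here (the random linear maps, the dimensions, the use of a mean over other patches as "context") is an illustrative assumption, not Hinton's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# 16 image patches, each embedded into 8 dimensions.
D_patch, D_repr, N = 32, 8, 16
W_local = rng.normal(size=(D_patch, D_repr)) / np.sqrt(D_patch)  # local extractor
W_ctx = rng.normal(size=(D_repr, D_repr)) / np.sqrt(D_repr)      # context predictor

patches = rng.normal(size=(N, D_patch))
local = patches @ W_local                    # locally extracted representations
context = (local.sum(0) - local) / (N - 1)   # mean of the *other* patches
pred = context @ W_ctx                       # contextual prediction per patch

def agreement(a, b):
    """Cosine agreement between local extraction and contextual prediction."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a * b).sum(1)

# One local objective per patch: push agreement up. A radical disagreement
# on one patch is a large, very informative signal on a single trial.
loss_per_patch = 1.0 - agreement(local, pred)
print(loss_per_patch.round(2))
```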

Host [09:17]: Now, what you're describing strikes me as part of what people are trying to do in self-supervised and unsupervised learning, and in fact you wrote one of the breakthrough papers in this space, the SimCLR paper, with a couple of collaborators of course. What do you think about the SimCLR work and contrastive learning more generally, and what do you think about the recent masked autoencoders? How does that relate to what you just described?

Geoff [09:44]: It relates quite closely. It's evidence that that kind of objective function is good. I didn't write the SimCLR paper; Ting Chen wrote the SimCLR paper, with help from the other co-authors. My name was on the paper for general inspiration. But I did write a paper a long time ago, with Sue Becker, on the idea of getting agreement between representations you got from two different patches of the image. I think of that as the origin of this idea of doing self-supervised learning by having agreement between representations from two patches of the same image. The method that Sue and I used didn't work very well, because of a subtle thing that we didn't understand at the time but I now do understand. I could explain that if you like, but I'll lose most of the audience.
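For reference, the SimCLR paper mentioned here trains with a contrastive objective (NT-Xent) that rewards agreement between the representations of two augmented views of the same image. A minimal numpy sketch of that loss, with toy embeddings standing in for a real encoder:

```python
import numpy as np

def nt_xent(z1, z2, temperature=0.5):
    """Minimal NT-Xent (normalized temperature-scaled cross entropy).
    z1[i] and z2[i] are the embeddings of two views of the same image."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    n = len(z1)
    sim = z @ z.T / temperature          # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)       # a view is not its own negative
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # cross-entropy: each view should pick out its partner among all others
    logprob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -logprob[np.arange(2 * n), targets].mean()

rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(nt_xent(z1, z2))
```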

Host [10:44]: Well, I'm curious, I think it'd be great to hear it, but maybe we can zoom out for a moment before zooming back in. You talk about current methods using end-to-end learning, with back propagation to power the end-to-end learning, and you're saying that switching to learning from less data, extracting more from less data, is going to be key as a way to make progress, to get closer to how the brain learns.

Geoff [11:09]: Yes. You get much bigger bandwidth for learning by having many, many little local objective functions.

Host [11:16]: And when we look at these local objective functions, like filling in a blanked-out part of an image, or maybe filling back in a word: if we look at today's technologies, in fact this is the current frontier you've contributed to. A lot of people are working exactly on that problem of learning from unlabeled data effectively, because it requires a lot less human labor. But they still use back propagation, the same mechanism.

Geoff [11:46]: So what I don't like about the masked autoencoder is this: you have your input patches, and then you go through many layers of representation, and at the output of the net you try to reconstruct the missing input patches. I think in the brain you have these levels of representation, but at each level you're trying to reconstruct what's at the level below. It's not like you go through many, many layers and then come back out again; it's that you have all these levels, each of which is trying to reconstruct what's at the level below. I think that's much more brainlike, and the question is whether you can do that without using back propagation. Obviously, if you go through many, many levels and then reconstruct the missing patches at the output, you need to get information back through all those levels, and since we have back propagation built into all the simulators, you might as well do it that way. But I don't think that's how the brain's doing it.
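A toy sketch of the layer-local alternative Hinton sketches: a stack in which each level is trained only to reconstruct the level below it, so every layer has its own objective and no error signal travels end to end. The sizes, learning rate, and tanh/linear choices are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [64, 32, 16]     # three levels of representation

# Each level has an encoder up and a decoder that reconstructs the level below.
enc = [rng.normal(size=(a, b)) * 0.1 for a, b in zip(sizes, sizes[1:])]
dec = [rng.normal(size=(b, a)) * 0.1 for a, b in zip(sizes, sizes[1:])]

def train_step(x, lr=0.01):
    losses = []
    for i, (We, Wd) in enumerate(zip(enc, dec)):
        h = np.tanh(x @ We)         # representation at the next level up
        x_hat = h @ Wd              # local decoder reconstructs the level below
        err = x_hat - x
        losses.append((err ** 2).mean())
        # local gradients only: each layer updates from its own objective
        dec[i] -= lr * h.T @ err / len(x)
        dh = err @ Wd.T * (1 - h ** 2)
        enc[i] -= lr * x.T @ dh / len(x)
        x = h                       # the new level becomes the next layer's input
    return losses

x = rng.normal(size=(8, 64))
for _ in range(200):
    losses = train_step(x)
print([round(l, 3) for l in losses])   # per-layer reconstruction losses
```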

Host [12:44]: And now, imagining the brain is doing it with all these local objectives: do you think, for our engineered systems, it will matter? In some sense there are three choices to make, it seems. One choice is what are the objectives, what are those local objectives that we want to optimize. A second choice is what's the algorithm to use to optimize them. And a third choice is what's the architecture: how do we wire together the neurons that are doing this learning? Among those three, it seems like all three could be the missing piece that we're not getting right. What do you think?

Geoff [13:26]: If you're interested in perceptual learning, I think it's fairly clear you want retinotopic maps, a hierarchy of retinotopic maps. So the architecture is local connectivity. The point about that is you can solve a lot of the credit assignment problem by just assuming that something in one locality in a retinotopic map is going to be determined by the corresponding locality in the retinotopic map that feeds into it. So low down in the system you're not trying to figure out how pixels determine what's going on a long distance away in the image; you're going to just use local interactions. That gives you a lot of locality, and you'd be crazy not to use that.

[14:16] One thing neural nets do at present is they assume you're going to be using the same functions at every locality: convolutional nets do that, and transformers do that too. I don't think the brain can do that, because that would involve weight sharing, and it would involve doing exactly the same computation at each locality so you can use the same weights. I think it's most unlikely the brain does that. But actually, there's a way to achieve what weight sharing does, what convolutional nets do, in the brain, in a much more plausible way than I think people have suggested before. Which is: if you do have contextual predictions trying to agree with locally extracted things, then imagine a whole bunch of columns that are making local predictions and looking at nearby columns to get their contextual prediction. You can think of the context as a teacher for the local thing, but also vice versa: think of the context as a teacher for what you're extracting locally. So you can think of the information that's in the context as being distilled into the local extractor. But that's true for all the local extractors, so what you've got is mutual distillation, where they're all providing teaching signals for each other. And what that means is that knowledge about what you should extract in one location is getting transferred into other locations, if they're trying to agree. If you're trying to get different locations to agree on something, if for example you find a nose and you find a mouth and you want them both to agree that they're part of the same face, so they should both give rise to the same representation, then the fact that you're trying to get the same representation at different locations allows knowledge to be distilled from one location to another.

[16:06] And there's a big advantage of that over actual weight sharing. Obviously, biologically, one advantage is that the detailed architecture in these different locations doesn't need to be identical. But the other advantage is that the front-end processing doesn't need to be the same. If you take your retina, different parts of the retina have different size receptive fields, and convolutional nets try to ignore that. They sometimes have multiple different resolutions and do convolution at each resolution, but they just can't deal with different front-end processing. Whereas if you're distilling knowledge from one location to another, what you're trying to do is get the same function from the optic array to the representation in these different locations, and it's fine if you pre-process the optic array differently in the two different locations. You can still distill the knowledge about the function from the optic array to the representation, even though the front-end processing is different. So although distillation is less efficient than actually sharing the weights, it's much more flexible, and it's much more neurally plausible. For me that was a kind of big insight I had about a year ago: we have to have something like weight sharing to be efficient, but local distillation will work if you're trying to get neighboring things to agree on a representation. That idea of trying to get them to agree gives you the signal you need for knowledge in one location to supervise knowledge in another location.
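The distillation Hinton refers to is his 2015 knowledge-distillation objective: match softened output distributions. A minimal sketch of that loss, plus the "mutual" variant in which two hypothetical columns teach each other (the columns and logits here are toy assumptions):

```python
import numpy as np

def softmax(x, T=1.0):
    e = np.exp(x / T - (x / T).max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between softened teacher and student distributions
    (Hinton et al., 2015). The temperature T exposes the teacher's
    'dark knowledge' about relative probabilities of wrong answers."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return -(p_t * np.log(p_s + 1e-12)).sum(axis=-1).mean()

rng = np.random.default_rng(0)
# Two "columns" looking at the same object through different front ends;
# in mutual distillation, each one's prediction supervises the other's.
col_a = rng.normal(size=(5, 10))
col_b = rng.normal(size=(5, 10))
mutual = distill_loss(col_a, col_b) + distill_loss(col_b, col_a)
print(mutual)
```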

Host [17:37]: And Jeff, what do you think? What you're describing: one way to think of it is to say, hey, weight sharing is clever because it's something the brain kind of does too, it just does it differently, so we should continue to do weight sharing. Another way to think of it is that actually we shouldn't continue to do weight sharing, because the brain does it somewhat differently, and there might be a reason to do it differently. What's your thinking?

Geoff [18:01]: I think the brain doesn't do weight sharing because it's hard for it to ship connection strengths about the place. That's very easy if they're all sitting in RAM. So I think we should continue to do convolutional things in convnets, and in transformers we should share weights; we should share knowledge by sharing weights. But just bear in mind that the brain is going to share knowledge not by sharing weights, but by sharing the function from input to output, and using distillation to transfer knowledge.

Host [18:29]: Now, there's the other topic that's talked about quite a bit where the brain is drastically different from our current neural nets, and that's the fact that neurons work with spiking signals, which is very different from the artificial neurons on our GPUs. So I'm very curious about your thinking on that. Is that just an engineering difference, or do you think there could be more to it that we need to understand better, and benefits to spiking?

Geoff [18:58]: I think it's not just an engineering difference. I think once we understand why that hardware is so good, why you can do so much in such an energy-efficient way with that kind of hardware, we'll see that it's sensible for the brain to use spiking units. The retina, for example, doesn't use spiking neurons; the retina does lots of processing with non-spiking neurons. So once we understand why cortex is using those, we'll see that it was the right thing for biology to do. And I think that's going to hinge on what the learning algorithm is: how you get gradients for networks of spiking neurons, and at present nobody really knows. You see, the problem with the spiking neuron is that there are two quite different kinds of decision. One is exactly when does it spike, and the other is does it or doesn't it spike. So there's this discrete decision, should the neuron spike or not, and then this continuous variable of exactly when it should spike. People trying to optimize a system like that have come up with various kinds of surrogate functions, which smooth things a bit so you can get continuous functions; they didn't seem quite right. It'd be really nice to have a learning algorithm; in fact, at NIPS in about 2000, Andy Brown and I had a paper on trying to learn spiking Boltzmann machines. It'd be really nice to get a learning algorithm that's good for spiking neurons, and I think that's the main thing that's holding up spiking neuron hardware. People like Steve Furber in Manchester have realized, and many other people have realized, that you can make more energy-efficient hardware this way, and they've built great big systems. What they don't have is a good learning algorithm for it, and I think until we've got a good learning algorithm we won't really be able to exploit what we can do with spiking neurons.

[20:52] And there's one obvious thing you can do with them that isn't easy in conventional neural nets, and that's agreement. If you take a standard artificial neuron and you simply ask the question: can it tell if its two inputs have the same value? Well, it can't. It's not an easy thing for a standard artificial neuron to do. If you use spiking neurons, it's very easy to build a system where, if the two spikes arrive at the same time, they'll make a neuron fire, and if they arrive at different times, they won't. So using the time of a spike seems like a very good way of measuring agreement. We know the biological system does that: you can see, or rather hear, the direction a sound is coming from by the time delay in the signals reaching the two ears. If you take a foot, that's about a nanosecond for light, and it's about a millisecond for sound. And the point is, if I move something sideways in front of you by a few inches, the difference in the time delay to the two ears, in the length of the path to the two ears, is only a small fraction of an inch, so it's only a small fraction of a millisecond difference in the time the signal gets to the two ears. We can deal with that, and owls can deal with it even better. We're sensitive to times of about 30 microseconds in order to get stereo from sound; I can't remember what owls are sensitive to, but I think it's a lot better than 30 microseconds. And we do that by having two axons with spikes traveling in different directions, one from one ear and one from the other ear, and then you have cells that fire if the spikes get there at the same time. That's a simplification, but roughly that. So we know that spike timing can be used for exquisitely sensitive things like that, and it would be very surprising if the precise times of spikes weren't being used; but we really don't know how.

[23:02] And for a long time I thought it'd be really nice if you could use spike times to detect agreement, for things like self-supervised learning. Or for things like: if I've extracted your mouth and I've extracted your nose, or representations of them, and from your mouth I can now predict something about your whole face, and from your nose I can predict something about your whole face, then if your mouth and nose are in the right relationship to make a face, those predictions will agree. It'd be really nice to use spike timing to see that those predictions agree. But it's hard to make that work, and one of the reasons it's hard is that we don't have a good algorithm for training networks of spiking neurons. So that's one of the things I'm focused on now: how can we get a good training algorithm for networks of spiking neurons? I think that'll have a big impact on hardware.
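A tiny sketch of the coincidence detection described above, in the spirit of the classic delay-line picture of sound localization: a bank of units, each compensating for a different interaural delay, and only the one whose delay matches fires. All the numbers here are illustrative:

```python
import numpy as np

def coincides(t_left, t_right, window=50e-6):
    """A coincidence-detector unit fires only if spikes from the two
    ears arrive within a short window (times in seconds)."""
    return abs(t_left - t_right) < window

# Interaural delays the detector bank compensates for, in microseconds.
delays_us = np.arange(-500, 501, 100)

def localize(t_left, t_right):
    """Return the compensating delay whose detector fires first."""
    for d in delays_us:
        if coincides(t_left + d * 1e-6, t_right):
            return d
    return None

# A source off to one side reaches the right ear ~300 us late:
print(localize(0.010, 0.010 + 300e-6))   # -> 300
# Straight ahead: zero interaural delay.
print(localize(0.010, 0.010))            # -> 0
```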

Host [23:56]: That's a really interesting question you're putting forward there, because I doubt too many people are working on that, compared to, let's say, the number of people working on large language models or other problems that are much more, I guess, visible in terms of progress recently.

Geoff [24:13]: I think, yeah, it's always a good idea to figure out what huge numbers of very smart people are working on, and to work on something else.

Host [24:21]: Yeah. I think the challenge, of course, for most people, I'd say including myself, but I definitely hear the question from many students too, is that it's easy to work on something else than everybody else, but it's hard to make sure that something else is actually relevant, because there are many other things out there, not very relevant, you could possibly spend time on.

Geoff [24:42]: Yeah, that involves having good intuitions.

Host [24:45]: Yeah, and listening to you, for example, could help. So I actually have a follow-up question on something you just said, Jeff, which is that the retina doesn't use all spiking neurons. Are you saying that the brain has two types of neurons, some that are more like our artificial neurons, and some that are spiking neurons?

Geoff [25:07]: I'm not sure the retina is more like artificial neurons, but certainly the cortex, the neocortex, has spiking neurons, and its primary mode of communication is sending spikes from one pyramidal cell to another pyramidal cell. I don't think we're going to understand the brain until we understand why it chooses to send spikes. For a while I thought I had a good argument that didn't involve the precise times of spikes, and the argument went like this. The brain is in the regime where it's got lots and lots of parameters and not much data, relative to the typical neural nets we use. There's potential overfitting in that regime unless you use very strong regularization, and a good regularization technique is dropout, where each time you use a neural net you ignore a whole bunch of the units. So maybe, for the neurons that are sending spikes, what they're really communicating is an underlying Poisson rate. Let's assume it's Poisson; it's close enough for this argument. There's a Poisson process which sends spikes stochastically, but the rate of that process varies, and that's determined by the input to the neuron. You might think you'd like to send the real-valued rate from one neuron to another, but if you want to do lots and lots of regularization, you could send the real-valued rate with some noise added, and one way to add noise is to just use spikes: that'll add lots of noise. So this was the motivation for dropout: most of the time, most of the neurons aren't involved in things, if you look at any fine time window. You can think of spikes as a representation of an underlying Poisson rate; it's just a very, very noisy representation, which sounds like a very, very bad idea because it's very, very noisy. But actually, once you understand about regularization, when we have too many parameters, it's a very, very good idea. So I still have a lingering fondness for the idea that actually we're not using spike timing at all; it's just about using very noisy representations of Poisson rates to be a good regularizer.
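A sketch of that argument in code: sending spikes instead of a real-valued rate is like transmitting the rate through heavy multiplicative noise, the intuition Hinton links to dropout. The rates and window here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

rates = np.array([5.0, 20.0, 50.0])   # underlying firing rates (Hz)
window = 0.02                         # a fine 20 ms time window

# In a short window a Poisson neuron usually sends 0 or 1 spikes, so the
# instantaneous "message" is a very noisy estimate of the true rate.
spikes = rng.poisson(rates * window, size=(10000, 3))
print("true rates:", rates)
print("rates estimated from noisy spike counts:",
      (spikes.mean(0) / window).round(1))
print("fraction of windows with no spike at all:",
      (spikes == 0).mean(0).round(2))
```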

Geoff [27:36]: And I sort of flip between them. I think it's very important, when you do science, not to totally commit to one idea and ignore all the evidence for other ideas, but if you do that, you end up flipping between ideas every few years. So some years I think neural nets are deterministic, I mean we should have deterministic neural nets, and that's what backprop is using; other years (I think it's about a five-year cycle) I think no, no, it's very important that they be stochastic, and that changes the play for everything. So Boltzmann machines were intrinsically stochastic, and that was very important to them. But the main thing is not to fully commit to either of those, but to be open to both.

Host [28:18]: Now, one thing, if we think more about what you just said about the importance of spiking neurons and figuring out how to train a spiking neuron network effectively: what if we, for now, just say let's not worry about the training part? Given that it's seemingly far more power-efficient, wouldn't people want to distribute pure inference chips, where you effectively pre-train separately and then compile the result onto a spiking neuron chip, to have very low-power inference capabilities? What about that?

Geoff [28:52]: Yeah, lots of people have thought of that, and it's a very sensible idea, and it's probably on the evolutionary path to getting to use spiking neural nets. Because once you're using them for inference, and it works: people are already doing that, and it's already working, being shown to be more power-efficient, and various companies have produced these big spiking systems. Once you're doing them for inference anyway, you'll get more and more interested in how you could learn in a way that makes more use of the available power in these spike times. So you can imagine a system where you learn using backprop, but not on the analog hardware, for example, not on this low-energy hardware, and then you transfer it to the lower-energy hardware, and that's fine. But we'd really like to learn directly in the hardware.

Host [29:51]: Now, one thing that really strikes me, Jeff, is when I think about your talks back around 2005, '06, '07, '08, when I was a PhD student, essentially pre-AlexNet talks: those talks, I think, topically have a lot of resemblance to what you're excited about now, and it almost feels like AlexNet is an outlier in your path. How did you go from thinking so closely about how the brain might work to... maybe you can first explain what AlexNet was, but also how did it come about, and what was the path to go from working on restricted Boltzmann machines, trying to see how the brain works, to, I would say, the more traditional approach to neural nets, which you all of a sudden showed can actually work?

Geoff [30:40]: Well, if you're an academic, you have to raise grant money, and it's convenient to have things that actually work, even if they don't work the way you're interested in. So part of it is just going with the flow, if you can make backprop work well. Back then, in about 2005 or 2006, I got fascinated by the idea that you could use stacks of restricted Boltzmann machines to pre-train feature detectors, and then it would be much easier to get backprop to work. It turned out that with enough data, which is what you had in speech recognition, and later on, because of Fei-Fei Li and her team, in image recognition, you don't need the pre-training. Although pre-training is coming back; I mean, GPT-3 has pre-training, and pre-training is a thoroughly good idea.
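Since stacked restricted Boltzmann machines come up here, this is a minimal sketch of the contrastive-divergence (CD-1) update used to train a single RBM, the building block stacked for pre-training around 2005-2006. Biases are omitted for brevity, and the data and sizes are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

n_vis, n_hid, lr = 6, 4, 0.1
W = rng.normal(scale=0.1, size=(n_vis, n_hid))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0):
    """One CD-1 update: one Gibbs step from the data."""
    h0_p = sigmoid(v0 @ W)                       # hidden probabilities
    h0 = (rng.random(h0_p.shape) < h0_p) * 1.0   # sample hidden states
    v1_p = sigmoid(h0 @ W.T)                     # one-step reconstruction
    h1_p = sigmoid(v1_p @ W)
    # positive minus negative statistics approximate the likelihood gradient
    return (v0.T @ h0_p - v1_p.T @ h1_p) / len(v0)

data = (rng.random((20, n_vis)) < 0.3) * 1.0     # toy binary data
for _ in range(100):
    W += lr * cd1_step(data)
print(W.round(2))
```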

play31:33

um

play31:35

but

play31:37

once we

play31:39

discovered that you can pre-train and

play31:40

that will make backdrop work better and

play31:42

that did great things for speech

play31:44

which george john and abdul rahman

play31:46

muhammad did um

play31:49

in 2009

play31:51

then alex

play31:54

who was a graduate student in my group

play31:55

thing

play31:56

um

play31:57

started uh

play32:00

applying the same ideas to

play32:02

vision

play32:04

um

play32:06

and pretty soon we discovered that you

play32:08

didn't actually need this pre-training

play32:11

especially if you had the imagenet

play32:12

data

play32:14

and in fact that project

play32:17

um

play32:18

was partly due to india's persistence so

play32:20

i remember ilya coming into the lab one

play32:22

day and saying look we now that we've

play32:24

got speech recognition working this

play32:25

stuff really works we've got to do

play32:28

imagenet before anybody else does

play32:31

and retrospectively learned that janella

play32:33

come was going into the lab and saying

play32:35

look we've got to do imagenet with

play32:37

compliments before anybody else does

play32:39

and

play32:41

jan's students also and postdoc said oh

play32:43

but i'm busy doing something else so

play32:45

well

play32:46

he he couldn't actually get someone to

play32:47

commit to it

play32:49

and

play32:50

yeah initially couldn't get people to

play32:52

commit to it

play32:53

and so he persuaded alex to commit to it

play32:56

by pre-processing the data for him so he

play32:58

didn't have to pre-process the data the

play32:59

data was all pre-processed to be just

play33:01

what he needed

play33:02

and then alex really went to china and

play33:04

alex is just a superb programmer

play33:06

and it was

play33:07

alex was able to make a couple of gpus

play33:10

really sing he made them work together

play33:12

in his bedroom at home

play33:14

um i don't think his parents realized

play33:16

that they were paying most of the cost

play33:18

because that was the electricity um

play33:21

but

play33:22

he did a superb job of programming

play33:24

convolutional nets on them

play33:26

um

play33:27

so

play33:28

he said we've got to do this

play33:30

and helped alex with the design and so

play33:32

on alex did the really intricate

play33:34

programming

play33:35

and i provided support um and a few

play33:38

ideas like using dropout

play33:41

i also did some good management i'm not

play33:43

often very good at management but i'm

play33:45

very proud of the management idea which

play33:47

is alex kruszewski had to write a depth

play33:50

or to show that he was sort of capable

play33:53

of understanding research literature

play33:56

which is what you have to do after a

play33:57

couple of years to stay in the phd

play33:58

program

play33:59

and he doesn't really like writing um

play34:02

and he didn't really want to do the

play34:03

depth of it but it was way past the

play34:04

deadline and the development was

play34:05

hassling us

play34:07

so i said to him

play34:09

um each

play34:10

each time you can improve the

play34:13

performance by one percent on imagenet

play34:15

um

play34:16

you can delay your depth order by

play34:18

another week

play34:20

and alex delayed his death roll by a

play34:22

whole lot of weeks

play34:26

yeah and just for context for i mean a

play34:28

lot of researchers know this of course

play34:30

but maybe not everybody alex's result

play34:33

with you and ilya

play34:34

cut the error rate in half compared to

play34:37

prior work on the imagenet image

play34:39

recognition competition which was just

play34:41

more or less i

play34:42

i used to be a professor so it wasn't

play34:44

quite in half close it cut it from about

play35:01

a whole available well that's why

play35:02

everybody switched from what they were

play35:04

doing which was hand engineered

play35:06

approaches to computer vision try to

play35:08

program directly

play35:09

how can a computer understand what's an

play35:11

image to to deep learning i should say

play35:13

one thing that's important to say here

play35:16

um

play35:16

jalakhar spent many years um developing

play35:20

convolutional neural nets

play35:22

um

play35:23

and it really should have been him

play35:25

his lab that developed that system we

play35:27

had a few little extra tricks but they

play35:29

weren't the important thing the

play35:30

important thing was to apply

play35:31

convolutional nets

play35:32

using gpus to a big data set

play35:35

um

play35:36

so yam was kind of unlucky in that um he

play35:39

didn't get the win on that

play35:42

but it was using many of the techniques

play35:43

that he developed it didn't have the the

play35:45

russian immigrants that uh toronto and

play35:48

eu had been able to attract to make it

play35:50

happen

play35:51

well once russian one's ukrainian and

play35:53

it's important to confuse those even

play35:54

though the ukraine is a russian-speaking

play35:56

ukrainian don't confuse russian

play35:58

absolutely

play36:00

it's a it's a different country

play36:03

so

play36:03

now jeff that moment actually also

play36:06

marked a big change in your career

play36:10

because as far as i understand you've

play36:12

never been involved in

play36:15

corporate

play36:16

work

play36:17

but

play36:18

it marked a transition for you soon

play36:20

thereafter from being a pure academic to

play36:24

being ending up at google actually

play36:26

uh can you see a bit about that how was

play36:28

that for you like

play36:29

did you have any internal resistance i

play36:31

can say why that transition happened

play36:34

what triggered i'm curious

play36:36

so

play36:37

um i have a lonely disabled son who

play36:39

needs

play36:40

um

play36:41

future provisions so i needed to get a

play36:43

lump of money and i thought one way i

play36:45

might get a lump of money was by

play36:46

teaching a coursera course

play36:49

and so i did a coursera course on neural

play36:51

networks in

play36:52

2012. and it was one of the early

play36:55

coursera courses so their software

play36:56

wasn't very good so it's extremely

play36:58

irritating to do

play36:59

um

play37:01

it really was very irritating then

play37:03

i'm not very good on software so i

play37:05

didn't like that

play37:07

and

play37:09

from my point of view it amounted to

play37:12

you agree to supply a chapter of a

play37:14

textbook one chapter every week

play37:17

um

play37:18

so

play37:19

You had to give them these videos, and then a whole bunch of people are going to watch the videos. Sometimes the next day Yoshua Bengio would say, "Why did you say that?" So you know it's going to be people who know very little, but also people who know a whole lot. And so it's stressful: you know that if you make mistakes, they're going to be caught. It's not like a normal lecture, where you can just press on the sustain pedal and blow your way through it if you get slightly confused about something. Here you have to get it straight.

And the deal with the University of Toronto originally was that if any money was made from these courses, which I was hoping there would be, the money that came to the university would be split with the professor. They didn't specify exactly what the split would be, but one assumed it would be something like 50-50, and I was okay with that. The university didn't provide any support in preparing the videos. And then, after I'd started the course and could no longer back out of it, the provost made a unilateral decision, without consulting me or anybody else, that if money came from Coursera, the university would take all the money and the professor would get zero. Which is exactly the opposite of what happens with textbooks, and the process was very much like writing textbooks. I actually asked the university to help me prepare the videos, and the AV people came back to me and said, "Do you have any idea how expensive it is to make videos?" And I actually did have an idea, because I'd been doing it.

So I got really pissed off with my university, because they unilaterally cancelled the idea that I'd get any remuneration for this. They said it was part of my teaching. Well, actually it wasn't part of my teaching. It was clearly based on lectures I'd given as part of my teaching, but I was doing my teaching as well as that, and I wasn't using that course for my teaching. And that got me pissed off enough that I was willing to consider alternatives to being a professor.

And at that time we suddenly got interest from all sorts of companies in recruiting us, either in funding, giving big grants, or in funding a startup. It was clear that a number of big companies were just very interested in getting in on the act. Normally I would have just said no: I get paid by the state, we're doing research, I don't want to try to make extra money from my research, I'd rather get on with the research. But that particular experience with the university cheating me out of the money (well, it turned out they didn't cheat me out of anything, because no money came from Coursera anyway) pushed me over the edge into thinking, okay, I'm going to find some other way to make some money. That was the end of my principles.

Oh no!

Well, but the result is that these companies... In fact, if you read the Genius Makers book by Cade Metz, which I reread last week in preparation for this conversation, the book starts off with you actually running an auction for these companies to try to acquire your company, which is quite the start for a book. Very intriguing. But how was it for you?

Oh, when it was happening, it was at NIPS, and Terry had organized NIPS in a casino at Lake Tahoe. In the basement of the hotel there were these smoke-filled rooms full of people pulling one-armed bandits, with big lights flashing "You won $25,000" and all that stuff, and people gambling in other ways. And upstairs we were running this auction. We felt like we were in a movie; it felt like being in that movie The Social Network. It was great.

The reason we did it was that we had absolutely no idea how much we were worth. I consulted an IP lawyer, who said there are two ways to go about this: you could hire a professional negotiator, in which case you'll end up working for a company but they'll be pissed off with you, or you could just run an auction. As far as I know, this was the first time a small group like that just ran an auction. We ran it on Gmail. I'd worked at Google over the summer, so I knew enough about Google to know they wouldn't read our Gmail, and I'm still pretty confident they didn't read our Gmail. Microsoft wasn't so confident.

We just ran this auction where people had to gmail me their bids, and we then immediately mailed them out to everybody else with the timestamp of the gmail. And it just kept going up: the increment was half a million dollars to begin with, and then a million dollars after that. And yeah, it was pretty exciting. We discovered we were worth a lot more than we thought. Retrospectively we could probably have got more, but we got to an amount we thought was astronomical. And then, basically, we wanted to work for Google, so we stopped the auction so we could be sure of working for...

You. And as I understand it, you're still at Google today?

I'm still at Google today. Nine years later, I'm in my tenth year there. I think I'll get some kind of award when I've been there for ten years, because it's so rare, although people tend to stay at Google longer than at other companies. Yeah, I like it there. The main reason I like it is that the Brain team is a very nice team and I get along very well with Jeff Dean. He's very smart but very straightforward to deal with, and what he wants me to do is what I want to do, which is basic research. He thinks what I should be doing is trying to come up with radically new algorithms, and that's what I want to do anyway, so it's just a very nice fit. I'm no good at managing a big team to improve speech recognition by one percent; I'd be happier doing basic research.

Well, it's better to just revolutionize the field again, right?

Yeah, I would like to do it one more time.

I'm looking forward to it. I wouldn't be surprised at all.

Now, when I look at your career, and some of this information actually comes from the book, things I hadn't noticed the first time I read it: you were a computer science professor at the University of Toronto, emeritus now, I believe, but you never got a computer science degree. You got a psychology degree. And at some point you were actually a carpenter. How does that come about? How do you go from studying psychology to becoming a carpenter to getting into AI? What was the path for you there? How do you look at that?

In my last year at Cambridge I had a very difficult time and got very unhappy, and I dropped out. Just after the exams, I dropped out and became a carpenter. I'd always enjoyed carpentry more than anything else. At high school you'd sit through all the classes, and then you could stay in the evenings and do carpentry; that's what I really looked forward to.

So I became a carpenter. After I'd been a carpenter for about six months... well, you couldn't actually make a living as a carpenter, so I was a carpenter and decorator. I made the money doing the decorating, but I had the fun doing the carpentry. The point is, carpentry is more work than it looks and decorating is less work than it looks, so you can charge more per hour for decorating, unless you're a very good carpenter.

And then I met a real carpenter, and I realized I was completely hopeless at carpentry. He was making a door for a coal cellar under the sidewalk, a basement that was very damp, and he was taking pieces of wood and arranging them so they would warp in opposite directions, so that it would cancel out. That was a level of understanding and thought about the process that had never occurred to me. He could also take a piece of wood and just cut it exactly square with a hand saw. And he explained something useful to me. He said: if you want to cut a piece of wood square, you have to line the saw bench up with the room, and you have to line the piece of wood up with the room. You can't cut it square if it's not aligned with the room. Which is very interesting in terms of coordinate frames.

So anyway, because I was so hopeless compared with him, I decided I might as well go back into AI.

Now, when you say you got back into AI, as I understand it this was at the University of Edinburgh, where you went for your PhD?

Yeah, I went to do a PhD there, a PhD on neural networks, with an eminent professor called Christopher Longuet-Higgins, who was really very brilliant. He almost got a Nobel Prize when he was in his 30s for figuring out something about the structure of boron hydride. I still don't understand what it is, because it's to do with quantum mechanics, but it hinged on the fact that a 360-degree rotation is not the identity operator; 720 degrees is. There are things in books you can find about it.

Anyway, he was interested in neural nets and their relation to holograms, and about the day I arrived in Edinburgh he lost interest in neural nets, because he read Winograd's thesis and became completely converted. He thought neural nets were the wrong way to think about it, that we should do symbolic AI. He was very impressed by Winograd's thesis. But he had a lot of integrity, so even though he completely disagreed with what I was doing, he didn't stop me doing it. He kept trying to get me to do stuff more like Winograd's thesis, but he let me carry on doing what I was doing.

And yeah, I was a bit of a loner. Everybody else back then, in the early 70s, was saying Minsky and Papert have shown that neural nets are nonsense, why are you doing this stuff, it's crazy. In fact, the first talk I ever gave to that group was about how to do true recursion with neural networks. This was in 1973, so 49 years ago.

One of my first projects, and I discovered a write-up of it recently, was this: you want a neural network that will be able to draw a shape, and you want it to parse the shape into parts, and you want it to be possible for a part of the shape to be drawn by the same neural hardware as the whole shape is being drawn by. So the neural hardware that's storing the whole shape has to remember where it's got to in the whole shape, and what the orientation and position and size are for the whole shape. But now it has to go off, and you want to use the very same neurons for drawing a part of the shape. So you need somewhere to remember what the whole shape was and how far you'd got in it, so you can pop back to that once you've finished doing this subroutine, this part of the shape. And the question is how the neural network is going to remember that, because obviously you can't just copy the neurons.

I managed to get a system working where the neural network remembered it by having fast Hebbian weights that were adapting all the time, adapting so that any state the network had been in recently could be retrieved by giving it part of that state and saying "fill in the rest". So I had a neural net doing true recursion, reusing the same neurons and the same weights for the recursive call as it used for the high-level call. And that was in 1973. I think people didn't understand the talk, because I wasn't very good at giving talks, but they also said: why would you want to do recursion with neural nets? You can do recursion with Lisp. They didn't understand the point, which is that unless we get neural nets to do something like recursion, we're never going to be able to explain a whole bunch of things. And now that's become an interesting question again. So I'm going to wait one more year, until the idea is a genuine antique, 50 years old, and then I'm going to write up the research I did then. It was all about fast weights as a memory.
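To make that concrete, here is a minimal sketch of a fast-weight associative memory. It is entirely my own construction, not Hinton's 1973 system: recent hidden states leave a rapidly decaying Hebbian trace in a separate weight matrix, and a recently visited state can then be retrieved from part of itself, which is what would let a network "pop back" after a recursive call. All names and constants are illustrative assumptions.

```python
# Toy fast-weight memory: store recent states in a decaying Hebbian trace,
# then recover the most recent state from a partial cue.
import numpy as np

rng = np.random.default_rng(0)
n = 200            # number of neurons
decay, lr = 0.9, 0.5

A = np.zeros((n, n))   # fast weights: a temporary associative store

def store(A, h):
    """Hebbian update: recent states leave a quickly decaying trace."""
    return decay * A + lr * np.outer(h, h) / n

def recall(A, cue, known, steps=20):
    """Iteratively fill in the unknown part of a state from a partial cue."""
    h = cue.copy()
    for _ in range(steps):
        h = np.sign(A @ h)        # associative retrieval step
        h[known] = cue[known]     # keep the part we were given clamped
    return h

# store three recent "states of the parent call"
states = [np.sign(rng.standard_normal(n)) for _ in range(3)]
for s in states:
    A = store(A, s)

# cue with only the first half of the most recent state
cue = states[-1].copy()
known = np.arange(n) < n // 2
cue[~known] = 0.0

h = recall(A, cue, known)
print("fraction of the state recovered:", (h == states[-1]).mean())  # near 1.0
```

The slow weights and the neurons themselves stay free for the subroutine call; only the fast weights carry the saved context.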

I have many questions here, Jeff. The first one is: you're standing in this room, as a PhD student or maybe fresh out of your PhD, with essentially everybody telling you that what you're working on is a waste of time, and somehow you were convinced it was not. Where do you get that conviction from?

I think a large part of it was my schooling. My father was a communist, but he sent me to an expensive private school, because they had good science education. I was there from the age of seven; they had a preschool. And it was a Christian school. All the other kids believed in God, and at home I was taught that that was nonsense. And it did seem to me that it was nonsense. So I was used to everybody else being wrong, and obviously wrong. And I think that's important. I was about to say you need faith, which is funny in this situation. You need faith in science, to be willing to work on stuff just because it's obviously right, even though everybody else says it's nonsense.

And in fact it wasn't everybody else; it was nearly everybody doing AI in the early 70s who said it was nonsense. If you look a bit earlier, in the 50s, both von Neumann and Turing believed in neural nets. Turing in particular believed in neural nets trained with reinforcement. I still believe that if they hadn't both died early, the whole history of AI might have been very different, because they were powerful enough intellects to have swayed the field, and they were very interested in how the brain works. So I think it was just bad luck that they both died early. Well, British intelligence might have come into it, but...

Now, you go from believing in this, well, at the time many people didn't, to getting the big breakthroughs that resulted in it powering almost everything that's being done today. And now there is, in some sense, the next question, right? It's not just that deep learning works, and works great; the question becomes: is it all we need, or will we need other things? And you've said things, maybe I'm not literally quoting you, to the extent of "deep learning will do everything".

What I really meant by that... I sometimes say things without thinking, without being accurate enough, and then people call me on it, like saying we won't need radiologists. What I really meant was: using stochastic gradient descent to adjust a whole bunch of parameters, that's what I had in mind when I said deep learning. The way you get the gradient might not be back propagation, and the thing you get the gradient of might not be some final performance measure, but rather lots of these local objective functions. But I think that's how the brain works, and I think that's going to explain everything.

Yes. Well, nice to see it confirmed.
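As a toy illustration of "lots of local objective functions" rather than one end-to-end loss, here is a sketch of my own, not a method from the interview: each layer below is trained only by its own greedy local objective (a reconstruction loss, chosen purely for simplicity), and inputs are detached so no gradient ever flows between layers.

```python
# Each layer gets its own local loss; there is no end-to-end backpropagation.
import torch
import torch.nn as nn

torch.manual_seed(0)
layers = nn.ModuleList([nn.Linear(32, 32) for _ in range(3)])
# one illustrative local objective per layer: reconstruct the layer's input
decoders = nn.ModuleList([nn.Linear(32, 32) for _ in range(3)])
opts = [torch.optim.SGD(list(l.parameters()) + list(d.parameters()), lr=0.01)
        for l, d in zip(layers, decoders)]

for step in range(100):
    h = torch.randn(64, 32)                # a batch of stand-in input vectors
    for layer, dec, opt in zip(layers, decoders, opts):
        x = h.detach()                     # block gradients from earlier layers
        h = torch.tanh(layer(x))
        local_loss = ((dec(h) - x) ** 2).mean()   # purely local target
        opt.zero_grad()
        local_loss.backward()              # gradient stays inside this layer
        opt.step()
        h = h.detach()                     # next layer sees a constant input
```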

So, one other thing I want to say: the kind of computers we have now are very good for doing banking, because they can remember exactly how much you have in your account. It wouldn't be so good if you went in and they said, "Well, you've got roughly this much, we're not really sure, because we don't do it to that precision." We don't want that in a computer doing banking, or in a computer guiding the space shuttle or something; we would really rather it got the answer exactly right. And they're very different from us.

I think people aren't sufficiently aware that we made a decision about how computing would be, which is that our knowledge will be immortal. If you look at existing computers, you have a computer program, or maybe you just have a lot of weights for a neural net, which is a different kind of program. But if your hardware dies, you can run the same program on another piece of hardware. And so that makes the knowledge immortal: it doesn't hinge on that particular piece of hardware surviving.

Now, the cost of that immortality is huge, because it means two different bits of hardware have to do exactly the same thing. Obviously there's error correction and all that, but after you've done all the error correction, they have to do exactly the same thing, which means they had better be digital, or mostly digital. And they're probably going to do things like multiplying numbers together, which involves using lots and lots of energy to make things behave very discretely, which is not what hardware really wants to do. So as soon as you commit yourself to the immortality of your program or your neural net, you're committed to very expensive computation, and also to very expensive manufacturing processes: you need to manufacture these things accurately, probably in 2D, and then put lots of 2D things together.

If you're just willing to give up on immortality... In fiction, normally what you get in return is love. But if we're willing to give up immortality, what we'll get in return is very low-energy computation and very cheap manufacturing. So instead of manufacturing computers, what we should do is grow them. We should use nanotechnology to just grow the things in 3D, and each one will be slightly different. The image I have is of a pot plant: if you pull it out of its pot, there's a root ball, and it's the shape of the pot. All the different pot plants have the same shaped root ball, but the details of the roots are all different. Yet they're all doing the same thing, extracting nutrients from the soil; they have the same function, and they're pretty much the same, but the details are all very different. That's what real brains are like, and I think that's what what I call mortal computers will be like.

These are computers that are grown rather than manufactured. You can't program them; they just learn. They obviously have to have a learning algorithm sort of built into them. And they can do most of their computation in analog, because analog is very good for things like multiplying a voltage by a conductance to turn it into a charge, and then adding up the charges; there are already chips that do things like that. The problem is what you do next, and how you learn in those chips. At present people have suggested back propagation, or various versions of Boltzmann machines. I think we're going to need something else.
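One reason the weights of such a machine would be tied to their hardware is easy to simulate. In this toy numpy sketch (my own construction, not from the interview), every "device" computes the same matrix multiply through its own fixed fabrication noise; training on device A silently absorbs A's imperfections, so copying the learned weights to device B does much worse.

```python
# Why weights learned on one "mortal" analog device may not transfer.
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, n = 20, 5, 2000
X = rng.standard_normal((n, d_in))
W_true = rng.standard_normal((d_out, d_in))
Y = X @ W_true.T

def device_gain(seed):
    # fixed multiplicative variation, unique to each physical device
    return 1.0 + 0.2 * np.random.default_rng(seed).standard_normal((d_out, d_in))

gain_A, gain_B = device_gain(10), device_gain(11)

def forward(W, gain, X):
    return X @ (W * gain).T          # the device's analog matmul

# train W *on device A*, so learning absorbs A's particular imperfections
W = np.zeros((d_out, d_in))
for _ in range(500):
    err = forward(W, gain_A, X) - Y
    W -= 0.1 * (err.T @ X) * gain_A / n   # gradient flows through A's gains

mse = lambda P: ((P - Y) ** 2).mean()
print("on device A:            ", mse(forward(W, gain_A, X)))  # small
print("same weights on device B:", mse(forward(W, gain_B, X)))  # much larger
```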

But I think sometime in the not too distant future we're going to see mortal computers, which are very cheap to create, have to get all their knowledge in there by learning, and are very low energy. And these mortal computers, when they die, they die, and their knowledge dies with them. It's no use looking at the weights, because those weights only work for that particular hardware. So what you have to do is distill the knowledge into other computers. When these mortal computers get old, they're going to have to do lots of podcasts to try to get the knowledge into younger mortal computers.

The first one you build, I'll happily have it on the show. Let me know.
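Since "distill" is the operative word here, a minimal sketch of distillation in the sense of Hinton's own earlier work on soft targets (the tiny networks and constants below are my illustrative assumptions, not anything specified in the interview): the student never sees the teacher's weights, only its temperature-softened output probabilities.

```python
# Knowledge moves between machines via soft targets, not weight copying.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
teacher = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(),
                              torch.nn.Linear(64, 10))
student = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(),
                              torch.nn.Linear(32, 10))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
T = 2.0                                    # softening temperature

for step in range(1000):
    x = torch.randn(128, 16)               # unlabeled transfer data
    with torch.no_grad():
        soft_targets = F.softmax(teacher(x) / T, dim=1)
    log_probs = F.log_softmax(student(x) / T, dim=1)
    loss = F.kl_div(log_probs, soft_targets, reduction="batchmean") * T * T
    opt.zero_grad()
    loss.backward()
    opt.step()
```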

So, Jeff, this reminds me of another question that's been on my mind for you. When you think about today's neural nets, the ones that grab the headlines are very, very large. Not as large as the brain, maybe, but in some sense starting to get that way in size, right? The large language models. And the results look very, very impressive. So one, I'm curious about your take on those kinds of models, what you see in them and what you see as limitations. But two, I'm also curious what you think about working on the other end of the spectrum. For example, ants have much smaller brains than humans, obviously, yet it's fair to say that the visual motor systems we have developed artificially are not yet at the level of what ants can pull off, or bees and so forth. So I'm curious about that whole spectrum, as well as the recent big advances in language models. What do you think about those?

So, bees: they may look small to you, but I think a bee has about a million neurons. A bee is closer to GPT-3 than you might think; a bee is actually quite a big neural net.

My belief is that if you take a system with lots of parameters, and they're tuned sensibly, using some kind of gradient descent on some kind of sensible objective function, then you'll get wonderful properties out of it, all these emergent properties, like you do with GPT-3 and also the Google equivalents that have been talked about so much. That doesn't settle the issue of whether they're doing it the same way as us. And I think we're doing a lot more things like recursion, which I think we do in neural nets. I tried to address some of these issues in a paper I put on the web last year called GLOM. Well, I call it GLOM; it's about how you do part-whole hierarchies in neural nets.

You definitely have to have structure. If what you mean by symbolic computation is just that you have part-whole structure, then we do symbolic computation. But that's not normally what people meant by symbolic computation. Hard-line symbolic computation means you're using symbols, and you're operating on symbols using rules that depend just on the form of the symbol string you're processing, and the only property a symbol has is that it's either identical or not identical to some other symbol, and perhaps that it can be used as a pointer to get at something. Neural nets are very different from that. So hard-line symbol processing: I don't think we do that. But we certainly deal with part-whole hierarchies, and I think we do it in great big neural nets.

And I'm sort of up in the air at present as to what extent GPT-3 really understands what it's saying. I think it's fairly clear it's not just like the old ELIZA program, which just rearranged strings of symbols and had no clue what it was talking about. The reason for believing that is: you say, in English, "show me a picture of a hamster wearing a red hat", and it draws a picture of a hamster wearing a red hat, and you're fairly sure it never saw that pair before. So it has to understand the relationship between the English string and the picture.

And before it had done that, if you'd asked any of these doubters, these neural net skeptics, neural net deniers, let's call them neural net deniers: "How would you show that it understands?", I think they'd have accepted that if you ask it to draw a picture of something and it draws a picture of that thing, then it understood. Just as with Winograd's thesis: you ask it to put the blue block in the green box, and it puts the blue block in the green box, and that's pretty good evidence it understood what you said. But now that it does it, of course, the skeptics say, "Well, you know, that doesn't really count."

There's nothing that would satisfy them, basically. The goal line's always moving, for true skeptics.

Yeah.

Now, there's the recent one, the Google one, the PaLM model, which in the paper showed how it was, effectively, explaining how jokes work. That was extraordinary; it just seemed a very deep understanding of language. Or was it just rearranging the words it had in its training data? What do you think?

No. I didn't see how it could generate those explanations without, in some sense, understanding what's going on. Now, I'm still open to the idea that, because it was trained with back propagation, it's going to end up with a very different sort of understanding from ours. And obviously adversarial images tell you a lot: you can recognize objects by using their textures, and you can be correct about it, in the sense that it will generalize to other instances of those objects, but it's a completely different way of doing it from the way we do it.

I like to think of the example of insects and flowers. Insects see in the ultraviolet, so two flowers that look the same to us can look completely different to an insect. Now, because the flowers look the same to us, do we say the insects are getting it wrong? These flowers evolved with the insects, to give signals in the ultraviolet telling them which flower it is, so it's clear the insects are getting it right and we just can't see the difference. That's another way of thinking about adversarial examples. This thing the network says is an ostrich looks like a school bus to us, but if you look in the texture domain, then it actually is an ostrich. The question is who's right. In the case of the insects, just because two flowers look identical to us doesn't mean they're really the same; the insects are right about them being very different. In that case it's different parts of the electromagnetic spectrum indicating the difference, parts we can't pick up on, but it could be the same kind of thing for image recognition with our current neural nets.

You could argue, maybe, that since we build them and we want them to do things for us in our world, we really don't want to just say, okay, they got it right and we got it wrong. I mean, they need to recognize the car and the pedestrian.

Yeah, I agree. I just want to show it's not as simple as you might think, this question of who's right and who's wrong. And part of the point of my GLOM paper was to try to build perceptual systems that work more like us, so they're much more likely to make the same kinds of mistakes as us and not make very different kinds of mistakes. Obviously, if you've got a self-driving car, for example, and it makes a mistake that any normal human driver would have made, that seems much more acceptable than making a really dumb mistake.

So, Jeff, as I understand it, sleep is something you also think about. Can you say a bit more?

Yes, I often think about it when I'm not sleeping at night. So, there's something funny about sleep. Animals do it; fruit flies sleep, and it may just be to stop them flying around in the dark. But if you deprive people of sleep, they go really weird. If you deprive someone of sleep for three days, they'll start hallucinating; if you deprive someone for a week, they'll go psychotic, and they never recover. These are nice experiments done by the CIA, I think. And the question is why: what is the computational function of sleep? There's presumably some pretty important function, if depriving you of it makes you completely fall apart. Current theories are things like: it's for consolidating memories, or maybe for downloading things from hippocampus into cortex, which is a bit odd, since it had to come through cortex to get into hippocampus in the first place.

A long time ago, in the early 80s, Terry Sejnowski and I had this theory called Boltzmann machines, and it was partly based on an insight of Francis Crick's. When he was thinking about Hopfield nets, Francis Crick and Graeme Mitchison had a paper about sleep, with the idea that you would hit the net with random things and tell it not to be happy with random things. In a Hopfield net, you give it something you want it to memorize, and it changes the weights so the energy of that vector is lower. The idea was that if you also give it random vectors and say "make the energy higher", the whole thing works better. And that led to Boltzmann machines, where we figured out that if, instead of giving it random things, you give it things generated from the model's own Markov chain, and you say make those less likely and make the data more likely, that is actually maximum likelihood learning. We got very excited about that, because we thought, okay, that's what sleep is for: sleep is this negative phase of learning.

It comes up again now in contrastive learning, where you take two patches from the same image and try to get them to have similar representations, and two patches from different images and try to get them to have representations that are sufficiently different. Once they're different, you don't make them any more different; you just stop them being too similar. That's how contrastive learning works. Now, with Boltzmann machines you couldn't actually separate the positive phase from the negative phase; you had to interleave positive examples and negative examples, otherwise the whole thing would go wrong. I tried hard not to interleave them; it's quite hard to do a lot of positive examples followed by a lot of negative examples.

What I discovered a couple of years ago, which got me very excited and caused me to agree to give lots of talks that I then cancelled when I couldn't make it work better, was that with contrastive learning you can actually separate the positive and negative phases. You can do lots of examples of positive pairs followed by lots of examples of negative pairs. And that's great, because it means you could have something like a video pipeline where you're just trying to make things similar while you're awake, and trying to make things dissimilar while you're asleep, if you can figure out how sleep can generate video for you. It makes the contrastive learning idea much more plausible if you can separate the positive and negative phases, do them at different times, and do a whole bunch of positive updates followed by a whole bunch of negative updates. Even for standard contrastive learning you can do that moderately well; you have to use lots of momentum and stuff like that, all sorts of little tricks, but you can make it work.

So I think it's quite likely that the function of sleep is to do unlearning on negative examples, and that's why you don't remember your dreams. You don't want to remember them; you're unlearning them. Crick pointed this out. You'll remember the ones that are in the fast weights when you wake up, because the fast weights are a temporary store; that's not unlearning, that still works the same way. But for the long-term memory, the whole point is to get rid of those things. That's why you dream for many hours a night, but when you wake up you can only remember the last minute of the dream you were having when you woke up. And I think this is a much more plausible theory of sleep than any other I've seen, because it explains why, if you got rid of it, the whole system would just fall apart: you'd go disastrously wrong and start hallucinating and doing all sorts of weird things.
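Here is a toy sketch of what phase-separated contrastive learning could look like. It is my own construction, inspired by the discussion, not Hinton's unpublished method, and the networks and constants are illustrative assumptions: a "wake" phase does many purely positive updates that pull two views of the same input together, and a separate "sleep" phase does many purely negative updates that push different inputs apart only when they get too similar.

```python
# Contrastive learning with the positive and negative phases run separately.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
encoder = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(),
                              torch.nn.Linear(64, 16))
opt = torch.optim.SGD(encoder.parameters(), lr=0.05)

def embed(x):
    return F.normalize(encoder(x), dim=1)

def augment(x):
    return x + 0.1 * torch.randn_like(x)   # stand-in for real augmentation

data = torch.randn(256, 64)

for epoch in range(20):
    # "wake": many purely positive updates (two views of the same input)
    for _ in range(10):
        x = data[torch.randint(0, 256, (32,))]
        z1, z2 = embed(augment(x)), embed(augment(x))
        loss = (1 - (z1 * z2).sum(dim=1)).mean()   # pull pairs together
        opt.zero_grad(); loss.backward(); opt.step()
    # "sleep": many purely negative updates (different inputs)
    for _ in range(10):
        x = data[torch.randint(0, 256, (32,))]
        z = embed(augment(x))
        sim = z @ z.t()                      # pairwise similarities
        off = sim - torch.eye(32) * sim      # zero out the diagonal
        loss = F.relu(off - 0.5).mean()      # only penalize too-similar pairs
        opt.zero_grad(); loss.backward(); opt.step()
```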

Let me say a little more about the need for negative examples when you're doing contrastive learning. If you've got a neural net, and it's trying to optimize some internal objective function, something about the kinds of representations it has, or about the agreement between contextual predictions and local predictions, it wants that agreement to be a property of the real data. And the problem inside a neural net is that you might get all sorts of correlations in your inputs. I'm a neuron, right, so I get all sorts of correlations in my inputs, and those correlations may have nothing to do with the real data; they're caused by the wiring of the network and the way it's embedded in the network. If two neurons are both looking at the same pixel, they'll have a correlation, but that doesn't tell you anything about the data.

So the question is: how do you learn to extract structure that's about the real data, and not about the wiring of your network? And the way to do that is to feed it positive examples and say, find structure in the positive examples that isn't in the negative examples, because the negative examples go through exactly the same wiring. If the structure is not in the negative examples but is in the positive examples, then the structure is about the difference between the positive and negative examples, not about your wiring. People don't think about this much, but if you have powerful learning algorithms, you'd better not let them learn about the neural network's own weights and wiring; that's not what's interesting.

Now, when you think about people who don't get sleep and start hallucinating: is hallucinating effectively trying to do the same thing, just while you're awake?

Obviously you can have little naps, and that's very helpful, and maybe hallucinating when you're awake is serving the same function as in sleep. And, I mean, all the experiments I've done say it's better not to have sixteen hours awake and eight hours of sleep; it's better to have a few hours awake and a few hours of sleep. A lot of people have discovered that little naps help. Einstein used to take little naps all the time, and he did okay.

Yeah, he did very well. For sure.

Now, there's this other thing: you've brought up this notion of "student beats teacher". What does that refer to?

Okay, so a long time ago I did an experiment on MNIST, which is a standard database of handwritten digits, where you take the training data and corrupt it: you substitute a wrong label, one of the other nine labels, eighty percent of the time. So now you've got a data set in which the labels are correct twenty percent of the time and wrong eighty percent of the time. And the question is: can you learn from that, and how well? The answer is you can learn to get about 95 percent correct. So now you've got a teacher who's wrong 80 percent of the time, and the student is right 95 percent of the time; the student is much, much better than the teacher. And this isn't a setup where each example is corrupted afresh every time you see it: you corrupt the training examples once and for all, so you can't average away the corruption over repeated presentations. You might be able to average it away over different training cases that happen to have similar images.

And if you ask how many training cases you need when you have corrupted ones: this was of great interest because of the Tiny Images data set some time ago, which had 80 million tiny images with a lot of wrong labels in. The question is: would you rather have a million things that are flakily labeled, or 10,000 things with accurate labels? And I had a hypothesis that what counts is the amount of mutual information between the label and the truth. If the labels are corrupted ninety percent of the time, there's no mutual information between the labels and the truth; if they're corrupted eighty percent of the time, there's only a small amount of mutual information.

Is that what you think it is?

My memory is that it's 0.06 bits per case, whereas if the labels are uncorrupted it's about 3.3 bits per case. So it's only a tiny amount. And then the question is: suppose I balance the size of the training set so as to put the same amount of mutual information in there. If there's a 50th of the mutual information per example, and I have 50 times as many examples, do I get the same performance? And the answer is yes, you do, to within a factor of two; the training set actually needs to be about twice that big, but roughly speaking, you can see how useful a training example is by the amount of mutual information between the label and the truth.
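The numbers Hinton recalls are easy to verify. For 10 balanced classes, where a label stays correct with probability p and is otherwise uniform over the other nine classes, a quick calculation (my own check, which matches his figures) gives about 0.06 bits at 80% corruption, exactly 0 bits at 90% corruption, and log2(10), about 3.3 bits, for clean labels.

```python
# Mutual information between a noisy label and the true class.
import math

def label_truth_mi(p, k=10):
    # I(T;L) = H(L) - H(L|T); labels are uniform overall, so H(L) = log2(k)
    h_l = math.log2(k)
    q = (1 - p) / (k - 1)               # probability of each wrong label
    h_l_given_t = -p * math.log2(p) - (k - 1) * q * math.log2(q)
    return h_l - h_l_given_t

print(f"80% corrupted: {label_truth_mi(0.2):.3f} bits")  # ~0.064
print(f"90% corrupted: {label_truth_mi(0.1):.3f} bits")  # 0.000 (chance level)
print(f"clean labels:  {math.log2(10):.3f} bits")        # ~3.322
```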

And I noticed recently you have something for doing sim-to-real, where you're labeling real data using a neural net, and those labels aren't perfect, and then the student that learned from those labels is better than the teacher it learned from. People are always puzzled by how the student could be better than the teacher, but in neural nets it's very easy: the student will be better than the teacher if there's enough training data, even if the teacher is very flaky. I have a paper from a few years ago with Melody Guan about this, for some medical data; the first part of the paper talks about it. The rule of thumb is basically that what counts is the mutual information between the assigned label and the truth, and that tells you how valuable a training example is. So you can make do with lots of flaky ones.

That's so interesting. Now, in the work we did that you just referenced, Jeff, and in the work I've seen become quite popular recently, usually the teacher provides noisy labels, but then not all the noisy labels are used; there's a notion of only looking at the ones where the teacher is more confident. In your description, that doesn't seem to matter, though it's obviously a good hack?

Yeah, you don't need to do that. It's a good hack, and it probably helps to only look at the ones where you have reason to believe the teacher got it right, but it'll work even if you just look at them all. And there's a phase transition. With MNIST, Melody plotted a graph, and as long as you get about 20 percent of the labels right, your student will get about 95 percent correct.

Wow.

But as you get down to about 15 percent right, you suddenly get a phase transition where you don't do any better than chance. Because somehow the student has to... the teacher is supplying these labels, and the student has to, in some sense, understand which cases are right and which are wrong, to see the relationship between the labels and the inputs. Once the student has seen that relationship, a wrongly labeled thing is just very obviously wrong, so it's fine if things are randomly wrongly labeled. But there is a phase transition where the labels have to be good enough for the student to get the idea.

That explains how our students are all smarter than us: we only need to get it right a small fraction of the time.

Right. And I'm sure the students do some of this data curation, where you say something and the student thinks, "Oh, that's rubbish, I'm not going to listen to that."

Those are the very best students, you know.

Yeah, those are the ones that can surprise us.

Now, one of the things that's really important in neural net learning, especially when you're building models, is to get an understanding of what it is learning, and often people try to somehow visualize what's happening during learning. One of the most prevalent visualization techniques is called t-SNE, which is something you invented, Jeff. So I'm curious how you came up with that. Maybe first describe what it does, and then what's the story behind it?

So, if you have some high-dimensional data and you try to draw a 2D or 3D map of it, you could take the first two principal components and just plot those. But what principal components analysis cares about is getting the big distances right: if two things are very different, it's very concerned to make them very different in the 2D space. It doesn't care at all about the small differences, because it's operating on the squares of the big differences, so it won't preserve high-dimensional similarity very well. And you're often interested in just the opposite: you've got some data, you're interested in what's very similar to what, and you don't care if it gets the big distances a bit wrong, as long as it gets the small distances right.

So I had the idea, a long time ago, of taking the distances and turning them into probabilities of pairs. There are various versions of SNE, but suppose we turn them into the probability of a pair, such that pairs with a small distance are probable and pairs with a big distance are improbable. We're converting distances into probabilities in such a way that small distances correspond to big probabilities. You do that by putting a Gaussian around a data point and computing the density of the other data point under this Gaussian; that's an unnormalized probability, and then you normalize these things. Then you try to lay the points out in 2D so as to preserve those probabilities. It won't care much if two points are far apart; they'll have a very low pairwise probability, and it doesn't care about the relative positions of those two points. But it does care about the relative positions of the ones with high pairwise probabilities. That produced quite nice maps, and it was called stochastic neighbour embedding, because the way we thought of it was: you put down a Gaussian and you stochastically pick a neighbour according to the density under the Gaussian. I did that work with Sam Roweis, and it had very nice simple derivatives, which convinced me we were onto something. We got nice maps, but they tended to crowd things together.

And there's obviously a basic problem in converting high-dimensional data into low-dimensional data. SNE, stochastic neighbour embedding, tends to crowd things together, and that's because of the nature of high-dimensional and low-dimensional spaces. In a high-dimensional space, a data point can be close to lots of other points without them all being too close to each other. In a low-dimensional space, they all have to be close to each other if they're all close to this data point. So you've got a problem in embedding closenesses from high dimensions into low dimensions.

I had the idea, when I was doing SNE, that since I was using probabilities as this kind of intermediate currency, there should be a mixture version, where you say: in high dimensions, the probability of a pair is proportional to e to the minus the squared distance, under a Gaussian; and in low dimensions, supposing you have two different maps, the probability of a pair is the sum of e to the minus the squared distance in the first 2D map and e to the minus the squared distance in the second 2D map. That way, if we have a word like "bank" and we're trying to put similar words near one another, "bank" can be close to "greed" in one map and close to "river" in the other map, without "river" ever being close to "greed". I really pushed that idea, because I thought it was a really neat idea that you could have a mixture of maps. Ilya was one of the first people to work on that, and James Cook worked on it a lot, and several other students worked on it, and we never really got it to work well. I was very disappointed that we hadn't been able to make use of the mixture idea.

Then I went to a simpler version, which I called UNI-SNE, which was a mixture of a Gaussian and a uniform, and that worked much better. The idea is that in one map, all pairs are equally probable. That gives you a small background probability, which takes care of the big distances. Then in the other map, you contribute a probability proportional to e to the minus your squared distance in that map. But it means that in this other map, things can be very far apart if they want to be, because the fact that they need some probability is taken care of by the uniform.

And then I got a review paper from a fellow called Laurens van der Maaten, which I thought was actually a published paper, because of the form it arrived in, but it wasn't actually a published paper. He wanted to come do research with me, and since I thought he had this published paper, I invited him to come do research. It turned out he was extremely good, so it's lucky I'd been mistaken in thinking it was a published paper.

play85:03

And we started on UNI-SNE. And then I realized that UNI-SNE is actually a special case of using a mixture of a Gaussian and a very, very broad Gaussian, which is a uniform. So what if we used a whole hierarchy of Gaussians, many, many Gaussians with different widths? And that's called a t-distribution. And that led to t-SNE, and t-SNE works much better.
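The heavy-tailed distribution this leads to gives t-SNE's low-dimensional similarity; this is the form in the published t-SNE paper, a Student t with one degree of freedom:

```latex
% t-SNE low-dimensional similarity: heavy tails let dissimilar
% points sit far apart without paying a large penalty.
\[
q_{ij} \;=\;
\frac{\big(1 + \|y_i - y_j\|^2\big)^{-1}}
     {\sum_{k \neq l} \big(1 + \|y_k - y_l\|^2\big)^{-1}}
\]
```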

And t-SNE has a very nice property: it can show you things at multiple scales, because it's got a kind of one-over-d-squared property. Once distances get big, it behaves just like gravity, and just as with clusters of galaxies you have clusters of galaxies, and galaxies, and clusters of stars, and so on, you get structure at many different levels in it; you get the coarse structure and the fine structure all showing up.
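As a practical aside (not from the conversation), the scikit-learn implementation makes t-SNE easy to try; the dataset and parameter values below are illustrative choices only:

```python
# A minimal t-SNE usage sketch with scikit-learn (illustrative
# parameters; the digits dataset stands in for any feature matrix).
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)  # 1797 samples, 64 dimensions

# Perplexity roughly sets the effective number of neighbors each
# point tries to keep close, echoing SNE's probabilistic matching.
embedding = TSNE(n_components=2, perplexity=30.0,
                 init="pca", random_state=0).fit_transform(X)

plt.scatter(embedding[:, 0], embedding[:, 1], c=y, s=5, cmap="tab10")
plt.title("t-SNE map of the digits dataset")
plt.show()
```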

Now, the objective function used for all this, which was the sort of relative densities under a Gaussian, came from other work I did with Alberto Paccanaro earlier.

We found that work hard to get published. I got a review of that work, when it was rejected by some conference, saying Hinton's been working on this idea for seven years and nobody's interested. I take those reviews as telling me I'm on to something very original.

And that work actually had in it the function that's now used, I think it's called NCE, using these contrastive methods. And t-SNE is actually a version of that function, but it's being used for making maps.
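For reference, the noise-contrastive estimation objective being gestured at trains a model to tell data from noise; the form below follows Gutmann and Hyvarinen's formulation in assumed notation, and the exact relation to t-SNE is the connection being drawn in the conversation, not derived here:

```latex
% NCE: logistic discrimination between one data sample and k noise
% samples, using the model score s_theta(x) as the logit.
\[
J(\theta) =
\mathbb{E}_{x \sim p_{\text{data}}}\big[\log \sigma(s_\theta(x))\big]
+ k\,\mathbb{E}_{x' \sim p_{\text{noise}}}\big[\log\big(1 - \sigma(s_\theta(x'))\big)\big]
\]
```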

So t-SNE has a very long history: getting the original SNE, then trying to make a mixture version, and it's just not working and not working and not working, and then eventually figuring out that a t-distribution was the kind of mixture you wanted to use. And Laurens arriving, who was very smart and a very good programmer, really made it all work beautifully.

This is really interesting, because it seems that in a lot of the progress these days the big idea plays the big role, but here it seems it was really getting the details right that was the only way to get it to fully work.

You typically need both. You have to have a big idea for it to be interesting, original stuff, but you also have to get the details right. And that's what graduate students are for.

So, Jeff, thank you, thank you for such a wonderful conversation, for part one of our season finale.

Related Tags
Deep Learning, Neural Networks, Brain Mechanisms, AI Development, Jeff Hinton, Machine Learning, Image Recognition, Natural Language Processing, Artificial Intelligence, Frontier Technology, Academic Discussion