Season 2 Ep 22 Geoff Hinton on revolutionizing artificial intelligence... again
Summary
TLDR: In this in-depth interview, deep learning pioneer Geoffrey Hinton shares his insights on deep learning and neural networks. He discusses the origins of deep learning, its development, and how it became today's most prominent approach to AI. He recounts his own work in the field, including the breakthrough in image recognition known as the "ImageNet moment," which propelled the entire field forward. Hinton also examines the limitations of current AI, particularly in comparison with how the human brain works: he believes that existing deep learning techniques such as backpropagation may be quite different from the brain's own mechanisms. He outlines possible directions for future research, including unsupervised learning, local objective functions, and new learning algorithms that more closely mimic brain function. His work matters both for understanding how the brain works and for pointing the way toward more efficient and more capable AI systems.
Takeaways
- 📈 Over the past decade, AI has made breakthrough advances in many fields, including computer vision, speech recognition, machine translation, robotics, medicine, and computational biology, and these advances directly drive the business of trillion-dollar companies and many new startups.
- 🧠 Deep learning, a subfield of AI, underlies these breakthroughs. Geoffrey Hinton is a pioneer of deep learning whose work has been cited more than half a million times and has profoundly shaped the entire field.
- 🏆 For his contributions to deep learning, Hinton received the equivalent of the Nobel Prize for computer science, and he continues to lead research in the field today.
- 🔬 In 2012 Hinton showed that deep learning is better at image recognition than other approaches. That result, known as the ImageNet moment, changed the research direction of the whole AI field.
- 🤖 Hinton argues that although existing AI systems are very effective in some respects, they differ fundamentally from how the brain works, particularly in their reliance on backpropagation.
- 🌟 Hinton proposes that the brain may learn using many local objective functions, which could be closer to the brain's actual learning mechanism than existing deep learning algorithms.
- 📚 Hinton and his colleagues produced a learning algorithm called SimCLR, which improves performance through self-supervised learning and may resemble mechanisms the brain could use.
- 🔬 Hinton discusses how the brain handles and learns with an enormous number of parameters, and how this contrasts with existing neural networks, particularly in energy efficiency and style of computation.
- 💡 Hinton introduces the concept of "mortal computation": computers that acquire their knowledge by learning rather than by being programmed, and whose knowledge dies with them when the hardware stops working.
- 🧵 Hinton also explores the computational function of sleep, suggesting that sleep may correspond to the negative phase of learning in neural networks.
- 👁️🗨️ Finally, Hinton discusses how visualization techniques such as t-SNE help us understand how neural networks represent data in high-dimensional space and give insight into the inner workings of machine learning models.
Q & A
In which fields has deep learning made breakthroughs?
-Deep learning has produced breakthroughs in computer vision, speech recognition, machine translation, robotics, medicine, computational biology, protein folding prediction, and many other fields.
Why does deep learning drive the business of trillion-dollar companies and many startups?
-Deep learning provides advanced algorithms and techniques that let companies make major advances in image recognition, data processing, automated decision-making, and more, driving innovation and business growth.
What is Geoffrey Hinton's place in the history of AI?
-Geoffrey Hinton is regarded as one of the most important figures in the history of AI; he is a pioneer of deep learning and still leads research in the field today.
What are neural networks, and why should we care?
-A neural network is a computational model inspired by how neurons in the brain work; it learns and processes information by adjusting the weights between neurons. We care about neural networks because of their power in pattern recognition, data processing, and automated decision-making.
What is Hinton's view on our understanding of how the brain works?
-Hinton believes our understanding of the brain is still limited, but he expects major progress within the next five years. He argues that existing AI does not fully mirror how the brain works, especially with respect to the backpropagation algorithm.
Why might backpropagation differ from the brain's learning mechanism?
-Hinton believes the brain probably obtains the gradients for adjusting its parameters in a different way than backpropagation does, a way that may be better at abstracting a lot of structure from relatively little data.
What roles do self-supervised and unsupervised learning play in deep learning?
-Self-supervised and unsupervised learning are key parts of deep learning: they let models learn from unlabeled data, reducing the dependence on large labeled datasets and extracting more structure from the data itself.
Why might the brain learn with many local objective functions?
-Hinton suggests the brain may use many local objective functions, learning from local inconsistencies. This could be more efficient and more flexible than a single end-to-end system.
In deep learning, how can consistency between features extracted from different image patches be handled?
-By designing objective functions so that representations extracted from different patches agree with each other. In a familiar domain, locally extracted features and context-based predictions will generally agree; when they do not, the disagreement serves as a learning signal.
Why might large neural network models (such as GPT-3) not fully understand what they generate?
-Although large models can generate coherent and seemingly logical content, they may only have captured statistical patterns in the training data rather than a deep understanding of its meaning.
How should we think about a model's ability to generalize, especially when facing adversarial examples?
-Adversarial examples reveal that models may rely on superficial cues such as texture rather than deeper semantic understanding. Improving generalization requires studying the model's decision process more closely and may require new algorithms or architectures.
Outlines
😀 The origins and impact of deep learning
This segment introduces deep learning as a subfield of AI that has produced breakthroughs across many domains over the past decade. It highlights the contributions of Geoffrey Hinton, regarded as one of the most important figures in AI history, whose work has been cited more than half a million times. In the 2012 ImageNet competition Hinton demonstrated deep learning's clear advantage in image recognition, an achievement that turned the whole AI field toward deep learning.
🧠 Exploring how the brain works
Hinton discusses how the brain works, in particular how neurons respond to incoming signals by adjusting weights. He raises the question of how the brain adjusts those weights and argues that answering it is the key to understanding the brain. He is optimistic about cracking the brain's mechanism within the next few years, although he suspects that backpropagation, as used in today's deep learning, is quite different from what the brain actually does.
🔄 The efficiency and limits of backpropagation
Hinton corrects the claim that he invented backpropagation; what he and his colleagues actually showed was that backpropagation can learn interesting representations such as word embeddings. He believes backpropagation is more efficient than whatever learning process the brain uses, but perhaps not as good at abstracting a lot of structure from a small amount of data. He introduces unsupervised objective functions as the key form of learning and suggests that the brain may use many local objective functions.
🤖 Self-supervised learning and its resemblance to the brain
Hinton discusses self-supervised learning, in particular extracting representations from small patches of an image and comparing them with contextual predictions based on the representations of other nearby patches. He thinks this approach may be closer to how the brain works and that local disagreement can drive learning. He also mentions the SimCLR paper and his earlier work on self-supervised learning via agreement between representations of different patches of the same image.
🧬 Learning algorithms and the brain
Hinton discusses current methods such as end-to-end learning and backpropagation and asks how learning could extract more information from less data. He proposes using many small local objective functions to increase the bandwidth of learning and discusses how the brain might use such local objectives. He also notes that the brain probably does not use the same function at every locality, as current neural networks do.
🌱 Transferring and sharing knowledge
Hinton proposes transferring knowledge by having local regions teach one another. This differs from weight sharing but offers a more flexible and more biologically plausible alternative. He believes this mutual teaching, a form of distillation, may be how the brain shares knowledge, even though it is less efficient than literal weight sharing.
🚀 From academia to industry
Hinton recounts his move from academia to industry, including an unhappy experience at the University of Toronto, how he looked for other ways to earn money through a Coursera course, and the subsequent auction of his small company. He describes the auction process and how they decided to join Google, where he liked the research environment and the team.
🏆 Academic path and career change
Hinton shares his background, including his studies at Cambridge and his time as a carpenter. He describes his love of carpentry and how realizing his shortcomings as a carpenter pushed him back into academia. He also recounts doing a PhD on neural networks at the University of Edinburgh and his early neural network research.
🤔 The future of deep learning compared with the brain
Hinton reflects on his remark that "deep learning will do everything." He explains what he means by deep learning and how he thinks the brain might use local objective functions. He also discusses how today's computers are designed and how that differs from the way the brain works.
🧵 Manufacturing versus growing computers
Hinton lays out a vision of future computer design in which the immortality of programs is given up in exchange for low-energy computation and cheap manufacturing. He discusses "growing" computers with nanotechnology: machines that acquire all their knowledge by learning and lose that knowledge when they die.
🌟 The scale and understanding of neural networks
Hinton discusses large neural networks such as large language models and their achievements in language and image recognition. He raises the question of whether these models really understand the information they process and compares their capabilities with the visual systems of insects.
💤 The computational function of sleep
Hinton explores the computational function of sleep, suggesting it may correspond to the negative phase of learning. During sleep the brain may perform a kind of "unlearning" that helps consolidate memories and avoid overfitting.
📉 Students surpassing their teachers
Hinton describes how a neural network (the student) can outperform its training data (the teacher) even when the labels are wrong. With a large amount of noisily labeled data, a network can learn to produce outputs that are more accurate than the labels it was trained on.
📈 The invention of t-SNE and high-dimensional data visualization
Hinton recounts how he invented t-SNE (t-distributed stochastic neighbor embedding), a technique for visualizing high-dimensional data. He explains how t-SNE improved on the earlier SNE algorithm by using a mixture model and how distinguishing different scales lets it display both the coarse structure and the fine detail of the data at the same time.
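The outline above only describes t-SNE at a high level; here is a minimal usage sketch (assuming scikit-learn and NumPy are available, with purely synthetic data standing in for real high-dimensional features):

```python
# Minimal t-SNE sketch: project synthetic high-dimensional vectors down to 2-D
# for visualization. All data and parameter choices here are illustrative only.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 128))   # stand-in for learned high-dimensional representations

embedding = TSNE(n_components=2, perplexity=30.0, init="pca",
                 random_state=0).fit_transform(features)
print(embedding.shape)                   # (500, 2): points ready to scatter-plot
```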
Keywords
💡Deep learning
💡Neural networks
💡Backpropagation
💡Self-supervised learning
💡Gradient descent
💡ImageNet moment
💡Local objective functions
💡Energy functions
💡Dropout
💡Sparse coding
💡Boltzmann machines
Highlights
Over the past 10 years, AI has made breakthroughs in computer vision, speech recognition, machine translation, robotics, medicine, computational biology, and many other fields.
Deep learning, a subfield of AI, underlies all of these breakthroughs.
Geoffrey Hinton is considered one of the most important figures in AI history and has profoundly shaped the origins and development of deep learning.
Hinton's work has been cited more than half a million times, meaning a vast body of research builds on his results.
In 2012, Hinton showed that deep learning beats other approaches to image recognition; that moment, known as the ImageNet moment, transformed the AI field.
Hinton believes that although existing AI is mostly built on backpropagation, the brain probably uses a different mechanism to adjust the weights of its neural networks.
Hinton advocates unsupervised objective functions: learning a good model of the world by observing it and then acting on that model rather than on the raw data.
Hinton believes the brain may use many local objective functions rather than a single end-to-end system.
Hinton proposes a possible objective function based on representations of local image patches: learning by comparing locally extracted features with predictions made from context.
Hinton discusses the importance of self-supervised learning and its similarity to how the brain learns.
Hinton points out that neurons in the brain communicate with spikes, which is markedly different from how current artificial neural networks operate.
Hinton argues that to better model the brain, future learning algorithms may need to give up the "immortality" of programs and instead build "mortal computers" that acquire their knowledge by learning.
Hinton describes the "student surpassing the teacher" phenomenon: deep learning models can learn from datasets with wrong labels and still end up more accurate than their training data.
Hinton discusses the computational function of sleep, suggesting it may relate to negative-phase learning or memory consolidation.
Hinton describes t-SNE (t-distributed stochastic neighbor embedding), a technique for visualizing high-dimensional data that reveals similarities between data points.
Hinton emphasizes the importance of understanding what a model has learned and discusses the role of visualization techniques in understanding the learning process.
Transcripts
over the past 10 years ai has
experienced breakthrough after
breakthrough after breakthrough in
computer vision in speech recognition in
machine translation in robotics in
medicine in computational biology
protein folding prediction and the list
goes on and on and on and the
breakthroughs aren't showing any signs
of stopping not to mention these ai
breakthroughs are directly driving the
business of trillion dollar companies
and many many new startups
underneath all of these breakthroughs is
one single subfield of ai
deep learning
so
when and where did deep learning
originate
and when did it become the most
prominent ai
approach today's guest has everything to
do with this
today's guest is arguably the single
most important person in ai history and
continues to lead the charge today
awarded the equivalent of the nobel prize
for computer science
today's guest has their work cited over
half a million times
that means there is half a million and
counting other research papers out there
that build on top of his work
today's guest has worked on deep
learning for about half a century
and most of the time
in relative obscurity
but that all changed in 2012
when he showed deep learning is better
at image recognition than any other
approaches to computer vision and by a
very large margin
that result that moment
known as the imagenet moment
changed the whole ai field
pretty much everyone dropped what they
had been doing and switched to deep
learning
former students of today's guest include
volodymyr mnih who put deepmind on the map
with their first major result on
learning to play atari games
and includes our season one finale guest
ilya sutskever co-founder and research
director of openai
in fact every single guest in our
podcast has built on top of the work
done by today's guest
i am of course
talking about no one less than jeff
hinton
geoff welcome to the show so happy to
have you here
well thank you very much for inviting me
well so glad to get to talk with you on
the show here and i'd say let's dive
right in with maybe the you know the
highest level question
i can ask you um
what are neural nets and why should we
care
okay if you already know a lot about
neural nets please forgive the
simplifications
um
here's how your brain works
it has lots of little processing
elements called neurons
and every so often a neuron goes ping
and what makes it go ping is that it's
hearing pings from other neurons and
each time it hears a ping from another
neuron
it adds a little weight to some store of
um input that it's got and when it gets
it when it's got enough input it goes
ping
and so if you want to know how the brain
works all you need to know
is how the neurons decide to adjust
those weights
that they add when a ping arrives um
that's all you need to know this there's
got to be some procedure used for
adjusting those weights and if we could
figure it out we'd know how the brain
works and that's been your quest for a
long time now figuring out how the brain
might work and
what's the status do you do we as a
field understand how the brain works
okay i always think we're going to crack
it in the next five years since that's
quite a productive thing to think
um but i actually do and i think we're
going to crack it in the next five years
um i think we're getting closer
i'm fairly confident now that it's not
back propagation
so all of existing ai i think
is built on something that's quite
different from what the brain's doing
at a high level it's got to be the same
that is you have a lot of parameters
these weights between neurons
and you adjust those parameters
um on the basis of lots of training
examples
and that causes wonderful things to
happen if you have billions of
parameters
the brain's like that and deep learning
is like that
the question is
how do you
um get the gradient for adjusting those
parameters so what you want is some
some measure of how well you're doing
and then you want to adjust the
parameters so they improve that measure
of how well you're doing
um
but my belief currently is that
back propagation which is the way deep
learning works at present
is quite different from what the brain's
doing the brain's getting gradients in a
different way now that's interesting
you're the one saying that jeff because
you actually
you
wrote a paper on back propagation for
training neural networks and it's
powering everything everybody's doing
today
and now here you are saying actually
it's time probably time for us to figure
out how do you think we should change it
closer to what the brain is doing or do
you think maybe
back propagation could be better than
what the brain is doing
let me first correct you
um yes we did write the most cited paper
on back propagation on
rumelhart and williams and me um
back propagation was already known um
to a number of different authors what we
really did was showed that it could
learn interesting representations so it
wasn't that we invented back propagation
we
rumelhart reinvented back propagation
we showed that it could learn
interesting representations like for
example word embeddings so
i think back propagation is probably
much more efficient
than what we have in the brain that's
squeezing a lot of information into a
few connections
whereby a few connections i mean only a
few billion
so the problem the brain has is that
connections are very cheap
um we've got
hundreds of trillions of them
um
experience is very expensive
and so we are willing to throw lots and
lots of parameters
at a small amount of experience
whereas the neural nets we're using are
basically the other way around
they have lots and lots of experience
and they're trying to get the
information
about what relates the input to the
output into the parameters
and i think back propagation is much
more efficient than what brain's using
at doing that
but maybe not as good
at from not much data
abstracting a lot of structure
and well this begs the question of
course do you have any hypothesis on
approaches
that might get
better performance in that regard i have
a sort of general view which i've had
for a long long time
which is that we need unsupervised
objective functions so i'm talking
mainly about perceptual learning
um
which i think is the sort of key if you
can learn a good model of the world by
looking at it
um
then you can base your actions on that
model rather than on the raw data
and that's going to make
doing the right things much easier
i'm convinced that the brain
is using lots of little local objective
functions
so rather than being a kind of
end-to-end system chain trained to
optimize one objective function
i think it's using lots of little local
ones um so as an example
the kind of thing i think will make a
good objective function though it's hard
to make it work
is
if you look at a small patch of an image
and try and extract some representation
of what you think is there
you can now compare the representation
you got from that small patch of the
image
with a contextual bet that was got by
taking the representations of other
nearby patches and based on those
predicting what that patch of the image
should have in it
and obviously
um
once you're very familiar with the
domain
those predictions from context and
locally extracted features will agree
generally agree and you'll be very
surprised when they don't and you can
learn an awful lot on one trial if they
disagree radically
so that's an example of where i think
the brain could learn a lot from this
local disagreement
um
it's hard to get that to work but i'm
convinced something like that is going
to be the objective function and if you
think of a big image
and lots of little local patches in the
image
that means you get lots and lots of
feedback in terms of the agreement of
what was extracted locally and what was
predicted contextually
um
all over the image
and at many different levels of
representation
and so we can get a much much richer
feedback
from these agreements with contextual
predictions but making all that work is
difficult
but i think it's going to be along those
lines
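A toy NumPy sketch of the local-agreement objective described above; every name, shape, and the single linear "extractor" are illustrative assumptions for exposition, not Hinton's actual proposal:

```python
# Toy "local agreement" objective: each patch gets a locally extracted representation,
# a contextual prediction is formed from the other patches, and the squared disagreement
# between the two serves as a purely local loss signal. All shapes are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
patches = rng.normal(size=(9, 16))        # 9 image patches, 16 raw features each
W_local = rng.normal(size=(16, 8)) * 0.1  # local extractor (a single linear layer here)
W_ctx = rng.normal(size=(8, 8)) * 0.1     # maps the mean neighbor representation to a prediction

reps = np.tanh(patches @ W_local)         # locally extracted representations, shape (9, 8)
losses = []
for i in range(len(reps)):
    context = np.delete(reps, i, axis=0).mean(axis=0)   # crude "nearby patches" context
    prediction = np.tanh(context @ W_ctx)               # contextual bet for patch i
    losses.append(np.sum((prediction - reps[i]) ** 2))  # local disagreement = learning signal
print("mean local disagreement:", np.mean(losses))
```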
now
what you're describing strikes me as
part of what people are trying to do in
self-supervised and unsupervised
learning and in fact you wrote one of
the breakthrough papers the simclr
paper with with a couple of
collaborators of course
in this space what do you think about
the simclr work and contrastive
learning more generally
and what do you think about the recent
masked autoencoders and how does that
relate to what you just described it
relates quite closely to what i've it's
evidence that that kind of objective
function is good
um
i didn't write the simclr paper um
ting chen wrote the simclr paper um
with help from the other co-authors
my name was on the paper for general
inspiration but
i did write a paper a long time ago with
sue becker
on the idea of getting agreement
between representations you got from two
different patches of the image
so that was i think of that as the
origin of this idea
of doing self-supervised learning by
having agreement between representations
from two patches of the same image
um
the method that sue and i used
didn't work very well
because of a subtle thing that we didn't
understand at the time but i now do
understand
um
and i could explain that if you like but
i'll lose most of the audience
well i'm curious i think it'd be great
great great to hear it but maybe we can
zoom out for a moment before zooming
back in you talk about
current methods use end-to-end learning
back propagation
to power the end-to-end learning
and you're saying switch to learn from
less data and extract more from less
data is going to be key as as as a way
to make progress to get closer to how
the brain learns
um yes so you get much bigger bandwidth
for learning by having many many little
local objective functions
and when when we look at these local
objective functions like filling in a
blanked out part of an image or maybe
filling back in a word
if we look at today's technologies in
fact this is the current frontier you've
contributed a lot of people are working
exactly on on that problem of
learning from unlabeled data effectively
because it requires a lot less
human labor but they still use back
propagation
the same mechanism so what i do
what i don't like about the masked auto
encoder is
you have
your input patches
and then you go through many layers of
representation
and at the output of the net you try to
reconstruct the missing input patches
i think
the brain you have these levels of
representation but at each
level you're trying to reconstruct
what's at the level below
um so it's not like you go through this
many many layers and then come back out
again
um it's that you have all these levels
each of which is trying to reconstruct
what's at the level below
um
so i think that's much more brainlike
and the question is can you do that
without using back propagation obviously
if you go through many many levels and
then reconstruct the missing patches of
the output you need to get information
back through all those levels
and since we have back propagation it's
built into all the simulators you might
as well do it that way but i don't think
that's how the brain's doing it
and now imagine the brain is doing with
all these local objectives
do you think for for our engineered
systems
will it matter whether
in some sense there are three choices to
make it seems one choice is
what are the objectives what are those
local objectives that we want to
optimize
a second choice is
what's the algorithm
to use
to optimize it and then a third choice
is
what's the architecture of how do we
wire the neurons together that are
doing this this learning
and
among those three it seems like all
three
could be the missing piece that we're
not getting right or what do you think i
if you're interested in perceptual
learning
i think it's fairly clear you want
retinotopic maps a hierarchy of
retinotopic maps
so the architecture is local
connectivity
um
and the point about that is
you can solve a lot of the credit
assignment problem by just assuming that
something in one locality in a
retinotopic map is going to be
determined
by the corresponding locality in the
retinotopic map that feeds into it
so
you're not trying to low down in the
system
um figure out how
pixels
determine what's going on a long
distance away in the image
you're going to just use local
interactions and that gives you a lot of
locality
um
and you'd be crazy not to lose not to
use that
one thing neural nets do at present is
they assume you're going to be using the
same functions at every locality so
convolutional nets do that and
transformers do that too
um
i don't think the brain can do that
because that would involve weight
sharing
and it would involve doing exactly the
same computation at each locality so you
can use the same weights
i think it's most unlikely the brain
does that
but actually
there's a way to achieve what weight
sharing does what convolutional
nets do in the brain in a much more
plausible way than i think people have
suggested before
which is
if you do have
contextual predictions trying to agree
with locally extracted things
then imagine a whole bunch of columns
that are making local predictions and
looking at nearby columns to get their
contextual prediction
you can think of the context as a
teacher for the local thing
but also vice versa
but think of the context as a teacher
for what you're extracting locally
so you can think of the information
that's in the context as being distilled
into the local extractor but that's true
for all the local extractors
so what you've got is mutual
distillation
where they're all providing teaching
signals for each other
and what that means is knowledge about
what you should extract in one location
is getting transferred
into other locations
if they're trying to agree if you're
trying to get different locations to
agree on something if for example you
find a nose and you find a mouth and you
want them both to agree that they're
part of the same face
so they should both give rise to the
same representation
then the fact that you're trying to get
the same representation at different
locations allows knowledge to be
distilled from one location to another
and there's a big advantage of that over
actual weight sharing
obviously biologically one advantage is
that the detailed architecture in these
different locations doesn't need to be
identical
but the other advantage is the front-end
processing doesn't need to be the same
so if you take your retina
different parts of the retina have
different size receptive fields
and convolutional nets try to ignore
that they sometimes have multiple
different resolutions and do convolution
at each resolution but they just can't
deal with different front-end processing
whereas if you're distilling knowledge
from one location to another
what you're trying to do is get the same
function
from the optic array
to the representation in these different
locations
and
it's fine if you pre-process the optic
array differently in the two different
locations
you can still distill the knowledge
across the function from the optic array
to the representation even though the
frontend processing is different
and so
although distillation is less efficient
than actually sharing the weights it's
much more flexible and it's much more
neurally plausible
so for me that was a kind of big insight
i had about a year ago that
we have to have something like weight
sharing to be efficient
but local distillation will work if
you're trying to get neighboring things
to agree on a representation
but that idea of trying to get them to
agree gives you the signal you need for
knowledge in one location to supervise
knowledge in another location
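A minimal NumPy sketch of the mutual-distillation idea just described; the two "locations", their weights, and the simple regression loss are hypothetical toy choices, not a claim about the brain's actual mechanism:

```python
# Two columns with *different* front-end weights each extract a representation of the
# same object; each treats the other's (frozen) representation as a soft teaching signal,
# so knowledge transfers between locations without any weight sharing.
import numpy as np

rng = np.random.default_rng(1)
x_a = rng.normal(size=16)                 # input seen at location A (e.g. a nose patch)
x_b = rng.normal(size=24)                 # input seen at location B, a different front end
W_a = rng.normal(size=(16, 8)) * 0.1      # location A's own weights
W_b = rng.normal(size=(24, 8)) * 0.1      # location B's own weights (different shape!)

lr = 0.1
for step in range(100):
    rep_a, rep_b = np.tanh(x_a @ W_a), np.tanh(x_b @ W_b)
    # Each column nudges its weights toward the other column's representation.
    grad_a = np.outer(x_a, (rep_a - rep_b) * (1 - rep_a ** 2))
    grad_b = np.outer(x_b, (rep_b - rep_a) * (1 - rep_b ** 2))
    W_a -= lr * grad_a
    W_b -= lr * grad_b
print("final disagreement:", np.sum((np.tanh(x_a @ W_a) - np.tanh(x_b @ W_b)) ** 2))
```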
and jeff do you think so what you're
describing one way to think of it is to
say hey we're cheering
is clever because it's something the
brain kind of does too it just does it
differently so we should continue to do
weight sharing another way to think of
it is that actually we shouldn't
continue to do weight sharing because
the brain does it somewhat differently
and it might be be a reason to do it
differently
what's your thinking i think the brain
doesn't do weight sharing because it's
hard for it to
ship connection strengths about the place
it's very easy if they're all sitting in
ram
so i think we should continue to do
convolutional things in convnets and in
transformers we should share weights
um we should share knowledge by sharing
weights
but just bear in mind that the brain is
going to share knowledge not by sharing
weights but by sharing the function from
input to output
and using distillation to transfer
knowledge
now there's the other topic that is
talked about quite a bit where the brain
is drastically different
from our current neural nets and is the
fact that
neurons are work with spiking signals
and that's very different from our
artificial neurons in our gpus
and so i'm very curious on your thinking
on that is is that just an engineering
difference or do you think there could
be
more to it that we need to understand
better and benefits to spiking
i think it's not just an engineering
difference i think once we understand
why that hardware is so good
why you can do so much in such an energy
efficient way with that kind of hardware
thing
we'll see that um it's sensible for the
brain to use spiking neurons the retina for
example doesn't use spiking neurons the
retina does lots of processing with
non-spiking neurons
so
once we understand why cortex is using
those
um
we'll see that it was the right thing
for biology to do and i think that's
going to hinge on what the learning
algorithm is how you get gradients
for networks of spiking neurons and at
present nobody really knows the present
what people do is say
you see the problem with the spiking
neuron is
there's two diff quite different kinds
of decision
one is exactly when does it spike and
the other is does it or doesn't it spike
so there's this discrete decision should
the neuron spike or not and then this
continuous variable of exactly when it
should spike
and people trying to optimize a system
like that have come up with various kind
of surrogate functions which sort of
smooth things a bit so you can get
continuous functions they didn't seem
quite right
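A minimal NumPy sketch of the "surrogate function" trick mentioned here (a common illustrative approach, not any one published method): the hard spike threshold is kept in the forward pass, but its gradient is replaced by a smooth stand-in.

```python
# Surrogate-gradient idea: keep the discrete spike decision, but pretend its derivative
# is that of a steep sigmoid centred on the threshold so gradients can flow.
import numpy as np

def spike_forward(membrane_potential, threshold=1.0):
    return (membrane_potential >= threshold).astype(float)   # hard 0/1 spike decision

def spike_surrogate_grad(membrane_potential, threshold=1.0, beta=5.0):
    # Derivative of a sigmoid stands in for the true (zero almost everywhere) gradient.
    s = 1.0 / (1.0 + np.exp(-beta * (membrane_potential - threshold)))
    return beta * s * (1.0 - s)

v = np.array([0.2, 0.9, 1.0, 1.4])
print(spike_forward(v))          # [0. 0. 1. 1.]
print(spike_surrogate_grad(v))   # smooth, non-zero near the threshold
```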
um
it'd be really nice to have a learning
algorithm and in fact
in nips in about 2000 andy brown and i
had a paper on trying to learn spiking
boltzmann machines um
but it'd be really nice to
get a learning algorithm that's good for
spiking neurons and i think that's the
main thing that's holding up spiking
neuron hardware so people like steve
furber in manchester have realized that
many other people have realized that um
you can make more energy efficient
hardware this way and they built great
big systems what they don't have is a
good learning algorithm for it and i think
until we've got a good learning
algorithm for it we won't really be able
to exploit what we can do with spiking
neurons and there's one obvious thing
you can do with them that isn't easy in
conventional neural nets
and that's
agreement
so if you take a standard artificial
neuron
then you simply ask the question can it
tell if its two inputs have the same
value
well it can't it's not an easy thing for
a standard neuron to do the standard
artificial one
um if you use spiking neurons it's very
easy to build a system where if the two
spikes arrive at the same time they'll
make the neuron fire and if they arrive
at different times they won't
so using the time of a spike
seems like a very good way of measuring
agreement we know the biological system
does that
so
you can see the direction a sound is
coming from or rather hear the direction
a sound is coming from by the time delay
in the signals reaching the two
ears and
if you take a foot
that's about a nanosecond for light
and it's about a millisecond first sound
and the point is if i move something
sideways in front of you by a few inches
the difference
in
the time delay to the two ears the
length of the path to the two ears
is only a small fraction of an inch
and so
it's only a small fraction of a
millisecond difference
in the time the signal gets to the two
ears and we can deal with that and owls
can deal with it even better um
and so we're measuring we're sensitive
to times of like um 30 milliseconds 30
microseconds
in order to get stereo from sound
um i can't remember what owls are sensitive
to but it's i think it's a lot better
than 30 microseconds
and we do that by having
um two axons with spikes traveling in
different directions one from one and
one from the other ear and then you have
cells that fire if the spikes get at the
same time
that's the simplification but roughly
that um
so we know that spike timing can be used
for exquisitely sensitive things like
that
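A toy Python sketch of the coincidence-detection idea just described (the window size and spike times are illustrative assumptions): a unit "fires" only when spikes from the two ears arrive within a very small time window, which is how precise spike timing can signal agreement.

```python
# Coincidence detector: fire iff the two spikes arrive within `window_ms` of each other
# (0.05 ms = 50 microseconds here, roughly the sensitivity scale mentioned above).
def coincidence_detector(spike_time_left_ms, spike_time_right_ms, window_ms=0.05):
    return abs(spike_time_left_ms - spike_time_right_ms) <= window_ms

# A sound slightly to the left reaches the left ear ~0.03 ms earlier than the right ear.
print(coincidence_detector(10.00, 10.03))   # True: near-coincident, the detector fires
print(coincidence_detector(10.00, 10.60))   # False: too far apart, the detector stays silent
```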
and it would sort of be very surprising
if the precise times of spikes weren't
being used but we really don't know how
and for a long time i thought
it'd be really nice if you could use
spike times to detect agreement for
things like self-supervised learning
or for things like
um
if i've extracted your mouth and i've
extracted your nose or other
representations of them
and
i
from your mouth i can now predict
something about your whole face and from
your nose i can predict something about
your whole face
and if your mouth and nose are in the
right relationship to make a face those
predictions will agree
and it'd be really nice to use spike
timing to see that those predictions
agree
um
but it's hard to make that work and one
of the reasons it's hard to make that
work is because we don't know we don't
have a good algorithm for training
networks of spiking neurons so that's
one of the things
i'm focused on now how can we get a good
training out of the networks of spiking
yours and i think that'll have a big
impact on hardware
that's a really interesting question
you're putting forward there because i
doubt too many people are working on
that compared to let's say the number of
people working on
large language models or
other problems that are much more i
guess
visible in terms of progress recently
um
i think
yeah it's always a good idea to figure
out what huge numbers of very smart
people are working on and to work on
something else
yeah i think the challenge of course for
most people
i'd say including myself but i
definitely hear the question from many
students too is that
it's easy to work on something else than
everybody else but it's hard to make
sure that something else is actually
relevant
because there's many other things
out there that are not not very relevant
you could possibly spend time on yeah
that involves having good intuitions
yeah
listening to you for example could help
um so i've actually a follow-up question
something you just said jeff which is um
that
the retina
doesn't use all spiking neurons are you
saying that the brain has two types of
neurons some that are more like our
artificial neurons and some that are
spiking neurons
i'm not sure the retina is more like
artificial neurons but um
certainly the cortex has
the neocortex has spiking neurons
um and that's its primary mode of
communication is by sending spikes
from one pyramidal cell to another
pyramidal cell
um and i don't think we're gonna
understand the brain until we understand
why it chooses to send spikes
for a while i thought i had a good
argument that didn't involve
the precise time to spikes and the
argument went like this
the brains in the regime where it's got
lots and lots of parameters and not much
data
relative to
the typical neural nets we use
and
there's a potential overfitting in that
regime unless you use very strong
regularization
and a good regularization technique is
dropout where each time you use a neural
net you ignore a whole bunch of the
units
and so maybe
the fact that the neurons are sending
spikes
what they're really communicating
is the underlying poisson rate
so let's assume it's poisson it's
close enough for this argument um
there's a poisson process which
sends spikes stochastically
but the rate of that process varies and
that's determined by the input to the
neuron
and you might think you'd like to send
the real valued rate from one neuron to
another
um
but if you want to do lots and lots of
regularization
you could send the real valued rate with
some noise added
and one way to add noise is to just use
spikes that'll add lots of noise
and so
this was the motivation for dropout that
the
most of the times most of the neurons
aren't involved in things if you look at
any fine time window
um
and you can think of
spikes
as a representational underlying
personal rate it's just a very very
noisy representation
which sounds like a very very bad idea
because it's very very noisy but
actually once you understand about
regularization we have too many
parameters it's a very very good idea
so
i still have a
lingering fondness for the idea but
actually we're not using spike timing at
all um it's just about
using very noisy representations of
poisson rates to be a good regularizer
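A small NumPy sketch of that idea (purely illustrative assumptions, not a claim about cortex): instead of transmitting a real-valued firing rate, a unit transmits a noisy spike count drawn from a Poisson process with that rate, which injects noise much like dropout-style regularization while still carrying the rate on average.

```python
# Spikes as a very noisy representation of an underlying Poisson rate.
import numpy as np

rng = np.random.default_rng(0)
rates = np.array([0.1, 0.5, 2.0, 5.0])                # underlying real-valued firing rates
window = 0.2                                           # a short time window (arbitrary units)
spikes = rng.poisson(rates * window, size=(1000, 4))   # very noisy samples of those rates
print("true rates      :", rates)
print("recovered rates :", spikes.mean(axis=0) / window)  # averaging many windows recovers the rate
```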
and i sort of flip between i think it's
very important when you do science
not to totally commit to one idea and
ignore all the evidence for other ideas
but if you do that you end up
um
flipping between ideas every few years
so
some years i think neural nets are
deterministic i mean we should have
deterministic neural nets and that's
what backprop is using
another years i think it's about a
five-year cycle i think no no it's very
important the best stochastic
um and that
changes the play for everything so
boltzmann machines were intrinsically
stochastic and that was very important
to them
um
but the main thing is not to fully
commit to either of those but to be open
to both
now one thing if we think more about
what you just said the importance of
spiking neurons and figuring out how to
train a spiking neuron network
effectively
what if we for now just say let's not
worry about the training part
given that
seemingly it's
far more power efficient
um wouldn't people want to distribute
pure inference chips that are you you
pre-trained effectively separately and
then you compile it onto a spiking
neuron
chip to have very low power inference
capabilities what about that
yeah so lots of people have thought of
that and um it's a very sensible idea
and it's probably on the evolutionary
path to getting to use spiking neural
nets
because once you're using them for
inference
um
and it works and it's all people are
already doing that and it's already
working being shown to be more power
efficient
and various companies have produced
these big spiking systems um
once you're doing them for inference
anyway you'll get more and more
interested in
how you could learn in a way that makes
more use of the available power in these
spike times
so
you can imagine a system where you learn
using backprop
um
but not on
not on analog hardware for example not
on the this low energy hardware
um
and then you transfer it to the lower
energy hardware and that's fine
um
but we'd really like to learn directly
in the hardware
now one thing that really strikes me
jeff is when i think about
your talks back around
2005 6 7 8 when i was a phd student
essentially pre-alexnet talks
those talks i think topically have a lot
of resemblance to what you're excited
about now
and it almost feels like alexnet is an
outlier in in your path
um how did you go from thinking so
closely how the brain might work
to deciding that you know maybe you can
first explain what was alex net and but
also how did it come about and what was
that path to go from working on
restricted boltzmann machines trying to
see how the brain works too
i would say that
the more traditional approach to neural
nets that you all of a sudden show it
can actually work
well um if you're an academic you have
to raise grant money
and
it's convenient to have things that
actually work
even if they don't work the way you're
interested in
um so part of it's that just go with the
flow and
um
if you can make back prop work well
and back then in about 2006 2005 i got
fascinated by the idea you could use
stacks of restricted boltzmann machines
to pre-train feature detectors and then
it would be much easier to get backprop
to work
it turned out with enough data
which is what you had in speech
recognition
um and later on because of fei-fei li and
her team
in image recognition with enough data
you don't need the pre-training although
pre-training is coming back i mean gpt-3
has pre-training
and pre-training is a thoroughly good
idea
um
but
once we
discovered that you can pre-train and
that will make backprop work better and
that did great things for speech
which george dahl and abdel-rahman
mohamed did um
in 2009
then alex
who was a graduate student in my group
thing
um
started uh
applying the same ideas to
vision
um
and pretty soon we discovered that you
didn't actually need this pre-training
especially if you had the imagenet
data
and in fact that project
um
was partly due to ilya's persistence so
i remember ilya coming into the lab one
day and saying look we now that we've
got speech recognition working this
stuff really works we've got to do
imagenet before anybody else does
and retrospectively learned that yann
lecun was going into the lab and saying
look we've got to do imagenet with
convnets before anybody else does
and
yann's students and postdocs said oh
but i'm busy doing something else so
well
he he couldn't actually get someone to
commit to it
and
yeah initially couldn't get people to
commit to it
and so he persuaded alex to commit to it
by pre-processing the data for him so he
didn't have to pre-process the data the
data was all pre-processed to be just
what he needed
and then alex really went to town and
alex is just a superb programmer
and it was
alex was able to make a couple of gpus
really sing he made them work together
in his bedroom at home
um i don't think his parents realized
that they were paying most of the cost
because that was the electricity um
but
he did a superb job of programming
convolutional nets on them
um
so
he said we've got to do this
and helped alex with the design and so
on alex did the really intricate
programming
and i provided support um and a few
ideas like using dropout
i also did some good management i'm not
often very good at management but i'm
very proud of the management idea which
is alex krizhevsky had to write a depth
oral to show that he was sort of capable
of understanding research literature
which is what you have to do after a
couple of years to stay in the phd
program
and he doesn't really like writing um
and he didn't really want to do the
depth oral but it was way past the
deadline and the department was
hassling us
so i said to him
um each
each time you can improve the
performance by one percent on imagenet
um
you can delay your depth oral by
another week
and alex delayed his depth oral by a
whole lot of weeks
yeah and just for context for i mean a
lot of researchers know this of course
but maybe not everybody alex's result
with you and ilya
cut the error rate in half compared to
prior work on the imagenet image
recognition competition which was just
more or less i
i used to be a professor so it wasn't
quite in half but close it cut it from about
a whole available well that's why
everybody switched from what they were
doing which was hand engineered
approaches to computer vision try to
program directly
how can a computer understand what's an
image to to deep learning i should say
one thing that's important to say here
um
yann lecun spent many years um developing
convolutional neural nets
um
and it really should have been him
his lab that developed that system we
had a few little extra tricks but they
weren't the important thing the
important thing was to apply
convolutional nets
using gpus to a big data set
um
so yann was kind of unlucky in that um he
didn't get the win on that
but it was using many of the techniques
that he developed it didn't have the the
russian immigrants that uh toronto and
you had been able to attract to make it
happen
well one's russian and one's ukrainian and
it's important not to confuse those even
though he's a russian-speaking
ukrainian don't confuse russians
absolutely
it's a it's a different country
so
now jeff that moment actually also
marked a big change in your career
because as far as i understand you've
never been involved in
corporate
work
but
it marked a transition for you soon
thereafter from being a pure academic to
being ending up at google actually
uh can you see a bit about that how was
that for you like
did you have any internal resistance i
can say why that transition happened
what triggered i'm curious
so
um i have a learning disabled son who
needs
um
future provisions so i needed to get a
lump of money and i thought one way i
might get a lump of money was by
teaching a coursera course
and so i did a coursera course on neural
networks in
2012. and it was one of the early
coursera courses so their software
wasn't very good so it's extremely
irritating to do
um
it really was very irritating then
i'm not very good on software so i
didn't like that
and
from my point of view it amounted to
you agree to supply a chapter of a
textbook one chapter every week
um
so
you had to give them these videos and
then a whole bunch of people are going
to watch the videos
like sometimes the next day yoshua
bengio would say why did you say that
um
so you know that it's going to be people
who know very little but also people
know a whole lot
and so it's stressful you know that if
you make mistakes they're going to be
caught not like a normal lecture where
you can just sort of
press on the sustaining pedal and sort
of
blow your way through it if you get some
slightly confused about something um
here you have to get it straight
and
the deal with the university of toronto
originally was that
um
if any money was made from these courses
which i was hoping there would be
um
the money that came to the university
would be split with the professor they
didn't specify exactly what the split
would be
but one assumed it would be like 50 50
or something like that
and i was okay with that
the university didn't provide any
support in preparing the videos
and
then after i started the course and when
i could no longer back out of it
the provost made a unilateral decision
without consulting
me or anybody else um that actually if
money came from coursera the university
would take all the money and the
professor would get zero
which is exactly the opposite of what
happens with textbooks
and the process was very like writing
textbooks
i actually asked the university to help
me prepare the videos
and the av people came back to me and
said do you have any idea how expensive
it is to make videos
and i actually did have an idea because
i've been doing this
so i got really pissed off with my
university because they unilaterally
sort of canceled the idea i'd get any
remuneration for this they said it was
part of my teaching well actually it
wasn't part of my teaching it was
clearly based on lectures i given as
part of my teaching but i was doing my
teaching as well as that and that i
wasn't using that course for my teaching
and that got me pissed off enough
that i was willing to consider
alternatives to being a professor
um and at that time
we then
suddenly got interest from all sorts of
companies
in the
in recruiting us
um either in funding giving big grants
or in funding a startup
it was clear that a number of big
companies were just very interested in
getting in on the act
and so
normally i would have just said no i'm i
get paid
by the state we're doing research
um i don't want to try and make extra
money from my research i'd rather get on
with the research
but because that particular experience
with
the university
um cheating me out of the money no it
turned out they didn't cheat me out of
anything because
uh no money came from coursera anyway
um
but that pushed me over the edge into
thinking well
okay i'm going to find some other way to
make some money that was the end of my
principles oh no
well
but the result is that these companies
are and in fact if you read the the
genius makers book by cade metz which i
reread last week in preparation for this
conversation um if you read the book it
starts off with actually you running an
auction for these companies to try to
acquire your company which is quite the
start of a book
um very intriguing but how is it for you
oh when it was happening it was at nips
and terry had organized nips in a casino
um
at lake tahoe
um and so in the basement of the hotel
there were these smoke-filled rooms full
of people pulling one-armed bandits and
big lights flashing saying you won 25
000 and all that stuff and people
gambling um in other ways
and upstairs we were running this
auction
and
we felt like we were in a movie we felt
like this was like being in
that movie the social network it sort of
felt like that it was great
the reason we did it was we had
absolutely no idea how much we were
worth
and i consulted
a lawyer who an ip lawyer who said
there's two ways to go about this
you could hire a professional negotiator
um
in that case um
you'll end up working for a company but
they'll be pissed off with you
or you could just run an auction
um
as far as i know this was the first time
a small group like that just ran an
auction we ran it on gmail
i've worked at google over the summer so
i knew enough about google to know that
they wouldn't read our gmail
um
and
i'm still pretty confident they didn't
read our gmail
microsoft wasn't so confident
and we just ran this auction where
people had to gmail me their bids
and we then immediately mailed them out
to everybody else with the timestamp of
the gmail
and
um
it just kept going up by a half million
dollars and it was
half a million dollars to begin with and
then a million dollars after that
um
and
yeah it was pretty exciting
and we discovered we were worth a lot
more than we thought
retrospectively
we could probably have got more but we
we got to an amount that we thought was
astronomical
and then
basically
we wanted to work for google so we
stopped the auction so we could be sure
of working for them and as i understand
it you're still at google today
i'm still at google today
i'm nine years later i'm in my tenth
year there
i think i'll get some kind of award when
i've been there for 10 years because it's so
rare um
although people tend to stay in google
longer than other companies
yeah i like it there the
the main reason i like it is because
the brain team's a very nice team
and i get along very well with jeff dean
um
he's kind of
very smart but very straightforward
to deal with
and what he wants me to do is do
what i want to do which is basic
research um he thinks what i should be
doing is trying to come up with
radically new algorithms and that's what
i want to do anyway
so it's just a
very nice fit i'm no good at managing a
big team to improve speech recognition
by one percent i'd be happy
well it's better to just revolutionize
the field again right yeah
i would like to do it one more time but
i'm looking forward to it i wouldn't be
surprised at all
now when when i look at your career
and some of this information actually
comes from the book as i didn't notice
before
i had read the book the first time
i mean
you are
you were a computer science professor at
the university of toronto emeritus now i
believe but computer science but you
never got a computer science
degree you got a psychology degree
and
you actually at some point were a
carpenter
how does it come about how do you go
from yes studying psychology to becoming
a carpenter to getting into ai what's
the path for you there how do you look
at that
in my last year at cambridge
i had a very difficult time and got very
unhappy and i dropped out
just after the exams i dropped out
and
became a carpenter
and
i'd always enjoyed carpentry more than
anything else
so at high school
um to be sort of all the classes and
then you could stay in the evenings and
do carpentry that's what i really look
forward to
and
so i became a carpenter and then after
i've been a carpenter for about six
months you couldn't actually make a
living as a carpenter um so i was a
carpenter and decorator and i made the
money during decorating but i had the
fun doing carpentry
um
and
the point is carpentry is more work than
it looks and decorating is less work
than it looks
um so you can you can charge more per
hour for decorating
um
unless you're a very good carpenter
and then i met a real carpenter
and i realized i was completely hopeless
at carpentry
and so he
he's making a door
for a basement for a
coal cellar under the sidewalk that was
very damp
and he was taking pieces of wood and
arranging them
so that they would warp in opposite
directions so that it would cancel out
and that was kind of a level of kind of
understanding and thought about the
process that never occurred to me he
could also take a piece of wood and just
cut it exactly square with a hand saw
and he explained something useful to me
he said if you want to cut a piece of
wood square
you have to
line the saw bench up with the room and
you have to line the piece of wood up
with the room
you can't cut it square if it's not
aligned with the room
which is very interesting in terms of
coordinate frames
so
anyway because i was so hopeless
compared with him i decided i might as
well go back into ai
now when you say get back into ai
as i understand this was at the
university of edinburgh where you went
for your phd
yeah
i went to do a phd there and i went to a
phd on neural networks
with
an eminent professor called christopher
longuet-higgins
who was really very brilliant um
he almost got a nobel prize when he was
in his 30s for figuring out something
about the structure of boron hydride
um
and i i still don't understand what it
is because it'll do with quantum
mechanics but it hinged on the fact that
360 degree rotation is not the identity
operator it's 720 degrees
um
there's a thing you want to find one's
books about it
um anyway
he was interested in neural nets and the
relation to holograms and about the day
i arrived in edinburgh he lost interest
in neural nets
because he read winograd's thesis and he
became completely converted
um
he thought neural nets was the wrong way
to think about it we should do symbolic
ai he was very impressed by winograd's
thesis
um
and so we had
he had a lot of integrity
so even though he completely disagreed
with what i was doing he didn't
stop me doing it
he kept trying to get me to do
stuff more like winograd's thesis but he let
me carry on doing what i was doing
um
and
yeah i was a bit of a loner everybody
else back then in the early 70s was
saying minsky and papua shown that
neural nets are nonsense why are you
doing this stuff it's crazy
and in fact
the first talk i ever gave to that group
was about how to do true recursion with
neural networks um
so this talk was in 1973 so 49 years ago
and so my one of my first projects i
discovered a write-up of it recently
was
um
you want a neural network
that will
be able to draw a shape
and
you want it to parse the shape into
parts
and you want it to be possible for a
part of the shape
to be drawn by the same neural
hardware as the whole shape's being
drawn by
so
the neural hardware that's storing the
whole shape
has to remember where it's got to in the
whole shape and what the
orientation and position size is for the
whole shape
but now it has to go off and you want to
use the very same neurons
for drawing a part of the shape
so you need somewhere to remember what
the whole shape was and how far you got
in it
so you can pop back to that once you've
finished doing this subroutine this part
of the shape
and the question is how is the neural
network going to remember that because
obviously you can't just copy the
neurons
and so i managed to get a system working
where the neural network remembered it
by having fast hebbian weights
that were just adapting all the time
and were adapting so that
any state that it had been in recently
could be retrieved by giving it part of
that state and then say fill in the rest
and so i had a neural net that was doing
true recursion reusing the same neurons
and the same weights
to do the recursive call as it used for
the high level call
and that was in 1973 and
the
i think people didn't understand the
talk because i wasn't very good at
giving talks but they also said why
would you want to do recursion within
your match you can do recursion with
lisp
um
they didn't understand the point which
is that
unless we get neural nets to do
something like recursion we're never
going to be able to explain a whole
bunch of things
and
now that's become sort of an interesting
question again
so i'm going to wait one more year until
that idea is an antique a genuine
antique it'll be 50 years old and then
i'm gonna sort of write up the research
i did then and it was all about fast
weights used as a memory so
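A tiny NumPy sketch of "fast weights" as a temporary associative memory (an illustrative reconstruction of the idea, not Hinton's 1973 system): a recently visited state is stored in a rapidly adapting Hebbian matrix and can later be retrieved from a partial cue.

```python
# Fast weights as a quickly written, quickly fading associative memory.
import numpy as np

rng = np.random.default_rng(0)
dim = 64
fast_W = np.zeros((dim, dim))                     # fast, temporary weights
state = np.sign(rng.normal(size=dim))             # the "where was I?" state to remember

fast_W += np.outer(state, state) / dim            # Hebbian store before the recursive call
cue = state.copy()
cue[: dim // 2] = 0                               # later: only part of the state is available
recalled = np.sign(fast_W @ cue)                  # fill in the rest from the fast weights
print("recovered fraction:", np.mean(recalled == state))
```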
i have many questions here jeff one the
first one is
you're standing in this room
where everybody's you're a phd student
or maybe fresh out of phd
you're standing in a room with
essentially everybody telling you what
you're working on
is a waste of time
and
you were convinced somehow was not where
do you get that conviction from
i think a large part of it was my
schooling
so
my father was a communist
but he sent me to an expensive private
school because they had good science
education
and i was there from the age of seven
while they had a preschool
and it was a christian school
and all the other kids believed in god
and it was just
at home i was taught that that was
nonsense
and it did seem to me that
it was nonsense
um
and so
i was used to
just having
everybody else being wrong and obviously
wrong
and i think that's important
i think you need you need
i was about to say you need the faith
which is funny in this
situation
um you need the faith in science
to um
be willing to work on stuff just because
it's obviously right
even though everybody else says it's
nonsense
and in fact it wasn't everybody else it
was everybody else in the early 70s
doing ai said it was nonsense or nearly
everybody else
um
but if you look a bit early if you look
in the 50s
both von neumann and turing believed in
neural nets
turing in particular believed in neural
nets training with reinforcement um
so
if i i still believe if they hadn't both
died early the whole history of ai might
have been very different
because they were sort of powerful
enough intellects to have swayed a field
and
they were very interested in sort of how
does the brain work
so i think it was just bad luck they both
died early
well british intelligence might have
come into it but
now
you go from believing in this
well
at the time many people didn't
to getting the big breakthrough results that
power almost everything
that's being done today and now there is
this in some sense the next question
right
is
it's not just that deep learning works
and works great the question becomes
is it all we need or
will we need other things
and you've said things
maybe i'm not literally quoting you but
to the extent of deep learning will do
everything what i really meant by that i
i i sometimes say things without
thinking without being accurate enough
and then people call me
like saying we won't need radiologists
um
so
what i really meant was um
using stochastic gradient descent to adjust
just a whole bunch of parameters that's
what i sort of had in mind when i said
deep learning
the way you get the gradient might not
be back propagation
and the thing you get the gradient of
might not be some final performance
measure but rather these lots of local
objective functions
but i think that's how the brain works
and i think that's gonna explain
everything yes well nice nice to see it
confirmed um
so one other thing i want to say is the
kind of computers we have now
um
are very good for
um doing banking
because they can remember how much you
have in your account it wouldn't be so
good if you went in and they said well
you got roughly this much we're not
really sure because we don't do it to
that precision but roughly this much
um we don't want that in a computer
doing banking
um
or in a computer guiding the space
shuttle or something we would really
rather it got the answer exactly right
um
and they're very different from us
and i think
people aren't sufficiently aware
that we made a decision
about how computing would be
um which is that
um
our computer our
knowledge will be immortal
so if you look at
existing computers you have a computer
program
or maybe you just have a lot of weights
for a neural net that's a different kind
of program
um but if your hardware dies you can run
the same program on another piece of
hardware
and so that makes the knowledge immortal
it doesn't hinge on that particular
piece of hardware surviving now the cost
of the immortality is huge
because it means that two different
bits of hardware have to do exactly the
same thing obviously with error correction and all
that but after you've done all the error
correction they have to do exactly the
same thing
which means they'd better be digital or
mostly digital
um
and they're probably gonna do things
like multiplying numbers together which
involves
using lots and lots of energy to make
things very discrete
which is not what hardware really wants
to be
and so as soon as you
commit yourself to the immortality of
your program or your neural net
you're committed to
um very
expensive computations and also to very
expensive manufacturing processes you
need to manufacture these things
accurately and probably in 2d and then
put lots of 2d things together
um if you're just willing to give up on
immortality
sort of in fiction normally what you get
in return is love
um
but if if we're willing to give up
immortality what we'll get in return is
very low energy computation and very
cheap manufacturing
so
instead of manufacturing computers
what we should do is grow them
um we should use nanotechnology to just
grow the things in 3d
and
each one will be slightly different
so the image i have is if you take a pot
plant
and you sort of pull it out of its pot
there's a root ball and it's the shape
of the pot right
and so all the different pot plants have
the same shaped root ball but the
details of the roots are all different
but they're all doing the same thing
they're extracting nutrients from the
soil and they got the same function and
they're pretty much the same
but the details are all very different
um
so that's what real brains are like
and i think that's what
what i call mortal computers will be
like
so these are computers that are grown
rather than manufactured
you can't program them they just learn
they obviously have to have a learning
algorithm sort of built into them
they learn they can do most of their
computation in analog
because analog is very good for doing
things like taking a voltage times a
resistance and turning it into a charge
and then adding up the charge and there are
already chips that do things like that
the problem is what do you do next
um and how do you learn in those chips
and at present people have suggested
back propagation or various versions of
boltzmann machines
um
i think we're going to need something
else
But I think sometime in the not too distant future we're going to see mortal computers, which are very cheap to create, have to get all their knowledge in there by learning, and are very low energy. And these mortal computers, when they die, they die, and their knowledge dies with them. It's no use looking at the weights, because those weights only work for that particular hardware. So what you have to do is distill the knowledge into other computers, and when these mortal computers get old they're going to have to do lots of podcasts to try and get the knowledge into younger ones.

The first one you build, I'll happily have it on the show; let me know.
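One way to read "distill the knowledge into other computers" is the familiar idea of training a new network on the old network's softened outputs. Here is a minimal numpy sketch of that kind of distillation loss, assuming soft targets at a temperature T; the names and numbers are illustrative, not something specified in the conversation.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between the teacher's softened output distribution and
    the student's, averaged over a batch: the student learns the teacher's
    relative probabilities, not just its top answer."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    return -np.mean(np.sum(p_teacher * log_p_student, axis=-1))

# toy batch: 3 examples, 5 classes
teacher_logits = np.array([[4.0, 1.0, 0.2, -1.0, 0.0],
                           [0.1, 3.5, 0.0, 0.0, -2.0],
                           [0.0, 0.0, 2.5, 2.4, -1.0]])
student_logits = np.random.default_rng(1).normal(size=(3, 5))
print(distillation_loss(student_logits, teacher_logits))
```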
So, Jeff, this reminds me of another question that's been on my mind for you, which is: when you think about today's neural nets, the ones that grab the headlines are very, very large. Not as large as the brain, maybe, but in some sense starting to get into that range, the large language models, and the results look very, very impressive. So first, I'm curious about your take on those kinds of models, what you see in them and what you see as limitations. But second, I'm also curious what you think about working on the other end of the spectrum. For example, ants have much smaller brains than humans, obviously, yet it's fair to say that the visual and motor systems we have developed artificially are not yet at the level of what ants can pull off, or bees and so forth. So I'm curious about that spectrum as well as the recent big advances in language models, and where you stand on those.

So bees may look small to you, but I think a bee has about a million neurons, so I think a bee is closer to GPT-3; a bee is actually quite a big neural net.
My belief is that if you take a system with lots of parameters, and they're tuned sensibly using some kind of gradient descent on some kind of sensible objective function, then you'll get wonderful properties out of it, and you'll get all these emergent properties, like you do with GPT-3 and also the Google equivalents. That doesn't settle the issue of whether they're doing it the same way we do. I think we do a lot more things like recursion, which I think we do in neural nets, and I tried to address some of these issues in a paper I put on the web last year called GLOM, which is about how you do part-whole hierarchies in neural nets. You definitely have to have structure, and if what you mean by symbolic computation is just that you have part-whole structure, then we do symbolic computation. But that's not normally what people meant by symbolic computation. Hard-line symbolic computation means you're using symbols, and you're operating on symbols using rules that just depend on the form of the symbol string you're processing, and the only property a symbol has is that it's either identical or not identical to some other symbol, and perhaps that it can be used as a pointer to get at something. Neural nets are very different from that. So the hard-line symbol processing, I don't think we do that, but we certainly deal with part-whole hierarchies, and I think we do it in great big neural nets.
And I'm sort of up in the air at present as to what extent GPT-3 really understands what it's saying. I think it's fairly clear it's not just like the old ELIZA program, which just rearranged strings of symbols and had no clue what it was talking about. The reason for believing that is, you say in English, show me a picture of a hamster wearing a red hat, and it draws a picture of a hamster wearing a red hat, and you're fairly sure it never saw that pair before. So it has to understand the relationship between the English string and the picture. And before it had done that, if you'd asked any of these doubters, these neural net skeptics, neural net deniers let's call them, how you would show that it understands, I think they'd have accepted that if you ask it to draw a picture of something and it draws a picture of that thing, then it understood. Just as with Winograd's thesis: you ask it to put the blue block in the green box, and it puts the blue block in the green box, and that's pretty good evidence it understood what you said. But now that it does it, of course, the skeptics say, well, you know, that doesn't really count.

There's nothing that would satisfy them, basically.

Yeah.

The goal line is always moving for true skeptics.

Yeah.
Now there's the recent one, the Google one, the PaLM model, which in the paper showed how it was effectively explaining how jokes work. That was extraordinary; it just seemed a very deep understanding of language. Or was it just rearranging the words it had in its training data, do you think?

No. I didn't see how it could generate those explanations without sort of understanding what's going on. Now, I'm still open to the idea that, because it was trained with back propagation, it's going to end up with a very different sort of understanding from us.
And obviously adversarial images tell you a lot: you can recognize objects by using their textures, and you can be correct about it, in the sense that it'll generalize to other instances of those objects, but it's a completely different way of doing it from what we do. I like to think of the example of insects and flowers. Insects can see in the ultraviolet, so two flowers that look the same to us can look completely different to insects. And now, because the flowers look the same to us, do we say the insects are getting it wrong? These flowers evolved with the insects to give them signals in the ultraviolet to tell them which flower is which, so it's clear the insects are getting it right and we just can't see the difference. That's another way of thinking about adversarial examples. This thing that the net says is an ostrich looks like a school bus to us, but if you look in the texture domain then it actually is an ostrich. So the question is who's right. In the case of the insects, just because two flowers look identical to us doesn't mean they're really the same; the insects are right about them being very different. In that case it's different parts of the electromagnetic spectrum that indicate the difference, parts we can't pick up on, and it could be something similar in the case of image recognition for our current neural nets.

You could argue, maybe, that since we build them and we want them to do things for us in our world, we really don't want to just say, okay, they got it right and we got it wrong. I mean, they need to recognize the car and the pedestrian.

Yeah, I agree. I just want to show it's not as simple as you might think to say who's right and who's wrong. And part of the point of my GLOM paper was to try and build perceptual systems that work more like us, so they're much more likely to make the same kinds of mistakes as us and not make very different kinds of mistakes. Obviously, if you've got a self-driving car, for example, and it makes a mistake that any normal human driver would have made, that seems much more acceptable than making a really dumb mistake.
So, Jeff, as I understand it, sleep is something you also think about. Can you say a bit more?

Yes, I often think about it when I'm not sleeping at night. So there's something funny about sleep. Animals do it; fruit flies sleep, and it may just be to stop them flying around in the dark. But if you deprive people of sleep they go really weird: if you deprive someone of sleep for three days they'll start hallucinating, and if you deprive someone for a week they'll go psychotic and never recover. These are nice experiments done by the CIA, I think. And the question is why: what is the computational function of sleep? There's presumably some pretty important function for it if depriving you of it makes you completely fall apart. Current theories are things like consolidating memories, or downloading things from the hippocampus into the cortex, which is a bit odd since it had to come through the cortex to get into the hippocampus in the first place.
A long time ago, in the early 80s, Terry Sejnowski and I had this theory called Boltzmann machines, and it was partly based on an insight of Francis Crick's. When he was thinking about Hopfield nets, Francis Crick and Graeme Mitchison had a paper about sleep, and the idea that you would hit the net with random things and tell it not to be happy with random things. So in a Hopfield net, you give it something you want it to memorize, and it changes the weights so the energy of that vector is lower; the idea is that if you also give it random vectors and say make the energy higher, the whole thing works better. That led to Boltzmann machines, where we figured out that if, instead of giving it random things, you give it things generated from a Markov chain, the model's own Markov chain, and you say make those less likely and make the data more likely, that is actually maximum likelihood learning. We got very excited about that, because we thought, okay, that's what sleep is for: sleep is this negative phase of learning.
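The learning rule being described is the two-phase update: raise the probability of the data and lower the probability of the model's own samples. Below is a minimal sketch for a tiny restricted Boltzmann machine, using one step of the model's own Markov chain as the negative phase (contrastive divergence); biases are omitted and the details are illustrative rather than taken from the conversation.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v_data, lr=0.1):
    """One contrastive-divergence step for a small RBM.
    Positive phase: statistics measured on the data.
    Negative phase: statistics measured on a one-step sample from the
    model's own Markov chain."""
    h_data = sigmoid(v_data @ W)                       # hidden probs given data
    v_model = (sigmoid(h_data @ W.T) > rng.random(v_data.shape)).astype(float)
    h_model = sigmoid(v_model @ W)                     # hidden probs given sample
    positive = v_data.T @ h_data                       # make the data more likely
    negative = v_model.T @ h_model                     # make the model's samples less likely
    return W + lr * (positive - negative) / len(v_data)

W = rng.normal(scale=0.01, size=(6, 4))                # 6 visible units, 4 hidden units
v_batch = rng.integers(0, 2, size=(8, 6)).astype(float)
W = cd1_update(W, v_batch)
print(W.shape)
```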
It comes up again now in contrastive learning, where you take two patches from the same image and try to get them to have similar representations, and two patches from different images and try to get them to have representations that are sufficiently different. Once they're different you don't make them any more different, but you stop them being too similar; that's how contrastive learning works. Now, with Boltzmann machines you couldn't actually separate the positive phase from the negative phase; you had to interleave positive examples and negative examples, otherwise the whole thing would go wrong. I tried hard to avoid interleaving them, and it's quite hard to do a lot of positive examples followed by a lot of negative examples. What I discovered a couple of years ago, which got me very excited and caused me to agree to give lots of talks that I then cancelled when I couldn't make it work better, was that with contrastive learning you can actually separate the positive and negative phases: you can do lots of examples of positive pairs followed by lots of examples of negative pairs.
And that's great, because what it means is you can have something like a video pipeline where you're just trying to make things similar while you're awake, and trying to make things dissimilar while you're asleep, if you can figure out how sleep can generate video for you. It makes the contrastive learning idea much more plausible if you can separate the positive and negative phases, do them at different times, and do a whole bunch of positive updates followed by a whole bunch of negative updates. Even for standard contrastive learning you can do that moderately well; you have to use lots of momentum and things like that, there are all sorts of little tricks to make it work, but you can make it work.
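As a sketch of the two phases being described, here are a positive-pair loss (patches from the same image pulled together) and a negative-pair loss (patches from different images pushed apart, but only up to a margin, so that "once they're different you don't make them any more different"). The particular losses and the margin are illustrative assumptions; the point is only that the two kinds of updates can be run as separate blocks rather than interleaved.

```python
import numpy as np

def positive_loss(z1, z2):
    """'Awake' phase: make the representations of two patches from the
    same image similar (squared distance)."""
    return np.mean(np.sum((z1 - z2) ** 2, axis=-1))

def negative_loss(z1, z2, margin=1.0):
    """'Sleep' phase: push representations of patches from different images
    apart, with a hinge so that pairs beyond the margin are left alone."""
    d = np.sqrt(np.sum((z1 - z2) ** 2, axis=-1))
    return np.mean(np.maximum(0.0, margin - d) ** 2)

rng = np.random.default_rng(0)
pos_a, pos_b = rng.normal(size=(32, 16)), rng.normal(size=(32, 16))  # same-image pairs
neg_a, neg_b = rng.normal(size=(32, 16)), rng.normal(size=(32, 16))  # different-image pairs

# Run many positive updates in one block, then many negative updates in
# another block, rather than interleaving them on every batch.
print(positive_loss(pos_a, pos_b), negative_loss(neg_a, neg_b))
```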
So I think it's quite likely that the function of sleep is to do unlearning on negative examples, and that's why you don't remember your dreams: you don't want to remember them, you're unlearning them. Crick pointed this out. You will remember the ones that are in the fast weights when you wake up, because the fast weights are a temporary store, so that part isn't unlearning; that still works the same way. But for the long-term memory, the whole point is to get rid of those things, and that's why you dream for many hours a night but when you wake up you can only remember the last minute of the dream you were having when you woke up. I think this is a much more plausible theory of sleep than any other I've seen, because it explains why, if you got rid of it, the whole system would fall apart: you'd go disastrously wrong and start hallucinating and doing all sorts of weird things.
And let me say a little bit more about the need for negative examples when you're doing contrastive learning. If you've got a neural net and it's trying to optimize some internal objective function, something about the kinds of representations it has, or something about the agreement between contextual predictions and local predictions, it wants that agreement to be a property of the real data. The problem inside a neural net is that you might get all sorts of correlations in your inputs. I'm a neuron, right, so I get all sorts of correlations in my inputs, and those correlations may have nothing to do with the real data; they're caused by the wiring of the network and the way it's embedded in the network. If these two neurons are both looking at the same pixel, they'll have a correlation, but that doesn't tell you anything about the data. So the question is how you learn to extract structure that's about the real data and not about the wiring of your network, and the way to do that is to feed it positive examples and say: find structure in the positive examples that isn't in the negative examples. The negative examples go through exactly the same wiring, so if the structure is in the positive examples but not in the negative examples, then it's about the difference between the positive and negative examples, not about your wiring. People don't think about this much, but if you have powerful learning algorithms, you had better not let them learn about the neural network's own weights and wiring; that's not what's interesting.
Now, when you think about people who don't get sleep and start hallucinating, is hallucinating effectively trying to do the same thing, just while you're awake?

Obviously you can have little naps, and that's very helpful, and maybe hallucinating when you're awake is serving the same function as sleep. I mean, all the experiments I've seen say it's better not to have 16 hours awake and eight hours of sleep; it's better to have a few hours awake and a few hours of sleep. And a lot of people have discovered that little naps help. Einstein used to take little naps all the time, and he did okay.

Yeah, he did very well, for sure.
Now, there's this other thing you've brought up, this notion of the student beating the teacher. What does that refer to?
Okay, so a long time ago I did an experiment on MNIST, which is a standard database of handwritten digits, where you take the training data and corrupt it: you corrupt it by substituting a wrong label, one of the other nine labels, eighty percent of the time. So now you've got a data set in which the labels are correct 20 percent of the time and wrong 80 percent of the time, and the question is, can you learn from that, and how well? And the answer is you can learn to get about 95 percent correct on it. So now you've got a teacher who's wrong 80 percent of the time, and the student is right 95 percent of the time: the student is much, much better than the teacher.
And this isn't a setup where you corrupt an example afresh each time you see it: you take the training examples and corrupt them once and for all, so you can't average away the corruption over repeated presentations of the same case. You might be able to average it away over different training cases that happen to have similar images. And if you ask how many training cases you need if you have corrupted ones: this was of great interest because of the Tiny Images data set some time ago, where they had 80 million tiny images with a lot of wrong labels in them, and the question is whether you would rather have a million things that are flakily labeled or 10,000 things with accurate labels. I had a hypothesis that what counts is the amount of mutual information between the label and the truth. If the labels are corrupted ninety percent of the time, there's no mutual information between the labels and the truth; if they're corrupted eighty percent of the time, there's only a small amount of mutual information. My memory is that it's about 0.06 bits per case, whereas if the labels are uncorrupted it's about 3.3 bits per case.
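Those figures can be checked directly. Assuming ten balanced classes and wrong labels drawn uniformly from the other nine, a label that is right 20 percent of the time carries about 0.064 bits of mutual information with the truth, a clean label carries log2(10), about 3.32 bits, and at 90 percent corruption the mutual information drops to zero, matching the statement above. A small sketch of the calculation:

```python
import numpy as np

def label_truth_mi(p_correct, n_classes=10):
    """Mutual information (bits per case) between the assigned label and the
    true label, for balanced classes and uniformly-chosen wrong labels."""
    p_true = 1.0 / n_classes
    mi = 0.0
    for same in (True, False):
        p_label_given_true = p_correct if same else (1 - p_correct) / (n_classes - 1)
        count = 1 if same else n_classes - 1          # label values per true class
        joint = p_true * p_label_given_true
        if joint > 0:
            mi += count * n_classes * joint * np.log2(p_label_given_true / p_true)
    return mi

print(label_truth_mi(0.2))   # ~0.064 bits: labels wrong 80% of the time
print(label_truth_mi(1.0))   # ~3.32 bits: clean labels
print(label_truth_mi(0.1))   # 0 bits: at 90% corruption the label tells you nothing
```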
So it's only a tiny amount, and then the question is: suppose I balance the size of the training set so it contains as much total mutual information. If each case has a fiftieth of the mutual information and I have fifty times as many examples, do I get the same performance? And the answer is yes, you do, to within a factor of two; the training set actually needs to be about twice that big, but roughly speaking you can see how useful a training example is from the amount of mutual information between the label and the truth.
And I noticed recently you have something for doing sim-to-real, where you're labeling real data using a neural net and those labels aren't perfect, and then you take the student that learned from those labels and the student is better than the teacher it learned from. People are always puzzled by how the student could be better than the teacher, but in neural nets it's very easy: the student will be better than the teacher if there's enough training data, even if the teacher is very flaky. I have a paper from a few years ago with Melody Guan about this, for some medical data; the first part of the paper talks about it. But the rule of thumb is basically that what counts is the mutual information between the assigned label and the truth, and that tells you how valuable a training example is, so you can make do with lots of flaky ones.
That's so interesting. Now, in the work we did that you just referenced, and in the work I've seen that's been quite popular recently, usually the teacher provides noisy labels, but then not all the noisy labels are used; there's a notion that you only look at the ones where the teacher is more confident. In your description that doesn't seem to matter.

That's obviously a good hack, but you don't need to do it. It probably helps to only look at the ones where you have reason to believe the teacher got it right, but it'll work even if you just look at them all. And there's a phase transition.
So with MNIST, Melody plotted a graph, and as soon as you get about 20 percent of the labels right, your student will get about 95 percent correct.

Wow.

But as you get down to about 15 percent right, you suddenly get a phase transition where you don't do any better than chance. Because somehow, as the teacher asserts these labels, the student has to in some sense understand which cases are right and which are wrong, and see the relationship between the labels and the inputs. Once the student has seen that relationship, a wrongly labeled thing is just very obviously wrong, so it's fine if things are randomly mislabeled; but there is a phase transition where the labels have to be good enough for the student to get the idea.

That explains how our students are all smarter than us: we only need to get it right a small fraction of the time.

Right, and I'm sure the students do some of this data curation, where you say something and the student thinks, oh, that's rubbish, I'm not going to listen to that.

Those are the very best students, you know.

Yeah, those are the ones that can surprise us.
Now, one of the things that's really important in neural net learning, especially when you're building models, is to get an understanding of what it is actually learning, and often people try to somehow visualize what's happening during learning. One of the most prevalent visualization techniques is called t-SNE, which is something you invented, Jeff, so I'm curious how you came up with it. Maybe first describe what it does, and then what's the story behind it.
So if you have some high dimensional data and you try to draw a 2D or 3D map of it, you could take the first two principal components and just plot those. But what principal components analysis cares about is getting the big distances right: if two things are very different, it's very concerned to make them very different in the 2D space. It doesn't care at all about the small differences, because it's operating on the squares of the big differences, so it won't preserve high dimensional similarity very well. And you're often interested in just the opposite: you've got some data, you're interested in what's very similar to what, and you don't care if it gets the big distances a bit wrong, as long as it gets the small distances right.
So I had the idea, a long time ago, of taking the distances and turning them into probabilities of pairs. There are various versions of this, but suppose we turn them into the probability of a pair, such that pairs with a small distance are probable and pairs with a big distance are improbable. We're converting distances into probabilities in a way that makes small distances correspond to big probabilities, and we do that by putting a Gaussian around a data point and computing the density of the other data point under that Gaussian. That's an unnormalized probability; you then normalize these things, and then you try to lay the points out in 2D so as to preserve those probabilities. So it won't care much about two points that are far apart: they'll have a very low pairwise probability, and it doesn't care about the relative positions of those two points, but it does care about the relative positions of the ones with high pairwise probabilities. That produced quite nice maps, and it was called stochastic neighbour embedding, because the way we thought of it was that you put a Gaussian around a point and stochastically pick a neighbour according to the density under the Gaussian. I did that work with Sam Roweis, and it had very nice simple derivatives, which convinced me we were onto something, and we got nice maps, but they tended to crowd things together.
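The pairwise probabilities being described can be written down in a few lines. A minimal sketch, assuming a single global Gaussian bandwidth (the actual SNE and t-SNE tune a per-point bandwidth via a perplexity parameter):

```python
import numpy as np

def sne_probabilities(X, sigma=1.0):
    """Turn pairwise distances into normalised probabilities of picking each
    other point as a neighbour, using a Gaussian centred on each point."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    affinities = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(affinities, 0.0)          # a point is not its own neighbour
    return affinities / affinities.sum(axis=1, keepdims=True)

X = np.random.default_rng(0).normal(size=(5, 50))  # 5 points in 50 dimensions
P = sne_probabilities(X)
print(P.round(3))   # each row sums to 1; small distances get big probabilities
```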
And there's obviously a basic problem in converting high dimensional data into low dimensional data, so SNE, stochastic neighbour embedding, tends to crowd things together, and that's because of the nature of high dimensional and low dimensional spaces. In a high dimensional space, a data point can be close to lots of other points without them all being close to each other; in a low dimensional space, if they're all close to this data point, they all have to be close to each other. So you've got a problem in embedding closenesses from high dimensions into low dimensions.
And I had the idea, when I was doing SNE, that since I was using probabilities as this kind of intermediate currency, there should be a mixture version, where you say: in high dimensions the probability of a pair is proportional to e to the minus the squared distance, under a Gaussian, and in low dimensions, if you have two different maps, the probability of a pair is the sum of e to the minus the squared distance in the first 2D map and e to the minus the squared distance in the second 2D map. That way, if we have a word like bank and we're trying to put similar words near one another, bank can be close to greed in one map and close to river in the other map, without river ever being close to greed. I really pushed that idea, because I thought it was a really neat idea that you could have a mixture of maps. One of my students was among the first to work on it, James Cook worked on it a lot, and several other students worked on it too, and we never really got it to work well. I was very disappointed that we hadn't been able to make use of the mixture idea.
Then I went to a simpler version, which I called UNI-SNE, which was a mixture of a Gaussian and a uniform, and that worked much better. The idea is that in one map all pairs are equally probable, and that gives you a small background probability which takes care of the big distances; then in the other map you contribute a probability proportional to e to the minus the squared distance in that map. It means that in this other map things can be very far apart if they want to be, because the fact that they need some probability is taken care of by the uniform.

Then I got a paper from someone called Laurens van der Maaten, which I thought was a published paper because of the form it arrived in, but it wasn't actually published. He wanted to come and do research with me, and because I thought he had this published paper I invited him. It turned out he was extremely good, so it's lucky I'd been mistaken in thinking it was a published paper. We started on UNI-SNE, and then I realized that UNI-SNE is actually a special case of using a mixture of a Gaussian and a very, very broad Gaussian, which is effectively a uniform. So what if we used a whole hierarchy of Gaussians, many Gaussians with different widths? That's called a t-distribution, and that led to t-SNE, and t-SNE works much better.
And t-SNE has a very nice property: it can show you things at multiple scales, because it has a kind of one over d squared behaviour, so once distances get big it behaves just like gravity, with clusters of galaxies, galaxies, and clusters of stars and so on. You get structure at many different levels; the coarse structure and the fine structure all show up.
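For anyone who wants to try this on their own data, scikit-learn ships an implementation; a typical usage sketch (dataset and parameter values are just illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Map the 64-dimensional digits data down to 2-D for visualisation.
# Perplexity plays roughly the role of the per-point Gaussian bandwidth
# discussed above.
X, y = load_digits(return_X_y=True)
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(embedding.shape)   # (1797, 2); plot coloured by y to see the clusters
```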
Now, the objective function used for all this, which was these relative densities under a Gaussian, came from other work I did earlier with Alberto Paccanaro that we found hard to get published. I got a review of that work, when it was rejected by some conference, saying that Hinton has been working on this idea for seven years and nobody's interested. I take reviews like that as telling me I'm on to something very original. That work actually had the function in it that's now used, I think it's called NCE, in these contrastive methods, and t-SNE is actually a version of that function, but used for making maps. So there's a very long history to t-SNE: getting the original SNE, then trying to make a mixture version and it just not working and not working, and then eventually figuring out that a t-distribution was the kind of mixture you wanted to use, and Laurens arriving. Laurens was very smart and a very good programmer, and he really made it all work beautifully.
This is really interesting, because it seems that in a lot of the progress these days the bigger idea plays the big role, but here it seems it was really getting the details right that made it fully work.

You typically need both. You have to have a big idea for it to be interesting, original stuff, but you also have to get the details right. And that's what graduate students are for.

So, Jeff, thank you for such a wonderful conversation, for part one of our season finale.