Geoffrey Hinton | Ilya's AI tutor: Talent and intuition have already led to today's large AI models

Me&ChatGPT
9 Jun 2024 · 24:32

Summary

TLDR: In this interview, Geoffrey Hinton recalls his time at a research unit in England and his AI research at Carnegie Mellon and the University of Edinburgh. He describes his explorations of brain science, philosophy, and artificial intelligence, and his collaborations with Terry Sejnowski, Peter Brown, and others. The conversation covers the development of neural networks, deep learning, language models, and multimodal learning, and the importance of GPUs in neural network training. He also shares his views on the future of AI, including models' capacity for reasoning and creativity.

Takeaways

  • 🍺 At the research unit in England, everyone went to the pub for a drink at six in the evening.
  • 🔬 At Carnegie Mellon, the lab was full of students on a Saturday night; they believed their work would change the future of computer science.
  • 🧠 The brain course at Cambridge was a disappointment: it only covered how neurons conduct action potentials.
  • 📚 Reading Donald Hebb and John von Neumann sparked his interest in AI.
  • 🤔 The brain does not learn by logical inference; it learns by modifying the connection weights of a neural network.
  • 👨‍🏫 The collaboration with Terry Sejnowski on Boltzmann machines was the most exciting of his career.
  • 🎤 Peter Brown taught him a great deal about speech recognition and hidden Markov models.
  • 🤝 Ilya showed remarkable intuition: he held that a big enough model would simply work better.
  • 💡 Large-scale data and computation are what made the new AI algorithms work.
  • 📈 Multimodal models will be more efficient and will understand spatial relationships better.

Q & A

  • Why were so many students in the lab at 9 pm on a Saturday night at Carnegie Mellon?

    -Because they believed the work they were doing was the future, and that what they did next would change the course of computer science.

  • Why was Hinton initially disappointed studying physiology at Cambridge?

    -Because all they were taught was how neurons conduct action potentials, which does not explain how the brain works.

  • Why did he ultimately choose to go to Edinburgh to study artificial intelligence?

    -Because in AI, at least, he could test theories by simulating things, which was far more interesting than his experience studying physiology and philosophy at Cambridge.

  • Which of Donald Hebb's ideas influenced him?

    -Hebb's book on how learning changes the connection strengths in a network of neurons influenced him greatly.

  • Who was his main collaborator at Carnegie Mellon, and what did they work on?

    -His main collaborator was Terry Sejnowski, at Johns Hopkins in Baltimore; together they studied Boltzmann machines, a model they believed at the time explained how the brain works.

  • What influence did Peter Brown have?

    -Peter Brown, a very good statistician, taught him a great deal about speech recognition and introduced him to hidden Markov models, which had a lasting influence on his later research.

  • What insightful question did Ilya Sutskever ask in Hinton's office?

    -Ilya asked why the gradient isn't simply handed to a sensible function optimizer, a question that took them years to think through.

  • Why does Hinton think large neural network models can do better than their training data?

    -Because they can learn from badly labeled data and still get much better results, which shows they can recognize errors in the training data and correct for them.

  • How does he see the potential of current neural network models for multimodal learning?

    -Once models can process multimodal data such as images, video, and sound, they will understand spatial relationships much better, making them far stronger at understanding objects and at creative reasoning.

  • Why was using GPUs for neural network training a good idea?

    -Because GPUs excel at matrix multiplication, the basic operation of neural network training, so they speed training up dramatically.

  • How does he see the future of digital versus analog computation?

    -Although analog computation may be closer to how the brain works, digital computation wins on knowledge sharing and efficiency, so digital systems are likely to dominate.

Outlines

00:00

😀 Lab Nights and an Awakening to AI

Hinton contrasts a research unit in England, where everyone went off to the pub at six, with Carnegie Mellon, where the lab was full of students on a Saturday night because they believed their work would change the course of computer science; the atmosphere was completely different from England and newly inspiring. He also recounts studying physiology and philosophy at Cambridge and eventually moving to the University of Edinburgh to do AI research, an interest shaped in part by books by Donald Hebb and John von Neumann on how the brain learns and computes.

05:01

🤖 Early Explorations and Collaborations in AI

Hinton describes his main collaboration during the Carnegie Mellon years, with Terry Sejnowski at Johns Hopkins in Baltimore, working on Boltzmann machines. The collaboration was hugely inspiring, though they later concluded that Boltzmann machines are probably not how the brain actually works. He also recalls working with the statistician Peter Brown, who introduced him to hidden Markov models; the two learned a great deal from each other.

10:02

👨‍💻 Working with Ilya and Intuition in AI

Hinton recalls his collaboration with Ilya, a young student with deep intuitions about AI. Ilya questioned the way backpropagation was used, arguing that the gradient should be handed to a more sensible function optimizer, an idea that fed years of subsequent thinking. Hinton credits Ilya's intuition and inventiveness as an important influence on the field. Together they worked on many engaging projects, including maps of data and an interface Ilya wrote that compiled another language into MATLAB.

15:02

🧠 How Large Language Models Learn and Innovate

Hinton discusses how large language models learn by finding common structure, and predicts how they will become more creative. He cites AlphaGo to show how AI can go beyond existing knowledge within a specific domain, and argues that as models scale up they will do more reasoning and may develop innovative strategies through approaches like self-play.

20:02

🔢 Neural Networks Meet GPUs

Hinton recounts how he came to champion GPUs for training neural networks. A former student, Rick Szeliski, suggested using graphics cards for the matrix multiplications, which set him thinking. They started with gaming GPUs, then moved to a Tesla system, which sped training up enormously. At NIPS he told the audience to buy Nvidia GPUs, then asked Nvidia for a free one as thanks; they never replied, but Jensen later gave him one. He also came to appreciate the immortality of digital computation, which lets knowledge be shared efficiently across different hardware.

🌐 Multimodal Learning and Human Thought

Hinton explores how multimodal learning gives models a better grasp of spatial concepts and improves their reasoning. Although in principle a very good model could be learned from language alone, a multimodal system makes the learning much easier. He also discusses the relationship between language and cognition, laying out three views of language: the symbolic view, the vector-space ("thought vector") view, and the embedding view, and argues that the embedding view is the more plausible model of human thought.

Keywords

💡Neural Networks

A neural network is a computational model loosely inspired by the structure of neurons in the brain, used for complex pattern-recognition and decision tasks. Neural networks are the core of the discussion, particularly around the development of AI and machine learning. For example, the video describes how complex behavior is achieved by changing the weights of the connections in a neural network, which reflects their foundational role in AI.

💡Deep Learning

Deep learning is a subfield of machine learning that uses multi-layer neural networks to mimic aspects of human learning and to solve hard problems such as image recognition and language processing. The video mentions applications of deep learning in speech recognition and natural language processing, showing its importance in modern AI.

💡Backpropagation

Backpropagation is the algorithm used to optimize the weights of a neural network: it computes the gradient of a loss function with respect to the network's parameters and adjusts them to improve learning. The video touches on the development of backpropagation and how it lets networks learn and predict, for example predicting the next symbol when building a language model.

💡Hidden Layers

Hidden layers are the layers of a neural network connected directly to neither the input nor the output; they extract and combine features into higher-level representations. The video discusses the role of hidden layers, especially in early neural network models, and how the name "hidden" was borrowed from hidden Markov models.

💡Embeddings

An embedding is a technique for turning a symbol or word into a fixed-size vector whose geometry captures semantic relationships between words. The video discusses how embeddings are used in language models to understand and generate text.

💡Gradient Descent

Gradient descent is an optimization algorithm that minimizes a loss function by iteratively adjusting parameters in the direction that reduces it. In the video, gradient descent is used together with backpropagation to train neural networks and optimize their performance.
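A minimal sketch of these last two ideas working together, on an invented toy problem: backpropagation computes the gradients via the chain rule, and gradient descent uses them to update the weights. All sizes and values here are made up for illustration:

```python
import numpy as np

# Toy network: x -> hidden layer (tanh) -> scalar output; loss = squared error.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(1, 4))
x, y = rng.normal(size=(3, 1)), np.array([[1.0]])

lr = 0.1  # learning rate, chosen arbitrarily for this toy problem
for step in range(100):
    # Forward pass.
    h = np.tanh(W1 @ x)                 # hidden activations
    y_hat = W2 @ h                      # prediction
    loss = 0.5 * float((y_hat - y) ** 2)

    # Backward pass (chain rule): propagate the error back through the layers.
    d_yhat = y_hat - y                  # dLoss/dy_hat
    dW2 = d_yhat @ h.T                  # gradient for the output weights
    d_h = (W2.T @ d_yhat) * (1 - h ** 2)  # back through the tanh nonlinearity
    dW1 = d_h @ x.T                     # gradient for the input weights

    # Gradient-descent update: step downhill on the loss surface.
    W2 -= lr * dW2
    W1 -= lr * dW1

print(f"final loss: {loss:.6f}")  # should be close to zero
```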

💡Logical Rules

Logical rules are mathematical or computational rules used for inference and decision-making, traditionally applied in computer science to deterministic problems. The video contrasts logical rules with how neural networks learn, stressing that the brain achieves complex cognition by modifying connections rather than by applying rules of inference.

💡Multimodal

Multimodal refers to handling and understanding several kinds of input, such as text, images, and sound. The video discusses how multimodal systems improve a model's understanding of space and objects, and their potential in the future development of AI.

💡Reinforcement Learning

Reinforcement learning is a method by which a machine learns an optimal behavior policy through trial and error. It comes up in the discussion of how systems like AlphaGo and AlphaZero used self-play to surpass human experts at Go.

💡GPU

A GPU (graphics processing unit) is hardware designed for graphics and image computation that also accelerates the large matrix operations at the heart of deep learning. The video describes the key role of GPUs in training large neural networks and how they helped drive progress in AI.

Highlights

At the research unit in England, everyone went off to the pub after six in the evening; at Carnegie Mellon, students were still working in the lab at 9 pm on a Saturday because they believed their work would change the future of computer science.

Studying physiology at Cambridge, Hinton was disappointed with what he learned about the brain: the teaching was limited to how neurons conduct action potentials and never explained how the brain works.

He turned to philosophy for an understanding of the mind, was disappointed again, and finally chose to go to Edinburgh to study artificial intelligence, where at least you could simulate things to test theories.

Donald Hebb's book influenced him strongly, especially the theory of how the connection strengths between neurons are changed by learning.

His early intuition about AI was that the brain surely does not learn by reasoning over preprogrammed logical rules; the task was to figure out how the brain modifies the connections of a neural network.

His main collaboration during the Carnegie Mellon years was with Terry Sejnowski at Johns Hopkins in Baltimore, working together on Boltzmann machines.

Working with the statistician Peter Brown, he learned about hidden Markov models and borrowed their "hidden" terminology for the hidden layers of neural networks.

Ilya Sutskever, a student with strong intuitions and independent thinking, had an important influence on his research.

One of Ilya's key intuitions was that simply making models bigger would make them work better; Hinton at first thought this was a cop-out, but it was later proved right.

In 2011, Hinton, Ilya, and James Martens had notable success predicting Wikipedia text character by character.

Predicting the next word or symbol is not mere prediction, Hinton argues: it requires understanding what came before, and that involves reasoning.

Large language models encode information by finding common structure, an ability that lets them draw creative analogies and inferences.

The AlphaGo example shows that, through reinforcement learning, AI can surpass current knowledge within a specific domain.

Even without reinforcement learning, Hinton believes large neural networks can exceed their training data by correcting its errors themselves.

Multimodal models that add vision, touch, and other senses will be far more capable at spatial understanding.

Language may have evolved to fit the brain, and the brain to fit language; Hinton suggests the two shaped each other.

Early experiments with GPUs for neural network training greatly accelerated machine learning, and Hinton was prescient here.

Hinton's appreciation of digital computation, alongside his exploration of analog computation, reflects how computing technology and AI have advanced together.

Transcripts

In England, at a research unit, it would get to be six o'clock and you'd all go for a drink in the pub. At Carnegie Mellon, I remember, after I'd been there a few weeks it was Saturday night. I didn't have any friends yet and I didn't know what to do, so I decided I'd go into the lab and do some programming, because I had a Lisp machine and you couldn't program it from home. So I went into the lab at about 9:00 on a Saturday night and it was swarming: all the students were there, and they were all there because what they were working on was the future. They all believed that what they did next was going to change the course of computer science. It was just so different from England, and that was very refreshing.

Take me back to the very beginning, Geoff, at Cambridge, trying to understand the brain. What was that like?

It was very disappointing. I did physiology, and in the summer term they were going to teach us how the brain worked, and all they taught us was how neurons conduct action potentials, which is very interesting but doesn't tell you how the brain works. So that was extremely disappointing. I switched to philosophy; I thought maybe they'd tell us how the mind worked. That was very disappointing too. I eventually ended up going to Edinburgh to do AI, and that was more interesting: at least you could simulate things, so you could test out theories.

Do you remember what intrigued you about AI? Was it a paper, or any particular person that exposed you to those ideas?

I guess it was a book I read by Donald Hebb that influenced me a lot. He was very interested in how you learn the connection strengths in neural nets. I also read a book by John von Neumann early on, who was very interested in how the brain computes and how it's different from normal computers.

Did you get the conviction that these ideas would work out at that point? What was your intuition back in the Edinburgh days?

It seemed to me there has to be a way that the brain learns, and it's clearly not by having all sorts of things programmed into it and then using logical rules of inference. That just seemed to me crazy from the outset. So we had to figure out how the brain learned to modify connections in a neural net so that it could do complicated things. Von Neumann believed that; Turing believed that. Von Neumann and Turing were both pretty good at logic, but they didn't believe in this logical approach.

What was your split between studying neuroscience and just doing what seemed to be good algorithms for AI? How much inspiration did you take early on?

I never did that much study of neuroscience. I was always inspired by what I'd learned about how the brain works: there's a bunch of neurons, they perform relatively simple operations, they're nonlinear, but they collect inputs, they weight them, and then they give an output that depends on that weighted input. And the question is: how do you change those weights to make the whole thing do something good? It seems like a fairly simple question.

play03:07

that time the main collaboration I had

play03:09

at Carnegie melon was with someone who

play03:11

wasn't at Carnegie melon I was

play03:13

interacting a lot with Terry sinowski

play03:14

who was in Baltimore at Johns Hopkins

play03:17

and about once a month either he would

play03:19

drive to Pittsburgh or I would drive to

play03:20

Baltimore it's 250 miles away and we

play03:23

would spend a weekend together working

play03:24

on boltzman machines that was a

play03:26

wonderful collaboration we were both

play03:28

convinced it was how the brain worked

play03:29

that was the most exciting research I've

play03:31

ever done and a lot of technical results

play03:33

came out that were very interesting but

play03:35

I think it's not how the brain works um

play03:37

I also had a very good collaboration

play03:39

with um Peter Brown who was a very good

play03:43

statistician and he worked on speech

play03:45

recognition at IBM and then he came as a

play03:49

more mature student to Cary melon just

play03:50

to get a PhD um but he already knew a

play03:53

lot he taught me a lot about speech and

play03:56

he in fact taught me about hidden Markov

play03:58

models I think I learned more from him

play04:00

and then he learned from me that's the

play04:01

kind of student you want and when he

play04:04

taught me about hidden Markov models I

play04:06

was doing back propop with hidden layers

play04:09

only they weren't called hidden layers

play04:10

then and I decided that name they use in

play04:13

Hidden Markov models is a great name for

play04:15

variables that you don't know what

play04:16

they're up to um and so that's where the

play04:20

name hidden in neural NS came from me

play04:23

and P decided that was a great name for

play04:25

the hidden hidden L of neural Mets um

play04:29

but I learned a lot from Peter about

play04:30

speech take us back to when Ilia showed

play04:33

up at your at your office I was in my

play04:36

office I probably on a Sunday um and I

play04:39

was programming I think and there was a

play04:42

knock on the door not just any knock but

play04:44

it won't

play04:45

cutter that's sort of an urgent knock so

play04:47

I went and answer the door and this was

play04:49

this young student there and he said he

play04:51

was cooking Fries over the summer but

play04:53

he'd rather be working in my lab and so

play04:55

I said well why don't you make an

play04:56

appointment and we'll talk and so said

play04:59

how about now

play05:00

and that sort of was Ila's character so

play05:03

we talked for a bit and I gave him a

play05:05

paper to read which was the nature paper

play05:07

on back

play05:08

propagation and we made another meeting

play05:11

for a week later and he came back and he

play05:13

said I didn't understand it and I was

play05:15

very disappointed I thought he seemed

play05:16

like a bright guy but it's only the

play05:19

chain rule it's not that hard to

play05:20

understand and he said oh no no I

play05:22

understood that I just don't understand

play05:24

why you don't give the gradient to a

play05:26

sens a sensible function Optimizer which

play05:29

took us quite a few years to think about

play05:32

um and it kept on like that with AIA he

play05:34

had very good his raw intuitions about

play05:36

things were always very good what do you

play05:38

think had enabled those uh those

play05:41

intuitions for for Ilia I don't know I

play05:44

think he always thought for himself he

play05:46

was always interested in AI from a young

play05:48

age um he's obviously good at math so

play05:52

but it's very hard to know and what was

play05:54

that collaboration between the two of

play05:56

you like what part would you play and

play05:59

what part with play it was a lot of fun

play06:02

um I remember one occasion when we were

play06:06

trying to do a complicated thing with

play06:08

producing maps of data where I had a

play06:11

kind of mixture model so you could take

play06:12

the same bunch of similarities and make

play06:15

two maps so that in one map Bank could

play06:18

be close to Greed and in another map

play06:20

Bank could be close to River um because

play06:23

in one map you can't have it close to

play06:24

both right because River and greed along

play06:26

way so we'd have a mixture maps and

play06:30

we were doing it in mat lab and this

play06:31

involved a lot of reorganization of the

play06:33

code to do the right Matrix multiplies

play06:35

and then got fed up with that so he came

play06:37

one day and said um I'm going to write a

play06:40

an interface for matlb so I program in

play06:43

this different language and then I have

play06:45

something that just converts it into

play06:46

Matlab and I said no Ilia um that'll

play06:49

take you a month to do we've got to get

play06:51

on with this project don't get diverted

play06:53

by that and said it's okay I did it this

play06:58

morning and that's that's quite quite

play07:00

incredible and throughout those those

play07:03

years the biggest shift wasn't

play07:06

necessarily just the the algorithms but

play07:08

but also the the skill how did you sort

play07:11

of view that skill uh over over the

play07:15

years Ilia got that intuition very early

play07:17

so Ilia was always preaching that um you

play07:21

just make it bigger and it'll work

play07:22

better and I always thought that was a

play07:24

bit of a copout that you're going to

play07:25

have to have new ideas too it turns out

play07:28

I was basically right new ideas help

play07:30

things like Transformers helped a lot

play07:32

but it was really the scale of the data

play07:35

and the scale of the computation and

play07:36

back then we had no idea computers would

play07:39

get like a billion times faster we

play07:41

thought maybe they' get a hundred times

play07:42

faster we were trying to do things by

play07:45

coming up with clever ideas that would

play07:46

have just solved themselves if we had

play07:48

had bigger scale of the data and

play07:49

computation in about

play07:51

2011 Ilia and another graduate student

play07:54

called James Martins and I had a paper

play07:57

using character level prediction so we

play07:59

took Wikipedia and we tried to predict

play08:01

the next HTML character and that worked

play08:05

remarkably well and we were always

play08:07

amazed at how well it worked and that

play08:10

was using a fancy Optimizer on

play08:13

gpus and we could never quite believe

play08:16

that it understood anything but it

play08:18

looked as though it

play08:19

understood and that just seemed

play08:21

incredible can you take us through how

play08:24

are do models Trend to predict the next

play08:28

word and

play08:30

why is it the wrong way of of thinking

play08:32

about them okay I don't actually believe

play08:35

it is the wrong way so in fact I think I

play08:38

made the first neuronet language model

play08:41

that used embeddings and back

play08:42

propagation so it's very simple data

play08:45

just

play08:45

triples and it was turning each symbol

play08:49

into an embedding then having the

play08:51

embeddings interact to predict the

play08:53

embedding of the next symbol and then

play08:55

from that predict the next symbol and

play08:57

then it was back propagating through

play08:58

that whole process to learn these

play09:00

triples and I showed it could generalize

play09:03

um about 10 years later yosua Benji used

play09:06

a very similar Network and showed it

play09:07

work with real text and about 10 years

play09:09

after that linguist started believing in

play09:12

embeddings it was a slow process the

play09:14

reason I think it's not just predicting

play09:17

the next symbol is if you ask well what

play09:19

does it take to predict the next symbol

play09:21

particularly if you ask me a question

play09:24

and then the first word of the answer is

play09:27

the next symbol um you have to

play09:30

understand the question so I think by

play09:33

predicting the next

play09:34

symbol it's very unlike old fashioned

play09:37

autocomplete oldfashioned autocomplete

play09:39

You' store sort of triples of words and

play09:42

then if you store a pair of words you

play09:44

see how often different words came third

play09:46

and that way you can predict the next

play09:47

symbol and that's what most people think

play09:49

autocomplete is like it's no longer at

play09:52

all like that um to predict the next

play09:54

symbol you have to understand what's

play09:55

been said so I think you're forcing it

play09:57

to understand by making it predict the

play09:59

next symbol and I think it's

play10:01

understanding in much the same way we

play10:03

are so a lot of people will tell you

play10:05

these things aren't like us um they're

play10:07

just predicting the next symbol they're

play10:09

not reasoning like us but actually in

play10:12

order to predict the next symbol it's

play10:13

have going to have to do some reasoning

play10:15

and we've seen now that if you make big

play10:17

ones without putting in any special

play10:19

stuff to do reasoning they can already

play10:21

do some reasoning and I think as you

play10:22

make them bigger they're going to be

play10:23

able to do more and more reasoning do

play10:25

you think I'm doing anything else than

play10:27

predicting the next symbol right now I

play10:29

think that's how you're learning I think

play10:31

you're predicting the next video frame

play10:34

um you're predicting the next sound um

play10:37

but I think that's a pretty plausible

play10:39

theory of how the brain is learning what
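The triples model Hinton describes can be sketched in a few lines: each symbol gets an embedding, the context embeddings interact to produce scores for the next symbol, and backpropagation trains the embeddings and the combining weights together. This is an illustrative reconstruction with invented sizes and data, not his original model:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 5, 8            # vocabulary size and embedding dimension (arbitrary)
E = rng.normal(0, 0.1, (V, D))        # one embedding per symbol
W = rng.normal(0, 0.1, (2 * D, V))    # combines two context embeddings -> next-symbol logits

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Tiny corpus of (symbol, symbol -> next symbol) triples, invented for illustration.
triples = [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 0)]

lr = 0.5
for epoch in range(500):
    for a, b, target in triples:
        ctx = np.concatenate([E[a], E[b]])   # embeddings interact (here: concatenation)
        p = softmax(ctx @ W)                 # predicted distribution over the next symbol
        # Cross-entropy gradient, backpropagated into W and both context embeddings.
        d_logits = p.copy(); d_logits[target] -= 1.0
        d_ctx = W @ d_logits
        W -= lr * np.outer(ctx, d_logits)
        E[a] -= lr * d_ctx[:D]
        E[b] -= lr * d_ctx[D:]

ctx = np.concatenate([E[1], E[2]])
print("predicted next symbol after (1, 2):", int(np.argmax(softmax(ctx @ W))))  # expect 3
```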

What enables these models to learn such a wide variety of fields?

What these big language models are doing is looking for common structure, and by finding common structure they can encode things using the common structure, and that's more efficient. Let me give you an example. If you ask GPT-4, "why is a compost heap like an atom bomb?", most people can't answer that. Most people haven't thought about it; they think atom bombs and compost heaps are very different things. But GPT-4 will tell you: well, the energy scales are very different and the time scales are very different, but the thing that's the same is that when the compost heap gets hotter, it generates heat faster, and when the atom bomb produces more neutrons, it produces more neutrons faster. So it gets the idea of a chain reaction, and I believe it's understood they're both forms of chain reaction, and it's using that understanding to compress all that information into its weights. And if it's doing that, then it's going to be doing that for hundreds of things where we haven't seen the analogies yet but it has. That's where you get creativity from: from seeing these analogies between apparently very different things. So I think GPT-4 is going to end up, when it gets bigger, being very creative. This idea that it's just regurgitating what it's learned, just pasting together text it's learned already, is completely wrong. It's going to be even more creative than people.

I think you'd argue that it won't just repeat the human knowledge we've developed so far, but could also progress beyond that. I think that's something we haven't quite seen yet; we've started seeing some examples of it, but to a large extent we're still at the current level of science. What do you think will enable it to go beyond that?

Well, we've seen that in more limited contexts. If you take AlphaGo, in that famous match with Lee Sedol, there was move 37, where AlphaGo made a move that all the experts said must have been a mistake, but actually later they realized it was a brilliant move. So that was creativity within that limited domain. I think we'll see a lot more of that as these things get bigger.

The difference with AlphaGo as well was that it was using reinforcement learning, which subsequently enabled it to go beyond the current state: it started with imitation learning, watching how humans play the game, and then through self-play it developed way beyond that. Do you think that's the missing component of current models?

I think that may well be a missing component, yes. The self-play in AlphaGo and AlphaZero is a large part of why they could make these creative moves, but I don't think it's entirely necessary. There's a little experiment I did a long time ago where you're training a neural net to recognize handwritten digits (I love that example, the MNIST example) and you give it training data where half the answers are wrong. The question is, how well will it learn? And you make half the answers wrong once and keep them like that, so it can't average away the wrongness by seeing the same example sometimes with the right answer and sometimes with the wrong answer: for half of the examples, whenever it sees the example, the answer is always wrong. So the training data has 50% error, but if you train up backpropagation, it gets down to 5% error or less. In other words, from badly labeled data it can get much better results. It can see that the training data is wrong. And that's how smart students can be smarter than their advisor: their advisor tells them all this stuff, and for half of what their advisor tells them they think, no, rubbish, and they listen to the other half, and then they end up smarter than the advisor. So these big neural nets can actually do much better than their training data, and most people don't realize that.
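The corrupted-label experiment is easy to reproduce in spirit. The sketch below uses a synthetic four-class blob dataset and softmax regression instead of MNIST and a deep net, purely to stay self-contained; the point is the setup: half the training labels are made wrong once and kept that way, yet error against the true labels ends up far below 50%:

```python
import numpy as np

rng = np.random.default_rng(0)
C, D, N = 4, 2, 2000                     # classes, input dims, training examples (arbitrary)
centers = rng.normal(0, 5, (C, D))       # one Gaussian blob per class

def make_data(n):
    y = rng.integers(0, C, n)
    x = centers[y] + rng.normal(0, 1.0, (n, D))
    return x, y

x_train, y_train = make_data(N)
x_test, y_test = make_data(1000)

# Corrupt half the training labels ONCE, to a random wrong class, and keep them that
# way, so the wrongness cannot be averaged away across repeated presentations.
noisy = y_train.copy()
bad = rng.random(N) < 0.5
noisy[bad] = (noisy[bad] + rng.integers(1, C, bad.sum())) % C

W, b = np.zeros((D, C)), np.zeros(C)
for _ in range(1000):                    # plain gradient descent on cross-entropy
    logits = x_train @ W + b
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    g = p.copy(); g[np.arange(N), noisy] -= 1.0; g /= N
    W -= 0.05 * (x_train.T @ g); b -= 0.05 * g.sum(axis=0)

test_err = (np.argmax(x_test @ W + b, axis=1) != y_test).mean()
print(f"training labels wrong: ~50%  |  test error vs true labels: {test_err:.1%}")
```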

So how do you expect these models to add reasoning? One approach is to add heuristics on top of them, which a lot of the research is doing now, where you have chain of thought and you just feed its reasoning back into itself. Another way would be in the model itself, as you scale it up. What's your intuition around that?

My intuition is that as we scale up these models, they get better at reasoning. If you ask how people work, roughly speaking, we have these intuitions, and we can do reasoning, and we use the reasoning to correct our intuitions. Of course, we use the intuitions during the reasoning to do the reasoning, but if the conclusion of the reasoning conflicts with our intuitions, we realize the intuitions need to be changed. That's much like AlphaGo or AlphaZero, where you have an evaluation function that just looks at a board and says how good that is for me, but then you do the Monte Carlo rollout, and now you get a more accurate idea and you can revise your evaluation function. So you can train it by getting it to agree with the results of reasoning. And I think these large language models have to start doing that: they have to start training their raw intuitions about what should come next by doing reasoning and realizing that's not right. That way they can get more training data than just mimicking what people did, and that's exactly why AlphaGo could do this creative move 37: it had much more training data, because it was using reasoning to check out what the right next move should have been.
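The loop Hinton describes (fast intuition corrected by slow reasoning) can be sketched schematically: a cheap evaluation function is regressed toward the verdict of a more expensive rollout. Everything below is invented scaffolding; a real AlphaZero-style system is vastly more elaborate:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, 16)              # linear "intuition": a fast board evaluation
true_w = rng.normal(0, 1, 16)           # hidden ground truth, used only by the rollout stub

def intuition(board):
    return float(W @ board)              # one quick glance at the position

def rollout_value(board):
    # Stand-in for slow reasoning (Monte Carlo rollouts / search): a more
    # accurate but noisy estimate of how good the position really is.
    return float(true_w @ board) + rng.normal(0, 0.05)

lr = 0.01
for step in range(5000):
    board = rng.normal(0, 1, 16)                 # a random position
    target = rollout_value(board)                # what the reasoning concluded
    err = intuition(board) - target
    W -= lr * err * board                        # revise intuition toward the reasoning

test = rng.normal(0, 1, 16)
print(f"intuition: {intuition(test):+.3f}   reasoning: {rollout_value(test):+.3f}")
```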

And what do you think about multimodality? We spoke about these analogies, and often the analogies are way beyond what we could see; it's discovering analogies far beyond humans, maybe at abstraction levels we will never be able to understand. When we introduce images, video, and sound, how do you think that will change the models, and how do you think it will change the analogies they can make?

I think it'll change them a lot. I think it'll make them much better at understanding spatial things, for example. From language alone it's quite hard to understand some spatial things, although remarkably GPT-4 can do that even before it was multimodal. But when you make it multimodal, if you have it both doing vision and reaching out and grabbing things, it'll understand objects much better if it can pick them up and turn them over and so on. So although you can learn an awful lot from language, it's easier to learn if you're multimodal, and in fact you then need less language. And there's an awful lot of YouTube video for predicting the next frame, or something like that. So I think these multimodal models are clearly going to take over. You can get more data that way, and they need less language. So there's really a philosophical point: you could learn a very good model from language alone, but it's much easier to learn it from a multimodal system.

And how do you think it will impact the models' reasoning?

I think it'll make them much better at reasoning about space, for example, reasoning about what happens if you pick objects up. If you actually try picking objects up, you're going to get all sorts of training data that's going to help.

Do you think the human brain evolved to work well with language, or do you think language evolved to work well with the human brain?

I think the question of whether language evolved to work with the brain, or the brain evolved to work with language, is a very good question. I think both happened. I used to think we would do a lot of cognition without needing language at all; now I've changed my mind a bit. Let me give you three different views of language and how it relates to cognition. There's the old-fashioned symbolic view, which is that cognition consists of having strings of symbols in some kind of cleaned-up logical language where there's no ambiguity, and applying rules of inference. That's what cognition is: just symbolic manipulations on things that are like strings of language symbols. That's one extreme view. An opposite extreme view is: no, no, once you get inside the head it's all vectors. Symbols come in, you convert those symbols into big vectors, all the stuff inside is done with vectors, and then if you want to produce output, you produce symbols again. There was a point in machine translation, in about 2014, when people were using recurrent neural nets: words would keep coming in, they'd have a hidden state, and they'd keep accumulating information in this hidden state, so when they got to the end of a sentence they'd have a big hidden vector that captured the meaning of that sentence, which could then be used for producing the sentence in another language. That was called a thought vector, and it's a sort of second view of language: you convert the language into a big vector that's nothing like language, and that's what cognition is all about. But then there's a third view, which is what I believe now: you take these symbols and convert them into embeddings, and you use multiple layers of that, so you get these very rich embeddings, but the embeddings are still tied to the symbols, in the sense that you've got a big vector for this symbol and a big vector for that symbol. And these vectors interact to produce the vector for the symbol of the next word, and that's what understanding is. Understanding is knowing how to convert the symbols into these vectors, and knowing how the elements of the vectors should interact to predict the vector for the next symbol. That's what understanding is, both in these big language models and in our brains. It's an example that's sort of in between: you're staying with the symbols, but you're interpreting them as these big vectors, and that's where all the work is. All the knowledge is in which vectors you use and how the elements of those vectors interact, not in symbolic rules. But it's not saying you get away from the symbols altogether; it's saying you turn the symbols into big vectors, but you stay with the surface structure of the symbols. And that's how these models are working, and that now seems to me a more plausible model of human thought too.
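The 2014 "thought vector" picture can be sketched in a few lines: words stream into a recurrent net, the hidden state accumulates information, and the final hidden vector summarizes the sentence. The weights and sizes below are arbitrary and untrained; in translation, a decoder would generate the target sentence from that vector:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, H = 10, 6, 8                      # vocab, embedding size, hidden size (arbitrary)
E = rng.normal(0, 0.1, (V, D))          # word embeddings
Wx = rng.normal(0, 0.1, (H, D))         # input-to-hidden weights
Wh = rng.normal(0, 0.1, (H, H))         # hidden-to-hidden (recurrent) weights

def thought_vector(sentence):
    """Fold a sentence, word by word, into one hidden vector."""
    h = np.zeros(H)
    for word_id in sentence:
        h = np.tanh(Wx @ E[word_id] + Wh @ h)   # accumulate information
    return h    # the "thought vector" that captures the whole sentence

print(thought_vector([3, 1, 4, 1, 5]).round(3))
```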

You were one of the first folks to get the idea of using GPUs, and I know Jensen loves you for that. Back in 2009, you told Jensen that this could be quite a good idea for training neural nets. Take us back to that early intuition of using GPUs for training neural nets.

So actually, I think in about 2006, I had a former graduate student called Rick Szeliski, who's a very good computer vision guy, and I talked to him at a meeting, and he said, you know, you ought to think about using graphics processing cards, because they're very good at matrix multiplies, and what you're doing is basically all matrix multiplies. So I thought about that for a bit, and then we learned about these Tesla systems that had four GPUs in them. Initially we just got gaming GPUs and discovered they made things go 30 times faster, and then we bought one of these Tesla systems with 4 GPUs and did speech on that, and it worked very well. Then in 2009 I gave a talk at NIPS, and I told a thousand machine learning researchers: you should all go and buy Nvidia GPUs, they're the future, you need them for doing machine learning. And I actually then sent mail to Nvidia saying, I told a thousand machine learning researchers to buy your boards, could you give me a free one? And they said no. Actually, they didn't say no; they just didn't reply. But when I told Jensen this story later on, he gave me a free one.
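The observation that neural net training is "basically all matrix multiplies" is also why the speedups were so large. A rough way to see it, assuming PyTorch and a CUDA-capable GPU are available (timings will vary widely with hardware):

```python
import time
import numpy as np
import torch

n = 4096
a_cpu = np.random.rand(n, n).astype(np.float32)
b_cpu = np.random.rand(n, n).astype(np.float32)

t0 = time.perf_counter()
np.matmul(a_cpu, b_cpu)                       # CPU matrix multiply
cpu_s = time.perf_counter() - t0

if torch.cuda.is_available():                 # GPU path, if a device is present
    a = torch.from_numpy(a_cpu).cuda()
    b = torch.from_numpy(b_cpu).cuda()
    torch.matmul(a, b); torch.cuda.synchronize()   # warm-up run, not timed
    t0 = time.perf_counter()
    torch.matmul(a, b)
    torch.cuda.synchronize()                  # wait for the kernel to finish
    gpu_s = time.perf_counter() - t0
    print(f"CPU: {cpu_s:.3f}s   GPU: {gpu_s:.3f}s   speedup: {cpu_s / gpu_s:.0f}x")
else:
    print(f"CPU: {cpu_s:.3f}s   (no CUDA device found)")
```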

That's very good. I think what's also interesting is how GPUs have evolved alongside the field. Where do you think we should go next in compute?

In my last couple of years at Google, I was thinking about ways of trying to make analog computation, so that instead of using, like, a megawatt, we could use, like, 30 watts, like the brain, and we could run these big language models in analog hardware. I never made it work, but I started really appreciating digital computation. If you're going to use that low-power analog computation, every piece of hardware is going to be a bit different, and the idea is that the learning is going to make use of the specific properties of that hardware. And that's what happens with people: all our brains are different, so we can't take the weights in your brain and put them in my brain. The hardware is different, the precise properties of the individual neurons are different; the learning has learned to make use of all that. So we're mortal, in the sense that the weights in my brain are no good for any other brain. When I die, those weights are useless. We can get information from one to another rather inefficiently: I produce sentences, and you figure out how to change your weights so you would have said the same thing. That's called distillation, but it's a very inefficient way of communicating knowledge. Digital systems, by contrast, are immortal, because once you've got some weights, you can throw away the computer, just store the weights on a tape somewhere, build another computer, put those same weights in, and if it's digital, it can compute exactly the same thing as the other system did. So digital systems can share weights, and that's incredibly much more efficient. If you've got a whole bunch of digital systems, and they each go and do a tiny bit of learning, starting with the same weights, they do a tiny bit of learning and then share their weights again, and they all know what all the others learned. We can't do that. So they're far superior to us in being able to share knowledge.
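The weight-sharing advantage Hinton ends on can be sketched as a toy: identical digital copies each take a gradient step on their own data, then average their weights, so every copy knows what the others learned. The task and sizes here are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = rng.normal(0, 1, 10)                    # target each copy is trying to learn

def grad_step(w, lr=0.1):
    """One tiny bit of learning on a fresh batch of private data."""
    x = rng.normal(0, 1, (32, 10))
    y = x @ true_w
    err = x @ w - y
    return w - lr * (x.T @ err) / len(x)         # gradient step on mean squared error

replicas = [np.zeros(10) for _ in range(4)]      # identical starting weights
for round_ in range(50):
    replicas = [grad_step(w) for w in replicas]  # each learns on its own data
    shared = sum(replicas) / len(replicas)       # share: average the weights
    replicas = [shared.copy() for _ in replicas] # now everyone knows what everyone learned

print(f"distance to target after sharing: {np.linalg.norm(replicas[0] - true_w):.4f}")
```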


Related Tags
Artificial Intelligence, Neural Networks, Machine Learning, Deep Learning, Brain Simulation, Learning Algorithms, Cognitive Science, Technological Innovation, GPU Computing, Multimodal Learning