Geoffrey Hinton: The Foundations of Deep Learning

Elevate
7 Feb 2018 · 28:21

Summary

TLDR: This transcript introduces the fundamentals of neural networks and the backpropagation algorithm, along with their applications across several domains. It first explains how backpropagation works, contrasting it with an evolution-style procedure: instead of mutating one weight at a time, backpropagation uses calculus to compute, for every weight at once, how a small change would affect the output. It then shows the algorithm's successes in image recognition, speech recognition, and machine translation, noting in particular how graduate students at the University of Toronto used neural networks to improve speech recognition systems and ultimately surpassed the state of the art. It also covers the dramatic success of convolutional neural networks in image recognition and the advantages of recurrent neural networks for sequence data. Finally, it explores the future potential of neural networks, including the possibility of outperforming radiologists in medical image analysis and predicting molecular activity in drug design. Throughout, the talk emphasizes the importance of neural networks in modern AI and their ability to solve complex problems.

Takeaways

  • 🤖 Neural networks mimic the brain's networks with a learning algorithm, solving problems by adjusting weights rather than by explicit programming.
  • 📈 The backpropagation algorithm is the core of neural network training: it computes the gradient of the loss function to update the network's weights.
  • 🔍 Neural networks achieved breakthrough results in image and speech recognition, surpassing traditional algorithms.
  • 🎓 Graduate students at the University of Toronto applied neural networks to speech recognition and significantly improved recognition performance.
  • 📚 Neural networks excel when given large amounts of labeled data, especially with sufficient computing power.
  • 📈 Neural networks won the image recognition competition decisively, with an error rate far below traditional computer vision systems.
  • 🔁 Recurrent neural networks (RNNs) suit sequence data such as language and video, remembering information through their self-connections.
  • 🌐 In machine translation, neural networks use an encoder-decoder model to translate one language into another.
  • 📖 Neural networks show the potential to surpass radiologists in medical image analysis, diagnosing disease more accurately.
  • 🧠 Training neural networks for translation requires little linguistic knowledge; it relies instead on large amounts of labeled data.
  • 🏆 A neural network won a competition for predicting drug-molecule binding, demonstrating its potential in chemistry.

Q & A

  • What is the backpropagation algorithm?

    -Backpropagation is an algorithm for training and optimizing the weights of a neural network. It computes the gradient of the loss function with respect to each weight in the network and uses those gradients to update the weights, reducing the network's prediction error.
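The answer above can be sketched concretely. The following is a minimal illustration, not code from the talk: a tiny one-hidden-layer network in plain Python with a squared-error loss and hand-derived gradients; the network sizes, learning rate, and toy target are all invented for the example.

```python
import random

def relu(x): return x if x > 0.0 else 0.0
def relu_grad(x): return 1.0 if x > 0.0 else 0.0

# Tiny net: 2 inputs -> 2 hidden units (ReLU) -> 1 linear output.
random.seed(0)
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
W2 = [random.uniform(-1, 1) for _ in range(2)]

def forward(x):
    z = [sum(W1[j][i] * x[i] for i in range(2)) for j in range(2)]
    h = [relu(zj) for zj in z]
    y = sum(W2[j] * h[j] for j in range(2))
    return z, h, y

# Toy data: learn the target y = x0 + x1 on a small grid.
data = [([a / 4.0, b / 4.0], (a + b) / 4.0) for a in range(5) for b in range(5)]

lr = 0.05
for epoch in range(2000):
    for x, t in data:
        z, h, y = forward(x)
        dy = y - t  # gradient of the loss 0.5 * (y - t)^2 w.r.t. y
        # Send the error backwards: output weights first, then hidden weights.
        dW2 = [dy * h[j] for j in range(2)]
        dh = [dy * W2[j] for j in range(2)]
        dz = [dh[j] * relu_grad(z[j]) for j in range(2)]
        for j in range(2):
            W2[j] -= lr * dW2[j]
            for i in range(2):
                W1[j][i] -= lr * dz[j] * x[i]

_, _, pred = forward([0.5, 0.25])
print(round(pred, 2))
```

The backward pass mirrors the forward pass in reverse, which is the "sending information backwards through the net" that the talk describes.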

  • Why is traditional programming less effective than neural networks on some problems?

    -Traditional programming requires a human programmer to tell the computer in detail how to perform a task, which is very hard for problems that are complex or that humans themselves do not fully understand. A neural network, given a learning algorithm and plenty of example data, can discover how to solve the problem on its own, with no detailed programming by humans.

  • How does an artificial neuron work?

    -An artificial neuron receives input signals from sensors or from other neurons. Each input line has a weight, which can be positive or negative. The neuron multiplies each input value by its weight and sums the results to get a total input. It then outputs a nonlinear function of that total input: if the total input is not large enough, the neuron stays silent; once the total input passes a threshold, the neuron starts to respond, and its output grows as the total input grows.
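The behaviour described above, a weighted sum followed by a threshold nonlinearity, fits in one function. This is a minimal sketch; the function name and the numbers are made up for illustration.

```python
def artificial_neuron(inputs, weights, threshold=0.0):
    """Weighted sum of the inputs, then a ReLU-style nonlinearity:
    silent below the threshold, increasing response above it."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return max(0.0, total - threshold)

print(artificial_neuron([0.5, 1.0], [2.0, -0.5]))  # total = 0.5 -> output 0.5
```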

  • Why was backpropagation abandoned in the 1990s?

    -In the 1990s, data sets were relatively small, other algorithms performed better, and those algorithms came with mathematical proofs. Backpropagation had no such guarantees, and different people running the same algorithm got different results, so machine learning researchers lost interest in it.

  • Why has backpropagation become popular again in recent years?

    -With the growth of big data and computing power, backpropagation performs extremely well given large amounts of labeled data and strong compute. In addition, technical advances, particularly by researchers in Toronto and Montreal, let backpropagation achieve striking results on tasks such as image and speech recognition.

  • How do neural networks work in image recognition?

    -A neural network processes the pixel data of an image through multiple layers of hidden units. Each layer extracts different features of the image and passes them to the next layer. In this way the network can recognize the objects in an image and turn millions of pixel values into words describing its content.

  • What is a recurrent neural network (RNN)?

    -A recurrent neural network is a network designed for sequence data. By introducing loops into the network, it keeps a memory of earlier information, which lets it handle tasks with sequential dependencies, such as time series and natural language.
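The "loop" in the answer above is just a hidden state fed back in at each step. Here is a minimal sketch of one recurrent step, with invented weights and sizes; real RNNs (for example LSTMs) are considerably more elaborate.

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh):
    """One step of a simple recurrent net: the new hidden state mixes the
    current input with the previous hidden state, so information accumulates."""
    h = []
    for j in range(len(h_prev)):
        total = sum(W_xh[j][i] * x[i] for i in range(len(x)))
        total += sum(W_hh[j][k] * h_prev[k] for k in range(len(h_prev)))
        h.append(math.tanh(total))
    return h

# Run a 3-step sequence through a 2-unit hidden state.
W_xh = [[0.5], [-0.3]]           # input -> hidden weights (1 input, 2 hidden)
W_hh = [[0.1, 0.2], [0.0, 0.4]]  # hidden -> hidden (the "memory" connections)
h = [0.0, 0.0]
for x_t in ([1.0], [0.0], [1.0]):
    h = rnn_step(x_t, h, W_xh, W_hh)
print(h)
```

Unrolled over time, these steps form a directed acyclic graph, which is why ordinary backpropagation can be pushed backwards through them.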

  • How is machine translation done with neural networks?

    -An encoder network first turns a sentence in one language into a high-dimensional "thought" vector, and a decoder network then turns that vector into a sentence in another language. The method needs no hand-coded linguistic knowledge; it is learned entirely from data.

  • What is the attention mechanism?

    -Attention is an extra module added to a network that lets it focus on specific parts of the input while generating a translation or performing another task. It improves translation accuracy and lets the network train effectively on much less data.
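The "focusing on specific parts of the input" can be sketched as a softmax-weighted average over the encoder's hidden states, in the standard dot-product form. This is an illustrative sketch, not the exact model from the talk; all names and numbers are invented.

```python
import math

def attention(query, keys, values):
    """Score each encoder state (key) against the decoder's query,
    softmax the scores, and return the weighted mix of the values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context

# Three encoder hidden states; the query matches the second one most strongly.
keys = values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w, ctx = attention([0.0, 2.0], keys, values)
print([round(x, 2) for x in w])
```

The weights sum to one, so the decoder attends mostly to the encoder state it scores highest, which is the learned "where to look" the talk describes.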

  • What are the prospects for neural networks in medical image analysis?

    -Neural networks have great potential in medical image analysis. By learning from large collections of medical images they can recognize and diagnose disease, and they are expected to surpass radiologists before long.

  • How are neural networks applied in drug discovery?

    -Neural networks can predict whether a molecule will bind to a particular drug target, which matters greatly for drug discovery. By analyzing molecular structures and properties, they can help pharmaceutical companies screen promising candidates without actually synthesizing the molecules.

Outlines

00:00

🤖 Introduction to Neural Networks and Backpropagation

This section introduces the basic concepts of neural networks and the backpropagation algorithm. It contrasts traditional programming with modern learning algorithms, stressing the new approach of training computers by example rather than by detailed instruction. An image recognition example shows what neural networks can do, and the workings of an artificial neuron are explained, including input lines, weights, and adaptation by changing weights. Finally, it shows how an evolution-style algorithm could adjust the network's connections and introduces backpropagation, which uses calculus to adjust the weights efficiently.

05:02

📈 The Revival and Applications of Backpropagation

This section covers backpropagation's decline in the 1990s and its later revival. The algorithm was abandoned because data sets were small and other algorithms worked better, but as data and compute grew it began to show remarkable results. It recounts how University of Toronto students applied it to speech recognition with breakthrough results, its subsequent use in Android, the achievements of two students in image recognition, and the algorithm's wide adoption across computer vision and speech recognition.

10:03

🔄 Recurrent Neural Networks and Sequence Processing

This section explains how recurrent neural networks (RNNs) work and where they are used. RNNs are well suited to sequence data such as speech or video. Inputs (words or image frames) are fed to the network, and connections among the hidden units form a memory, letting the network accumulate information. Backpropagation then trains the network by comparing its outputs to the desired answers and adjusting the weights. It also describes progress in machine translation with RNNs: an encoder network turns a sentence in one language into a "thought", and a decoder network turns that thought into a sentence in another language.

15:03

🌐 Thoughts, Language, and Machine Translation

This section looks more closely at how machine translation works, in particular how the encoder-decoder model translates one language into another. It argues that the traditional view of translation, converting one string of symbols into another, is mistaken; the real process is richer. It explains how a network turns words into vector representations and combines them into a "thought", from which the decoder generates a sentence in the target language. It also discusses the technology behind Google Translate and how training neural networks improves translation quality.

20:06

📚 Attention and Multimodal Learning

This section describes how the attention mechanism improves neural network performance, especially in machine translation: letting the network look back at the source language while generating the target language raises accuracy. It also discusses translating with language fragments of different sizes, and how, when bitmaps of letters or Chinese characters are used as input, the network learns the internal structure of the characters. It then explores combining image recognition with language generation, turning an image into a percept and then into descriptive text, and closes with the number of parameters future networks might need and their potential in areas such as medical image analysis.

25:07

🧠 The Future of Neural Networks and Their Challenges

This section discusses the potential of neural networks in medical image analysis and the likelihood that they will soon surpass radiologists' diagnostic ability. Even when doctors disagree on a diagnosis, a network can still learn from training data and deliver more accurate diagnoses. It closes with a story about predicting drug-molecule binding, underlining that neural networks can achieve striking results without domain knowledge, and the challenges and opportunities they may face.

Keywords

💡Backpropagation

Backpropagation is a supervised learning technique for training neural network models. It computes the gradient of the loss function with respect to the network's parameters and uses those gradients to update the parameters, minimizing the loss. In the video, backpropagation trains the neural networks behind image recognition, speech recognition, and machine translation, and is one of the central algorithms discussed.

💡Neural network

A neural network is a computational model inspired by the way neurons connect in the human brain, built from interconnected neurons (or nodes). In the video, neural networks process complex data patterns such as images and speech, learning the internal relationships in the input data to perform classification and recognition tasks.

💡Convolutional neural network

A convolutional neural network (CNN) is a neural network specialized for image data. Convolutional layers extract image features, which are then used for object recognition. In the video, CNNs handle image recognition tasks such as identifying the objects in a picture.
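The feature extraction a convolutional layer performs can be sketched as sliding a small kernel over the image. This is an illustrative toy, with an invented image and kernel; note that deep learning libraries typically implement cross-correlation under the name "convolution", as here.

```python
def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as in most CNN libraries):
    slide the kernel over the image, taking weighted sums, to get a feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

# A vertical-edge kernel responds where the image steps from 0 to 1.
img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
edge = [[-1, 1]]
print(conv2d(img, edge))  # -> [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
```

In a real CNN the kernels are not hand-designed like this edge detector; they are learned by backpropagation.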

💡Recurrent neural network

A recurrent neural network (RNN) processes sequence data and can remember earlier inputs, making it well suited to time series and natural language. In the video, RNNs power speech recognition and machine translation, showing their strength on sequential data.

💡Machine translation

Machine translation is the automatic translation of one natural language into another by a computer program. In the video, neural networks trained with backpropagation achieve high-quality machine translation, an important application of artificial intelligence.

💡Image recognition

Image recognition is the process by which a computer analyzes image data to identify and understand its content. In the video, convolutional neural networks perform image recognition, turning an image's pixels into words describing its content.

💡Speech recognition

Speech recognition is the conversion of human speech into text by a computer. In the video, neural networks trained with backpropagation recognize speech signals, turning sound waves into the corresponding text.

💡Deep learning

Deep learning is a branch of machine learning that uses multi-layer neural networks to model how the brain processes data, recognizing patterns and classifying data. In the video, deep learning underlies image and speech recognition, machine translation, and the other advanced capabilities discussed.

💡Gradient descent

Gradient descent is an optimization algorithm that minimizes a loss function by adjusting the network's parameters toward the loss function's minimum. In the video, gradient descent works together with backpropagation to train neural networks and optimize their performance.
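The update rule this keyword describes fits in a few lines. A minimal one-dimensional sketch with an invented loss: L(w) = (w - 3)^2, whose gradient is 2*(w - 3).

```python
# Gradient descent on L(w) = (w - 3)^2: repeatedly step against the gradient.
w = 0.0
lr = 0.1
for _ in range(100):
    grad = 2.0 * (w - 3.0)  # dL/dw
    w -= lr * grad
print(round(w, 4))  # converges toward the minimum at w = 3
```

In a neural network, backpropagation supplies the gradient for every weight, and this same update is applied to all of them at once.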

💡Weights

In a neural network, a weight is the strength of the connection between an input signal and a neuron. Adjusting the weights is central to the learning process. In the video, backpropagation adjusts the weights to improve the network's response to its inputs.

💡Activation function

An activation function introduces nonlinearity at a neuron's output, letting the network model more complex functions. In the video, the activation function determines whether a neuron fires and how large its output signal is.

Highlights

Introduces the basics of the backpropagation algorithm and explains how learning happens by changing the weights in a neural network.

Shows the power of neural networks by demonstrating how they turn image pixels into descriptive text.

Discusses how an artificial neuron works: input lines, weights, and a nonlinear output function.

Shows how a network is improved step by step by adjusting weights against a small batch of training examples.

Uses calculus to make weight adjustment efficient, so that all weights can be adjusted at once.

Although backpropagation was abandoned in the 1990s, it became effective again as data sets grew and computing power increased.

Graduate students at the University of Toronto applied backpropagation to speech recognition and achieved a breakthrough.

Demonstrates the potential of neural networks for pattern recognition through their applications in image and speech recognition.

Introduces recurrent neural networks (RNNs) and their advantages for processing sequence data.

Describes how RNNs perform machine translation, turning sentences in one language into another.

Google Translate uses an RNN-based system that improves translation quality by learning from large amounts of translation data.

Introduces the "attention" mechanism, which lets the translation system look back at the structure of the source language while generating the target language, improving accuracy.

Shows how an image recognition network can be combined with a translation system to generate descriptive text for images.

Discusses the potential of neural networks in medical image analysis and the possibility that they will surpass radiologists.

Demonstrates the potential of neural networks in chemistry through a story about predicting drug-molecule binding.

Stresses that even when labels are imperfect, neural networks can learn to exceed the limits of their training data.

Suggests that neural networks may need a number of parameters comparable to the brain's to reach similar capability.

Discusses recent advances in natural language and image processing with neural networks and their implications for future technology.

Transcripts

play00:03

[Music]

play00:07

[Applause]

play00:13

I'm going to talk about some basic share

play00:17

and I imagine the some people in the

play00:19

audience who don't really have a good

play00:21

grip of what the backpropagation

play00:22

algorithm is so I'm actually going to

play00:24

explain that very briefly so you know

play00:26

what we're talking about and now I'm

play00:27

sure a few examples of what it can do

play00:29

and these are all things that are now a

play00:31

little bit out of date so if you want a

play00:36

computer to do something the old way to

play00:39

do it is to write a program that is you

play00:41

figure out how you do it yourself and

play00:42

its squizz it detail you tell the

play00:44

computer exactly what to do and the

play00:46

computers like you but faster the new

play00:49

way is you tell the computer to pretend

play00:51

to be in your network with a learning

play00:53

algorithm in it that's programming but

play00:55

then after that if you want to solve a

play00:56

particular problem you just show

play00:58

examples so suppose you want to solve

play01:01

the problem of I give you all the pixels

play01:04

in the image

play01:05

that's three numbers per pixel for the

play01:08

color there's like let's say a million

play01:10

of them and you have to turn those three

play01:13

million numbers into a string of words

play01:16

that says what's in the image that's a

play01:20

tricky program to write people tried in

play01:22

air for 50 years and didn't even come

play01:24

close but now a neural net can just do

play01:27

it and I'll show you how it does that

play01:29

that is we have no idea how to write

play01:31

that program but a neural net can do it

play01:35

so we're gonna make our neural net out

play01:37

of artificial neurons an artificial

play01:39

neuron is going to have some input lines

play01:41

that come from the sensors or other

play01:42

neurons on each input line there's going

play01:44

to be a weight that could be positive or

play01:45

negative and it's going to adapt by

play01:47

changing the strengths of those weights

play01:49

and the way it behaves is that it takes

play01:51

the values on the input lines multiplies

play01:53

each value by weight adds it all up and

play01:55

that's this total input and then it

play01:57

gives an output that's a nonlinear

play01:59

function of that total input and the

play02:01

function is shown on the right if the

play02:03

total input isn't big enough it stays

play02:04

silent as soon as the total input gets

play02:06

bigger than that it starts giving a

play02:07

response it gets bigger as the total

play02:09

input gets bigger for 30 years we used a

play02:11

different kind of neuron that didn't

play02:12

work as well and then we tried this one

play02:14

and this works better

play02:15

that gives you some idea about the state

play02:18

of the art in the field there's very

play02:21

simple changes that just make things

play02:22

weren't much better than people haven't

play02:23

explored ok we're going to hook those

play02:28

guys up into networks with multiple

play02:30

layers and we're going to learn the

play02:33

connections on the inputs to the neurons

play02:35

in all the layers and so the problem

play02:39

solved now all we need to do is figure

play02:41

out how to adapt those connections and

play02:42

we're done because these networks can do

play02:44

anything so it's just a question of

play02:46

changing the connections and there's a

play02:48

very interesting and simple algorithm

play02:49

that occurs to anybody who believes in

play02:51

evolution which hopefully is most of you

play02:56

what you do is you take one of the

play02:59

connections you take a small batch of

play03:02

training examples

play03:03

a typical batch you run it through the

play03:05

network and you see how well the network

play03:07

does so how similar are the outputs of

play03:09

the network to the outputs that you

play03:10

think are the correct answers on this

play03:12

training data and then you change that

play03:14

one wait and then you run this batch

play03:17

through again and you see if that

play03:19

improves things if it improves things

play03:21

you keep the change if it doesn't

play03:23

improve things you don't keep the change

play03:25

you leave it like it was that's it and

play03:28

that algorithm works that's a very naive

play03:32

view of evolution but it works it's how

play03:35

many mutations you're doing and if you

play03:37

just do that long enough you'll get a

play03:39

network that does good things the

play03:41

problem is that I have to run a whole

play03:44

bunch of examples through the network

play03:45

actually twice for each weight and I

play03:48

might have a billion weights so what we

play03:50

want we want to do that algorithm that's

play03:52

the basic algorithm you're going to

play03:54

tinker with weights and just keep the

play03:55

tinkers of change but it's hard to do it

play03:59

efficiently and so we're now going to

play04:02

use some calculus and do the same thing

play04:04

efficiently what we do is because we

play04:10

know the weights in the network or your

play04:12

brain knows the weights in your brain we

play04:14

don't actually have to tinker with the

play04:16

weight and see the effect we can imagine

play04:19

tinkering with the weight and figure out

play04:21

what the effect would be if you know all

play04:24

the weights in the network you can say

play04:25

if I were to change this the output

play04:26

would change this way and that would be

play04:28

good

play04:29

so therefore I want to change this so

play04:31

what we can do is look at the

play04:32

discrepancy in the output and from the

play04:36

discrepancy between what we want and

play04:37

what we got we can send information

play04:40

backwards through the net that's doing

play04:42

this computation for every weight of how

play04:45

a small change in that way would affect

play04:47

the output how a small increase in that

play04:49

weight would improve or make worse the

play04:51

output and we can do that for all the

play04:55

weights in parallel so in the same

play04:57

amount of time is the mutation algorithm

play04:59

can figure out what shoe is one wage we

play05:02

can figure out what to do with all the

play05:03

weights and if there's a million weights

play05:04

that's a million times more efficient

play05:05

when a million times more efficient

play05:07

enough to make a difference and

play05:11

backpropagation had great promise but by

play05:13

the 1990s people in machine learning

play05:15

given up on it because they had

play05:17

relatively small data sets and other

play05:18

algorithms worked better

play05:20

and what's more you could prove things

play05:22

about these other events with back

play05:23

propagation you couldn't prove it would

play05:25

work and what's more when different

play05:27

people ran it they got different answers

play05:28

and so if you're obsessed with only

play05:31

being one correct answer and you're

play05:32

being able to prove you get it back

play05:34

propagation is not for you nor is life

play05:37

actually one of the reasons people lost

play05:43

interest is that it doesn't work so well

play05:46

the the naive algorithm didn't work well

play05:49

in GP networks and it didn't work in

play05:50

recurrent networks which are explained

play05:52

in a minute so then a few technical

play05:56

advances were made in Canada in by

play05:59

Canada I mean Toronto and Montreal in

play06:02

New York and we're very concerned about

play06:07

those details of those advances but

play06:09

that's minor February else the main

play06:12

message is if we give you a lot of label

play06:13

data and a lot of compute power back

play06:16

propagation now works amazingly well and

play06:18

the rest of the talk would be trying to

play06:19

convince you of that so here was the

play06:22

first practical problem that it made a

play06:24

big impact on that's not quite fair but

play06:27

since this was done in Toronto I'll

play06:29

pretend with us actually for handwritten

play06:31

digit recognition it made a big impact

play06:33

but people said that's an easy problem

play06:36

by a speech recognition is a tough

play06:37

problem so a couple of graduate students

play06:40

at the University of Toronto

play06:43

took an algorithm that I'd been working

play06:45

on and applied to speech recognition and

play06:48

you take some coefficients that describe

play06:51

the sound wave you put it through

play06:52

multiple layers of hidden units and

play06:54

these were lots of hidden units so there

play06:56

are only a few million training examples

play06:58

and between each pair of layers there's

play07:00

four million parameters that because

play07:02

it's fully connected so a statistician

play07:04

if you do statistics 101 you will know

play07:07

that this cannot possibly work because

play07:09

there's many more parameters than our

play07:10

training examples so it's crazy if you

play07:13

want a critique statistics 101 you might

play07:15

say in your lifetime you make about 10

play07:17

billion fixations and you have about

play07:22

10,000 times more synapses than that so

play07:26

you have about 10,000 synapses for each

play07:29

fixation you make so you don't satisfy

play07:32

statistics 101 either um okay so we

play07:37

trained this up on or they trained it up

play07:39

on sound waves trying to predict which

play07:42

piece of which phoneme was being said so

play07:45

imagine a little bit of a spectrogram

play07:46

which is essentially what the bottom is

play07:48

and you're looking at this piece of the

play07:50

vector and saying in the middle of the

play07:51

spectrogram which piece of which phoneme

play07:53

is this guy trying to say and you get a

play07:55

probabilistic answer and then you take

play07:57

all those probability answers and you

play07:58

string together with something else to

play08:00

find a plausible utterance nowadays you

play08:03

don't do that nowadays what you do is

play08:05

you just put sine waves in and you get

play08:09

the transcription out and the only thing

play08:11

is a neural network so recurrent neural

play08:13

network but back then this was just the

play08:15

front end of the system we replaced the

play08:17

front end of speech recognizers with

play08:18

this and it worked better it worked just

play08:19

a little bit better but good speech

play08:22

people particularly down at Microsoft

play08:24

realized right away that if this works a

play08:26

little bit better and to graduate

play08:27

students did in a few months he's going

play08:29

to completely wipe out the existing

play08:31

state-of-the-art

play08:31

and indeed over the next couple years it

play08:33

did so an avid DJ Taniya grad student of

play08:38

Toronto he wanted to go to rim and take

play08:44

this technology to rim he really wanted

play08:46

to do that and I talked to him and they

play08:47

said they weren't interested in speech

play08:48

recognition I I don't know what became

play08:51

of them um

play08:53

so by 2012 Google was using it in the

play09:00

Android and that was the first there was

play09:03

a big increase in speech recognition

play09:05

performance sense it suddenly got better

play09:07

than Siri now everybody's using this

play09:10

algorithm but more updated versions of

play09:13

it and all the best speech recognition

play09:15

systems when I trained with

play09:16

backpropagation in a neural net and some

play09:20

are just end to end in some solar system

play09:22

there's nothing else you just trained it

play09:24

on data all that how you pronounce

play09:26

things and what the words are and all

play09:27

that I forget it and that we'll learn

play09:29

all that then in 2012 two more graduate

play09:35

students of mine so the trick to all

play09:36

this is you have to always get graduate

play09:39

students who are smarter than you

play09:40

there's no point having a graduate

play09:41

student dumber than you because you

play09:42

could have done that so two other

play09:46

graduate students Ilya sutskever

play09:48

who recently got given a billion dollars

play09:50

by open AI to run and laugh um she's

play09:54

slightly depressing because that's a lot

play09:55

more than I ever got and Aleks Reshevsky

play10:00

they took the image net competition

play10:02

where there were a million images and a

play10:05

thousand of each class and you had to

play10:06

recognize subjects in that class and it

play10:09

was a public competition with a secret

play10:11

test set so you couldn't cheat and the

play10:14

person who ran our system on the test

play10:16

set told me I met him at a conference he

play10:18

told me he didn't believe the results so

play10:20

I'm back and ran it again he still

play10:21

didn't believe the results he had to run

play10:23

it three times before he believed the

play10:24

performance results because they were so

play10:25

much better than anybody else's so

play10:29

here's the results in 2012 all of the

play10:33

conventional computer vision systems

play10:34

that didn't use neural nets had

play10:37

plateaued at about twenty five percent

play10:39

error rate and our system I'm almost

play10:43

half that and as with speech recognition

play10:46

as soon as people switched then you got

play10:50

thousands of smart graduate students and

play10:52

thousands of experienced developers and

play10:54

so on making this work really well and

play10:56

by 2015 we'd reach on that data set

play11:00

people reach human levels one hero

play11:02

called andrew capacity

play11:04

actually did the task himself which took

play11:07

a lot of time and got 5% error and now

play11:11

they're down below 3% error and so it's

play11:16

a tenth of the error rate of the

play11:17

computer vision systems now of the

play11:19

previous computer vision systems so this

play11:20

made a big impact

play11:23

partly because speech' worked already

play11:25

but people thought that was a niche the

play11:27

speech worked first because they were

play11:29

the guys who had big labeled data sets

play11:31

when this worked people got all excited

play11:34

and it was very good for IP lawyers okay

play11:40

so these are examples of the kinds of

play11:44

images and notice they're not images

play11:48

that have one nicely centered object in

play11:52

canonical view point most of the

play11:54

teachers missing and the red bar is what

play11:58

the system thought was his best bet it

play12:00

gets told it's right if he in the top

play12:02

five bets it gets right ants because

play12:03

it's not always clear what the right

play12:04

answer is but you'll notice the bullet

play12:07

train it gets right even there's only a

play12:09

small fraction of the image the hand

play12:11

glass it gets wrong it thinks it's

play12:14

scissors if you look at the other things

play12:17

that thinks it is it thinks it might be

play12:18

a stethoscope or a frying pan and you

play12:21

can see why it thinks that and you can

play12:23

see that it needs glasses but the point

play12:25

is it's got the visual appearance of

play12:28

something if you look at it's wrong

play12:29

answers they tell you more than looking

play12:30

at the right answers

play12:36

now I'm going to go on to recurrent Nets

play12:39

so these fie forward Nets were very good

play12:42

at recognizing a phoneme in speech and

play12:44

recognizing object in an image but for

play12:47

dealing with sequences you want a

play12:49

recurrent net and the kinds of recurrent

play12:53

Nets people use now are based on work by

play12:55

a hawk writer and schmidt hoop in 1997

play12:57

that I'm not going to explain and I'm

play12:58

going to simplify them I'm going to

play13:00

pretend to you these recurrent Nets are

play13:01

simpler than they are because you really

play13:05

don't want to it would be nice if they

play13:09

were this simple but they're not okay um

play13:11

so here's how our current net works it

play13:15

has a bunch of input neurons not just

play13:17

one like it shows here but a bunch and

play13:20

that's representing the data of regular

play13:22

time so it might represent a word in a

play13:24

sentence it might work represent an

play13:25

image in a video that's the input it has

play13:29

a bunch of hidden neurons and these

play13:30

hidden neurons connect to themselves so

play13:34

if you look at the second time slice

play13:35

here the second column and look at that

play13:38

middle unit it's getting input it'll get

play13:41

some input from the input to the system

play13:44

which might be the video frame of the

play13:45

word it'll also get input from the

play13:47

previous state of all the hidden neurons

play13:49

so it's remembering and accumulating

play13:51

information and you can train this same

play13:54

thing with backpropagation what you do

play13:56

is you feed it the inputs and the hidden

play13:58

units of accumulated information when

play14:00

you get to the end you see if they can

play14:03

produce the right answer and if they

play14:06

can't you back propagate information so

play14:08

you just go backwards through all those

play14:09

arrows and one thing you'll notice about

play14:11

those arrows is they form a directed

play14:14

acyclic graph that is you cannot go

play14:16

around in a circle following the arrows

play14:18

and that means you can do back

play14:19

propagation you can go backwards without

play14:22

everyone getting in or not basically so

play14:24

in your suits cover and Oriole villians

play14:27

in quickly and pretty much in parallel

play14:30

yoshua bengio and Barda new and show in

play14:33

Montreal developed a way of using these

play14:37

algorithms for doing machine

play14:38

translations and initially it seemed

play14:41

crazy so what we're going to do is we're

play14:44

going to have an encoder Network that

play14:46

reads the sentence in one language and

play14:48

turns it into a thought

play14:50

and then we're going to take the thought

play14:51

and we're going to turn it into a

play14:53

sentence in another language of course

play14:55

to do that you need to know what a

play14:56

thought is now most people in AI in fact

play15:00

still most people in AI they made a very

play15:03

naive mistake which is they thought that

play15:06

strings of symbols come in as words when

play15:09

you say something strings of symbols

play15:10

come out so they think what's in between

play15:13

must be something like a string of

play15:15

symbols that's the stupidest thinking

play15:17

pixels come in and when you print

play15:20

something pixels come out so everything

play15:22

in between must be pixels and in fact

play15:25

the symbolic AI people were laughing in

play15:27

that view that it's all pixels in

play15:28

between that was a view of someone

play15:29

particularly naive called Steve Gosselin

play15:31

and they laughter that but they had

play15:33

exactly the same mistake they thought

play15:34

the stuff that comes in and the stuff

play15:36

that comes out which is the only stuff

play15:37

we know about from outside it must be

play15:39

the same kind of stuff in the middle

play15:40

even though you know that what's in the

play15:42

brain is just big vectors of neural

play15:44

activity

play15:45

there's no symbols in there this

play15:47

particularly no symbolic expressions in

play15:49

there and there certainly aren't rules

play15:50

for animate manipulating symbolic

play15:52

expressions at least after many years of

play15:55

high school there might be a few of

play15:56

those rules that you can't really follow

play15:57

very well but that's not the basic way

play15:59

of doing business so you're going to put

play16:03

words into this encoding network one at

play16:05

a time it's going to first turn those

play16:07

words into a vector representation which

play16:10

is a whole bunch of features it's going

play16:11

to learn to do that so all of these

play16:13

connections are learn by back

play16:14

propagating and it's gonna basically

play16:17

make say the vector for Tuesday be very

play16:19

similar to the vector for Wednesday and

play16:21

very different to the vector for

play16:23

although the words come in it

play16:25

accumulates information in it's hidden

play16:27

units and at the end of the English

play16:29

sentence of the top there there'll be a

play16:31

state of the hidden units that I will

play16:33

call a thought and that's not meant to

play16:36

be a joke that's what I believe a

play16:37

thought is a thought is an activity

play16:40

pattern in a big bunch of neurons and

play16:42

ass activity pattern that doesn't need

play16:44

to be inspected to thinks out and it

play16:46

causes things to happen so I can say to

play16:49

you John thought dan or John thought is

play16:54

snowing outside anything you can put in

play16:57

quotes John can think and what's more

play17:00

John can say it so if John thought it's

play17:02

snowing outside you might say to you is

play17:03

snowing yes

play17:04

so it's obvious that the way you get at

play17:06

thoughts the way I tell you what I'm

play17:08

thinking is either by the words that

play17:11

would have caused the thought or by the

play17:13

words that the thought would have caused

play17:15

it's hooked up at both ends and but the

play17:18

thought itself doesn't look anything

play17:19

like words it's something completely

play17:20

different inside and in fact it looks

play17:22

like that it's not necessarily red um

play17:26

you take that thought vector and you

play17:30

give it to a decoder network and decoder

play17:34

network says okay that was the thought

play17:36

let's suppose it's doing English to

play17:38

French what's the first word in French

play17:40

so it takes a thought and it says okay I

play17:42

think the house was probably loved but

play17:44

it might be law and it might be

play17:45

something else it gives you

play17:46

probabilities of all the various words

play17:49

one way of seeing what the network one

play17:51

way of decoding the thought not the best

play17:53

way but one way to do it is to say okay

play17:55

take those words it thought were

play17:58

reasonably possible pick one of them

play18:00

according to how probably thought they

play18:02

were and then lie to the network tell it

play18:04

okay that was actually the right then

play18:05

you got it that right okay what do you

play18:07

think comes next and then it gives you a

play18:09

prediction for the next word and you say

play18:10

okay you got that right what do you

play18:12

think comes next and that way it will

play18:13

give you a string of words until it

play18:15

eventually gives you a full stop and

play18:17

then that's the translation now what's

play18:19

amazing is that actually works but if

play18:22

you train the whole thing with

play18:23

backpropagation

play18:24

and Google Translate used to have huge

play18:26

tables of phrases this phrase in English

play18:28

tends to go to that phrase in French and

play18:30

you try and put all these tables

play18:32

together to get a plausible French

play18:33

sentence and it turns out it works much

play18:37

better to have a system that has no

play18:41

linguistic knowledge whatsoever that is

play18:44

this actually got lots of linguistic

play18:45

much but it wasn't put in by people so

play18:48

now the way Google Translate works on

play18:50

many pairs of languages and soon all of

play18:52

them I think is you take a language you

play18:56

automatically break the words of that

play18:58

language into 32,000 fragments so

play19:01

fragments for English would be whole

play19:02

words like the they'd also be the

play19:04

individual letters there'd be things

play19:06

like in and II D and s and you represent

play19:11

the input string by this string of

play19:13

symbols these photoshoot

play19:17

from this alphabet a 32,000 symbols you

play19:20

feed it the English sentence you have a

play19:24

trance big database of translations it

play19:26

then produces the French sentence and it

play19:30

has these probabilities of producing

play19:31

words and you look at at each point in

play19:34

time when it's producing the French

play19:35

sentence you look at the probability of

play19:37

the science of the correct word then you

play19:39

sang in a back propagate through all

play19:41

those connections you see in the net

play19:42

there send information backwards

play19:45

computing how a small change in that

play19:47

connection strength would increase the

play19:49

probability of the right word that's

play19:51

what you do and so you start with random

play19:53

weights and then you send all this

play19:55

information back to change the strengths

play19:56

very slightly so as to increase the

play19:58

probability of the correct word and you

play19:59

do that for a lot of things and hey

play20:01

presto is the best machine translation

play20:03

system there is pretty much one big

play20:05

improvement that was made by researchers

play20:08

in Montreal is attention so the system I

play20:14

described to you turns English into a

play20:15

thought and then turns the thought into

play20:17

French because that's not what a real

play20:19

translator does I mean he could do that

play20:21

but it'll do better if as he's producing

play20:23

the French he looks back at the English

play20:25

and so they made their networks look

play20:27

back not at the input words in the

play20:29

English but at the hidden States when it

play20:31

was getting English words and they made

play20:33

it learn where to look so that's pretty

play20:35

fancy it's an extra module in the

play20:38

network that's trying to learn where to

play20:40

attend in the English sentence as its

play20:42

producing the French sentence it

play20:44

successfully does that and it makes the

play20:46

whole thing work better and be able to

play20:47

be trained on much less data that's one

play20:49

way the word fragments already described

play20:51

don't use words use pieces of words
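The attention module described above can be sketched in a few lines: score each encoder hidden state against the current decoder state, softmax the scores, and take the weighted sum as the context the decoder looks back at. This is a minimal numpy illustration of the idea only; the dot-product scoring here is an assumption standing in for the learned alignment network the Montreal researchers actually used.

```python
import numpy as np

def attend(decoder_state, encoder_states):
    """Dot-product attention: weight each encoder hidden state by its
    relevance to the current decoder state, then blend them."""
    scores = encoder_states @ decoder_state  # one score per source position
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    context = weights @ encoder_states       # weighted sum of encoder states
    return context, weights

# 5 source positions, hidden size 4, random values standing in for a trained net
rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 4))
decoder_state = rng.normal(size=4)
context, weights = attend(decoder_state, encoder_states)
print(weights)  # a distribution over source positions: where the net is "looking"
```

The network learns where to attend because the weights feed into the translation loss, so backpropagation adjusts the scoring itself.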

play20:53

it'll also work if you use individual

play20:55

letters in fact here's an amazing thing

play20:57

if you're translating Chinese to English

play20:59

and I give you the following choice you

play21:03

could have a big list of Chinese symbols

play21:04

because they're symbols for whole words

play21:06

or I could give you bitmaps of the

play21:10

Chinese symbols which would you rather

play21:12

have as input well it turns out it works

play21:15

better if I give you the bitmaps because

play21:17

back propagation learns that Chinese

play21:19

symbols actually have compositional

play21:20

structure a Chinese symbol you know

play21:23

there's a man running to a house or so I

play21:25

don't know I know nothing about

play21:26

Chinese but those little bits in there

play21:28

actually have morphemic structure and

play21:30

it'll learn that from bitmaps so this is

play21:36

rather bad news for linguists

play21:38

the number of linguists you need to make

play21:40

a really good speech recognition system

play21:41

is zero actually that's entirely unfair

play21:43

you you need to have a well curated data

play21:47

set and linguists will know a lot about

play21:49

how to get a well curated data set but

play21:51

you don't need them telling the neural

play21:52

network what to do now let's combine

play21:58

that with the vision we did before so

play22:01

we're going to take our net that

play22:04

recognizes objects in images trained on

play22:07

image net and we're going to say when we

play22:11

translate we get a thought and then we

play22:14

say that thought well what if we got a

play22:16

percept and then we said that percept so

play22:19

instead of using English to get the

play22:21

thought we're going to use the image net

play22:23

thing to look at an image and get a

play22:24

thought and then from that thought we're

play22:27

going to produce the output and the

play22:30

thought or percept that the net has is

play22:32

simply the activity of all the units

play22:34

just before the answer the last hidden

play22:36

layer because what the net has done is

play22:38

really it's turned the pixels into

play22:41

activity of a bunch of things that's to

play22:43

do with objects but to do with lots of

play22:45

objects in the image and then it makes a

play22:47

choice and says the name of an object

play22:49

but before it says the name of an object

play22:51

it has stuff to do with lots of

play22:52

objects so we use that percept as the

play22:57

encoding and then we decode it and we

play23:00

train it to decode obviously to turn

play23:02

percepts into sentences you need some

play23:04

training so you take that last layer of

play23:07

image net and you take a big database

play23:09

that Microsoft kindly supplied

play23:12

with a few hundred thousand images each

play23:14

with several possible captions and you

play23:17

train the decoder to turn that percept

play23:21

into a sentence and then it does things

play23:24

like you show it that and it says a

play23:26

group of people shopping around a

play23:27

market the actual transcript of that is

play23:31

the correct answer according to the

play23:33

database is people are crouched around in

play23:35

an open market which is better because

play23:37

it's got the crouched

play23:39

and then the one you saw at the

play23:41

beginning so we reach closure you now

play23:43

know how this worked
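Both the translator and the caption decoder are trained the same way Hinton described earlier: look at the probability the net assigned to the correct word, then propagate a signal backwards that says how each weight should change to raise it. That signal boils down to the gradient of a softmax cross-entropy loss. A toy sketch, with an illustrative 5-word vocabulary that is not any real system's:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Scores over a toy 5-word vocabulary and the index of the correct word
logits = np.array([2.0, 0.5, -1.0, 0.1, 0.3])
correct = 1

probs = softmax(logits)
loss = -np.log(probs[correct])  # low probability on the right word => high loss

# Gradient of the loss w.r.t. the logits is simply p - onehot(correct);
# backpropagation pushes this signal through every connection in the net.
grad = probs.copy()
grad[correct] -= 1.0

# A small step against the gradient raises the correct word's probability
new_probs = softmax(logits - 0.5 * grad)
print(new_probs[correct] > probs[correct])  # True
```

Start with random weights, repeat this tiny adjustment over an enormous database of translations, and the probabilities of the correct words climb.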

play23:44

we trained up something on image net we

play23:47

then trained the other part of that

play23:51

thing to produce sentences in English

play23:54

and it says a close-up of a child

play23:54

holding a stuffed animal

play23:55

the real caption a young girl asleep on

play23:57

the sofa cuddling a stuffed bear is

play23:59

somewhat better but I was just

play24:00

completely blown away by this
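The "percept" here is nothing exotic: it is just the activity vector of the network's last hidden layer, read off before the final classification step. A toy numpy sketch of pulling that vector out of a small feed-forward net (random weights and toy dimensions standing in for a trained ImageNet model):

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(12, 8))  # pixels -> hidden (stand-in for the conv layers)
W2 = rng.normal(size=(8, 3))   # hidden -> class scores

def percept(pixels):
    """Return the last hidden layer's activity: the net's 'thought' about
    the image, computed just before it commits to a single class name."""
    return np.maximum(0.0, pixels @ W1)  # ReLU hidden units

pixels = rng.normal(size=12)
thought = percept(pixels)       # hand this vector to a caption decoder
class_scores = thought @ W2     # what the classifier would normally do with it
print(thought.shape, class_scores.shape)
```

Because that layer encodes lots of objects at once rather than a single label, it makes a far richer input for the caption decoder than the class name would.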

play24:03

when Oriol Vinyals and Samy Bengio and

play24:07

other researchers at Google showed me

play24:08

that this worked I thought well you know

play24:11

that's the dream of AI to be able to

play24:13

look at the picture and say what's in it

play24:14

I mean that's sort of basic AI if you

play24:16

can do that you're really onto something

play24:17

and it worked and then within about a

play24:20

week lots of other people had done

play24:23

similar things I think ours was just

play24:25

slightly better there's all sorts of

play24:30

implications for document processing if

play24:33

you can convert a sentence into a

play24:34

thought and then model the structure of

play24:36

those thoughts you can get natural

play24:38

reasoning you might not want natural

play24:40

reasoning because most people's natural

play24:42

reasoning isn't much good but at least you

play24:43

can model it I think to do this properly

play24:48

we'll need a number of parameters

play24:50

comparable with the brain which is a

play24:52

hundred trillion and our neural networks

play24:54

currently have a few billion there's a

play24:56

puzzle here which is we can translate

play24:58

between multiple pairs of languages

play24:59

using a few billion weights that's less

play25:05

than one voxel of a brain scan

play25:07

so either the brain is amazingly much

play25:10

better than what we can do or it's using

play25:13

a different algorithm or it's using back

play25:15

propagation but inefficiently and I

play25:17

don't know which in medical images we'll very

play25:23

soon be better than radiologists so

play25:26

already for skin cancers there's a

play25:28

system that's comparable with

play25:30

dermatologists and

play25:33

actually as soon as it's trained on more

play25:35

images it'll be significantly better

play25:36

that was trained on of the order of

play25:37

100,000 images training on 10 million

play25:39

it will be better one thing to bear in

play25:43

mind that doctors often worry about

play25:45

is where do you get the correct answers

play25:46

well here's something interesting you

play25:48

train a neural network on labels

play25:50

produced by doctors

play25:52

and the neural network can end up much

play25:54

better than the doctors even if the

play25:56

doctors all disagree and only have 70

play25:57

percent agreement

play25:58

the neural network actually gets what's

play26:02

going on and it can be much better than

play26:05

the labels you used to train it
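The claim that a network can beat its own training labels sounds paradoxical, but a quick simulation makes it plausible: if each label is right only 70% of the time, independently, then anything that effectively pools many noisy labels per underlying pattern recovers the truth far more reliably than any single labeller. A toy illustration, with a majority vote standing in for what gradient descent over many similar cases does implicitly:

```python
import random

random.seed(42)
N_CASES, N_LABELLERS, ACC = 1000, 11, 0.70

def noisy_label(truth):
    """A 'doctor' who is right 70% of the time on a binary diagnosis."""
    return truth if random.random() < ACC else 1 - truth

single_correct = majority_correct = 0
for _ in range(N_CASES):
    truth = random.randint(0, 1)
    labels = [noisy_label(truth) for _ in range(N_LABELLERS)]
    single_correct += labels[0] == truth                           # one doctor
    majority_correct += (sum(labels) > N_LABELLERS // 2) == truth  # pooled labels

print(single_correct / N_CASES, majority_correct / N_CASES)
```

A single simulated doctor scores about 70%, while the pooled labels land above 90%: the noise averages out, and the signal that survives is related enough to the ground truth to learn from.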

play26:07

that seems paradoxical but it's not so

play26:12

we don't actually need the ground truth

play26:15

we just need something related

play26:17

enough to the ground truth so the neural

play26:19

network can figure out what's going on

play26:20

which the doctor couldn't and then we

play26:23

can do better I want to finish with one

play26:27

story about the same student George Dahl

play26:31

who was involved in the speech

play26:33

recognition in 2009 in 2012 or 11 I

play26:40

think 12 he entered a competition on

play26:43

Kaggle which he entered quite late and

play26:45

the competition was I give you a few

play26:47

thousand properties of molecules and you

play26:51

have to predict whether this molecule

play26:53

will bind to something the drug

play26:56

companies would like to know this and

play26:58

they'd like to do it without synthesizing

play27:00

the molecule so they'd like to predict

play27:02

which ones are good candidates for

play27:03

binding to something

play27:05

George basically threw our standard

play27:08

neural network at it multiple layers of

play27:10

rectified linear units far more

play27:12

parameters than there were training cases

play27:14

here he was using probably a million

play27:17

parameters with 15,000 training cases
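The architecture described, multiple layers of rectified linear units with far more parameters than training cases, is an ordinary multilayer perceptron. A minimal numpy forward pass of roughly that shape; the dimensions and random weights here are illustrative assumptions, not George Dahl's actual Merck entry:

```python
import numpy as np

rng = np.random.default_rng(7)

# A few thousand molecular descriptors in, two wide ReLU layers, one score out
sizes = [2000, 500, 500, 1]
weights = [rng.normal(scale=0.01, size=(a, b)) for a, b in zip(sizes, sizes[1:])]
n_params = sum(w.size for w in weights)  # ~1.25M parameters, >> 15,000 cases

def predict(descriptors):
    """Forward pass: ReLU hidden layers, sigmoid output = P(molecule binds)."""
    h = descriptors
    for w in weights[:-1]:
        h = np.maximum(0.0, h @ w)       # rectified linear units
    return 1.0 / (1.0 + np.exp(-(h @ weights[-1])))

molecule = rng.normal(size=2000)  # stand-in for one molecule's descriptor vector
p_bind = float(predict(molecule)[0])
print(n_params, p_bind)
```

The surprise of the result was exactly this parameter count: with heavy overparameterization and 15,000 examples, the net still generalized well enough to win.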

play27:20

and it worked he combined it with other

play27:25

methods but he didn't need to it

play27:27

actually would have won the competition

play27:28

without being combined with other

play27:29

methods and they were surprised and

play27:31

there was a $20,000 prize and so George said

play27:33

okay give me the prize and Merck said

play27:35

well it's part of the competition that

play27:37

you have to tell us what QSARs you

play27:39

used and George said what what's a QSAR

play27:42

so this is slightly embarrassing because

play27:45

there's a field called QSAR QSAR is

play27:49

quantitative structure-activity

play27:51

relationships that is how does the

play27:55

structure give rise to the activity I

play27:56

mean this has been going for like 20

play27:59

years it has a journal it has an annual

play28:00

conference it has a whole bunch of

play28:02

people for whom that's what they do and

play28:04

George won it without

play28:06

even knowing the name of the field okay

play28:10

that's it

play28:12

[Applause]

play28:13

[Music]

play28:17

[Applause]

play28:19

[Music]
