Geoffrey Hinton Unpacks The Forward-Forward Algorithm

Eye on AI
18 Jan 2023 · 58:55

Summary

TLDR: In this video transcript, Craig Smith interviews deep learning pioneer Geoffrey Hinton about his forward-forward algorithm. Hinton has long been curious about how the brain processes information, and the algorithm is intended to model how the cerebral cortex might learn. It splits learning into an online (wake) phase and an offline (sleep) phase: in the online phase the network tries to have high activity in every layer for real data, while in the offline phase it generates its own data and tries to have low activity. Hinton argues this fits the brain's workings better than traditional back propagation, because it requires neither a perfect model of the forward system nor propagation backwards through time. He also discusses how to generate negative data and how to learn effectively with this algorithm. The discussion is highly technical, but even listeners unfamiliar with the details will find Hinton's insights valuable.

Takeaways

  • 🤖 Geoffrey Hinton, a pioneer of neural networks and deep learning, has proposed a new learning algorithm — the forward-forward algorithm — which he believes may be a more plausible model of how the cerebral cortex learns.
  • 🧠 Hinton argues that traditional back propagation cannot fully explain how the brain processes information, since there is no evidence that the brain propagates information backwards in that way.
  • 🔁 The forward-forward algorithm has an online phase (corresponding to being awake) and an offline phase (corresponding to sleep), used for real data and generated data respectively.
  • 🌐 In the online phase the network tries to make every layer have high activity for real data; in the offline phase it generates its own data and tries to make every layer's activity low.
  • 📉 A key property of the algorithm is that it needs no perfect model of the forward system, unlike traditional back propagation, which requires an exact forward model to update the weights.
  • 🚀 Hinton discusses turning a static image into a "video" so that top-down effects can operate along the time dimension, which helps with processing dynamic visual information.
  • 👶 Hinton notes that infants acquire 3D perception from structure-from-motion within a very short time, suggesting the brain rapidly learns the relationship between 3D structure and motion.
  • 🔧 Hinton ran extensive experiments in Matlab to test and validate the forward-forward algorithm, even though Matlab is not the most efficient tool for large systems.
  • 📈 Hinton believes that if we can understand how the brain works and replicate it, building models with reasoning ability is possible, but he is cautious about the nature of consciousness.
  • 🤔 Hinton stresses that our understanding of consciousness may resemble the "vital force" concept of 100 years ago — an attempt to explain a complex mechanism with a simple essence — and that such an understanding is probably wrong.
  • 💡 Hinton says much of his knowledge comes from talking with people rather than reading; he considers conversation an efficient way to acquire knowledge.

Q & A

  • What is the forward-forward algorithm in deep learning?

    -The forward-forward algorithm is a new learning algorithm proposed by Geoffrey Hinton, which he considers a more plausible model than back propagation for how the cerebral cortex learns. It divides learning into an online (wake) phase and an offline (sleep) phase: in the online phase the network takes in real data and tries to keep activity high in every layer so it can distinguish real from fake data; in the offline phase the network generates its own data and tries to keep activity low in every layer, thereby learning a generative model.

  • Why does Geoffrey Hinton think back propagation is not how the brain processes information?

    -Hinton argues that back propagation requires a perfect model of the forward system, and there is no evidence that information flows backwards in the brain in the way back propagation requires. Moreover, implementing back propagation in recurrent networks runs into problems with time — for example, when processing video you cannot simply stop and pass errors backwards through time.

  • How is negative data generated in the forward-forward algorithm?

    -Negative data is data fed to the network while it runs in the negative phase, with the goal of producing low activity in all hidden layers. Negative data can be generated by the model itself, or supplied by hand — for example by pairing an image with an incorrect label. Ultimately you want the model to generate its own negative data; once the model is good enough, negative data looks just like real data and learning stops.

  • How do the "online phase" and "offline phase" of the forward-forward algorithm work?

    -In the online phase the network processes real input, aiming for high enough activity in every layer to distinguish real data from fake data. In the offline phase the network generates its own data and aims for low activity in every layer when given its own generated data as input. This trains the network to learn a generative model and to tell real data from fake data.

  • Why does Geoffrey Hinton think the forward-forward algorithm might suit low-power computer architectures?

    -The forward-forward algorithm does not need a perfect model of the forward system; it only needs a good enough model of an individual neuron's behaviour to change that neuron's incoming weights and make it more or less active. The algorithm can learn in systems that contain unknown black boxes, which gives it potential on low-power hardware.

  • How did Geoffrey Hinton develop and test his new learning algorithm?

    -Hinton develops and tests his ideas by writing small-scale models in Matlab. He spends time thinking about a concept, then implements it in Matlab to see whether it works. Most of his initial ideas turn out to be wrong, and Matlab's convenience lies in letting him prove them wrong quickly.

  • Why does Geoffrey Hinton think consciousness is not a simple concept?

    -Hinton considers consciousness an extremely complicated concept that people tend to conflate, trying to explain a complex mechanism with a simple "essence". He likens discussions of consciousness to the old notion of "vital force" — a vaguely defined concept once used to explain why living things are alive. He stresses that reports of conscious experience are really statements about hypothetical states, not about some special inner essence.

  • How does the network handle static images versus video in the forward-forward algorithm?

    -A static image can be treated as a special case of video in which the frames do not change over time. With video, the network makes forward passes through time, allowing predictions from higher layers to influence representations in lower layers and so build an understanding of dynamic scenes.

  • How does Geoffrey Hinton exchange ideas and learn new things?

    -Hinton learns by talking with experts in different fields — cognitive scientists, neuroscientists, and psychologists. Because he reads slowly and gets stuck on equations, he finds conversation an effective way to acquire knowledge.

  • Why does Geoffrey Hinton think it is too early to scale the forward-forward algorithm to large systems?

    -Hinton believes the forward-forward algorithm needs particular implementation tricks, and these are less well understood than those for back propagation; until the basic principles and best practices are thoroughly understood and validated, scaling it to large systems is premature.

  • Does Geoffrey Hinton think the forward-forward algorithm could eventually surpass back-propagation-based models?

    -Hinton is not sure. He thinks back propagation can compress more knowledge into a given number of connections, while forward-forward may be less effective with very many connections. But he also notes that the brain's main problem is not compressing knowledge — it is extracting information from experience effectively.

  • How does the network distinguish real data from fake data in the forward-forward algorithm?

    -Through the activity level of each layer. In the online phase, each layer tries to keep activity high when processing real data; in the offline phase, each layer tries to keep activity low for the data the network generates itself (fake data). In this way the network learns to tell the two apart.

Outlines

00:00

🤖 Deep learning pioneer Geoffrey Hinton's forward-forward algorithm

This section introduces Geoffrey Hinton, an eminent scientist in artificial intelligence, his contributions to neural networks and deep learning, and his forward-forward algorithm. Hinton argues that traditional back propagation cannot fully model how the brain processes information, so he proposes a new learning algorithm intended as a more plausible model of how the cerebral cortex might learn.

05:02

🔍 The online and offline learning phases of the forward-forward algorithm

The second section describes the online (wake) and offline (sleep) phases in detail. In the online phase the network processes input and tries to distinguish real data from fake data; in the offline phase it generates its own data and tries to distinguish its own fake data from real data. Hinton also discusses training the network through the contrast between positive and negative data, and the similarity of this process to generative adversarial networks (GANs).

10:03

📈 Generating positive and negative data and the learning signal

In the third section, Hinton explores positive and negative data and how negative data is used in the network's negative phase to lower hidden-layer activity. He explains how the model can generate its own negative data and how the statistical difference between positive and negative data serves as the learning signal. He also discusses supplying negative data by hand for supervised learning and how that trains the network.

15:06

🔢 A simple learning model for character recognition and prediction

The fourth section illustrates how the forward-forward algorithm works through a simple character recognition and prediction task. Hinton describes using the positive phase to raise hidden-layer activity and using that activity to predict the next character; in the negative phase the network tries to lower the hidden-layer activity produced by the characters it predicts, thereby refining the model.

20:07

🕒 The time dimension in neural networks

The fifth section discusses the role of time in neural networks, particularly for video. Hinton explains how the passage of time is simulated in the network and how dynamic images are handled. He also describes how fast-changing input challenges the network's ability to settle, and the complexity of processing information along the time dimension.

25:07

🧠 Combining capsule networks with the forward-forward algorithm

In the sixth section, Hinton discusses capsule networks and how they might combine with the forward-forward algorithm. He describes how capsules can represent different kinds of objects, build object representations at different levels, handle 3D structure, and predict what an object looks like from different viewpoints.

30:08

🌐 Perceiving 3D reality and the future of the forward-forward algorithm

In the seventh section, Hinton expresses his hopes for the algorithm in perceiving 3D reality. If the algorithm can successfully model how the cortex processes information, perception of depth and the 3D world should emerge naturally. He also discusses training networks to learn 3D structure from video and from changes of viewpoint.

35:10

⚙️ Tuning the algorithm and matching it to hardware

In the eighth section, Hinton discusses the importance of optimizing the algorithm and matching it to hardware. He mentions using different objective functions to find features or constraints in the data, the potential of applying forward-forward to large systems, the need for new kinds of computers, and exploiting the natural properties of hardware for more efficient computation.

40:10

🧐 Consciousness, perception and the future of AI

In the final section, Hinton explores consciousness and perception and how they relate to the future of artificial intelligence. He criticizes current understandings of consciousness, arguing it is a complex mechanism rather than a simple concept, and discusses conveying feelings and experiences by describing hypothetical situations that could cause a particular brain state.

Keywords

💡Deep learning

Deep learning is a machine learning approach that processes complex data with neural networks loosely modeled on the brain. The video highlights pioneer Geoffrey Hinton's contributions to deep learning, especially his application of the back propagation algorithm, which sparked a revolution in artificial intelligence.

💡Back propagation

Back propagation is an algorithm for training neural networks: it computes the gradient of a loss function with respect to each parameter and uses those gradients to update the network's weights. The video covers Hinton's view that the brain does not process information this way.

💡Forward-forward algorithm

The forward-forward algorithm is Hinton's new learning algorithm, which he believes may be a more plausible model of cortical learning. It divides learning into an online (wake) phase and an offline (sleep) phase, aiming to process information more efficiently.

💡Wake phase

In the forward-forward algorithm, the wake phase is when the network processes input online, trying to achieve high activity levels for real data. This relates to the video's discussion of how the network uses activity levels to tell real data from fake data.

💡Sleep phase

The sleep phase is the offline learning phase of the forward-forward algorithm, in which the network generates its own data and tries to achieve low activity levels for fake data. This relates to the discussion of how the network learns a generative model to distinguish real from fake data.

💡Features and constraints

Features are elements of the data with high variance; constraints are elements with low variance. The video mentions that by adjusting the objective function the network can be made to learn either features or constraints, helping it understand and process the input.

💡Capsule networks

Capsule networks are another neural network architecture proposed by Hinton, in which groups of neurons represent different kinds of entities and can capture the 3D structure of the real world. The video touches on their relation to the forward-forward algorithm and to understanding depth and 3D space.

💡Generative model

A generative model is a machine learning model that can generate or simulate samples like those in a dataset. In the forward-forward algorithm, the network must learn a generative model to produce fake data for learning during the sleep phase.

💡Gradient

A gradient is the mathematical rate of change of a function; in deep learning, gradients of the loss with respect to the weights are the basis of weight updates. The video discusses the role of gradients in back propagation and how the forward-forward algorithm avoids propagating them backwards.

💡Cognitive science

Cognitive science is an interdisciplinary field studying human cognition, including learning, memory, and perception. The video mentions Hinton's exchanges with experts in other fields and the importance of cognitive science for understanding the brain and artificial intelligence.

💡Consciousness

Consciousness is a complex concept in philosophy and cognitive science, involving perception, experience, and self-awareness. In the video Hinton argues that discussions of consciousness often conflate different notions and that it is a domain that resists precise definition.

Highlights

Craig Smith introduces Geoffrey Hinton, a pioneer of neural networks and deep learning, who proposes the forward-forward algorithm as a possibly more plausible model of how the cerebral cortex learns.

Hinton does not believe back propagation explains how the brain processes information; his forward-forward algorithm divides learning into online and offline phases.

In the online phase, the network's goal is to distinguish real data from fake data, with every layer having high activity for real data.

In the offline phase, the network must generate its own data and try to have low activity in every layer for fake data, which requires learning a generative model.

Hinton discusses the advantage of forward-forward for time-series data such as video, since it permits pipelined processing of information.

The positive phase of the algorithm aims for high activity and the negative phase for low activity, resembling the discriminative and generative models in generative adversarial networks.

Hinton notes that although forward-forward is technically challenging, it needs no perfect model of the forward system, which makes hardware implementations more flexible.

He stresses the importance of distinguishing real from fake data in a learning algorithm, and explains how activity levels accomplish this.

Hinton discusses strategies for switching between the positive and negative phases, and how learning can proceed effectively without interleaving the two.

He explains the concept of negative data — data used in the negative phase to lower hidden-layer activity — which can be generated by the model itself or supplied by hand.

Hinton gives a simple learning exercise showing how positive and negative data are used to train a network.

He discusses how the forward-forward algorithm might apply to tasks such as natural language, visual processing, and common-sense reasoning.

Hinton believes that if we can understand how the brain works and replicate it, models with reasoning ability are possible.

His discussion of consciousness suggests it is not the simple concept people imagine, but a mixture of many different notions.

Hinton shares his work in Matlab, including his experiments and explorations with the forward-forward algorithm.

He emphasizes how the success of deep learning models in computer vision challenged the traditional research paradigm.

Hinton says he hopes that sharing his Matlab code will encourage more people to work on and implement the forward-forward algorithm.

He discusses phenomena observed in early childhood learning and how they inspired his research on the algorithm.

Hinton shares his thinking on how information is effectively integrated into the brain, and its potential connection to the forward-forward algorithm.

He looks ahead to future computer architectures, especially low-power computers for natural language and vision tasks.

Transcripts

play00:00

Seeing a pink elephant — notice the words pink and elephant refer to things in the world. So what's actually happening is, I'd like to tell you what's going on inside my head. Hi, I'm Craig Smith, and this is Eye on AI.

[Music]

play00:26

Geoffrey Hinton, a pioneer in neural networks and the man who coined the term deep learning, has been driven throughout his career to understand the brain. While his application of the back propagation of error algorithm to deep networks set off a revolution in artificial intelligence, he doesn't believe that it explains how the brain processes information. Late last year he introduced a new learning algorithm, which he calls the forward-forward algorithm, that he believes is a more plausible model for how the cerebral cortex might learn. A lot has been written about the forward-forward algorithm in recent weeks, but here Jeff gives us a deep dive into the algorithm and the journey that led him to it. The conversation is technical and assumes a lot of knowledge on the part of listeners, but my advice for those that don't have that knowledge is to let the technical stuff wash over you and listen instead for Jeff's insights. Before we begin, I'd like to mention our sponsor, ClearML, an open-source end-to-end MLOps solution. You can try it for free at clear.ml — that's c-l-e-a-r dot ml. Tell them Eye on AI sent you. Now here's Jeff. I hope you find the conversation as fascinating as I did.

play02:20

…to listeners forward-forward networks, and why you're looking for something beyond back propagation, despite its tremendous success.

play02:33

Let me start with explaining why I don't believe the brain is doing back propagation. One thing about back propagation is you need to have a perfect model of the forward system. That is, in back propagation, it's easiest to think about for a layered net, but it also works for recurrent nets. For a layered net, you do a forward pass where the input comes in at the bottom and goes through these layers. So the input might be pixels, and what comes out the top might be a classification of: is it a cat or a dog? You go forwards through the layers, and then you look at the error in the output. If it says cat when it should say dog, that's wrong, and you'd like to figure out how to change all the weights in the forward pass so that next time it's more likely to say the right category rather than the wrong one. So you have to figure out how a change in a weight would affect how much it gives the right answer, and then you want to go off and change all the weights in proportion to how much they help in getting the right answer. Back propagation is a way of figuring out that gradient — figuring out how much a change in the weight would make the system have less error. Then you change the weight in proportion to how much it helps, and obviously if it hurts you change it in the opposite direction.
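The rule Jeff describes — change each weight in proportion to how much it reduces the error — is plain gradient descent. A minimal sketch of a hypothetical two-layer net trained this way (illustrative code, not anything from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))            # 4 inputs with 3 features each
t = rng.normal(size=(4, 1))            # targets
W1 = rng.normal(size=(3, 5))           # weights of the two layers
W2 = rng.normal(size=(5, 1))
lr = 0.1

def loss():
    return float(np.mean((np.tanh(x @ W1) @ W2 - t) ** 2))

loss_before = loss()
for step in range(200):
    h = np.tanh(x @ W1)                # forward pass through the layers
    y = h @ W2                         # output at the top
    err = y - t                        # error in the output
    # backward pass: same weights, backwards direction, through the non-linearity
    dW2 = h.T @ err / len(x)
    dW1 = x.T @ ((err @ W2.T) * (1.0 - h ** 2)) / len(x)
    W2 -= lr * dW2                     # change each weight in proportion to
    W1 -= lr * dW1                     # how much it reduces the error
loss_after = loss()
```

The point of contention is the backward pass: the gradients `dW1` flow back through `W2` and through the derivative of the non-linearity, and that reverse flow is what Jeff argues the brain is not doing.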

play03:58

Now, back propagation looks like the forward pass, but it goes backwards. It has to use the same connectivity pattern with the same weights, but in the backwards direction, and it has to go backwards through the non-linearity of the neuron. There's no evidence that the brain is doing that, and there's lots of evidence it's not doing that. The worst case is if you're doing back propagation in a recurrent net, because then you run the recurrent net forwards in time, and it outputs an answer at the end of running forwards in time, and then you have to run it backwards through time in order to get all the derivatives you need to change the weights. That's particularly problematic if, for example, you're trying to process video: you can't stop and go backwards in time. So combined with the fact that there's no evidence the brain does it — well, no good evidence — there's the problem that, just for technology, it's a mess: it interrupts the pipelining of stuff through. For something like video, where there are multiple stages of processing, you'd really like to just pipeline the inputs through those multiple stages and keep pipelining them through.

play05:15

So the idea of the forward-forward algorithm is that if you can divide the learning — the process of getting the gradients you need — into two separate phases, you can do one of them online and one of them offline, and the way you do the online one can be very simple and will allow you to just pipeline stuff through.

play05:39

So in the online phase, which is meant to correspond to wake, you put input into the network — let's take the recurrent version, where input keeps coming into the network. What you're trying to do, for each layer at each time step, is to make the layer have high activity — or rather, high enough activity — so that it can figure out that this is real data. The underlying idea is that for real data you want every layer to have high activity, and for fake data — where that comes from, we'll get to later — you'd like every layer to have low activity. The task of the network, the thing it's trying to achieve, is not to give the correct label as in back propagation; it's trying to achieve this property of being able to tell the difference between real data and fake data at every layer, by each layer having high activity for real data and no activity for fake data. So each layer has its own objective function.

play06:49

In fact, to be more precise, we take the sum of the squares of the activities of the units in a layer, we subtract off some threshold, and then we feed that to a logistic function that simply decides: what's the probability that this is real data as opposed to fake data? If the logistic function gets a lot of input, it will say it's definitely real data, and so there's no need to change anything — if it's getting lots of input, you won't learn on that example, because it's already getting it right. And that explains how you can run lots of positive examples without running any negative examples (which are fake data): it'll just saturate on the positive examples it's getting right.
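The per-layer objective Jeff describes — sum of squared activities, minus a threshold, pushed through a logistic — can be sketched in a few lines (illustrative code with an arbitrary threshold, not Hinton's Matlab):

```python
import numpy as np

def goodness_prob(layer_activity, threshold=2.0):
    """One layer's own judgement of its input: the probability that the data
    is real, as logistic(sum of squared activities - threshold)."""
    goodness = float(np.sum(layer_activity ** 2)) - threshold
    return 1.0 / (1.0 + np.exp(-goodness))       # logistic function

high = goodness_prob(np.array([2.0, 1.5, 1.0]))  # lots of activity -> "real"
low = goodness_prob(np.array([0.1, 0.2, 0.1]))   # little activity -> "fake"
```

Note the saturation Jeff mentions: once the logistic input is large, the output is pinned near 1 and its gradient is almost zero, so positive examples the layer already classifies confidently produce essentially no weight change.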

play07:38

So that's what it does in the positive phase: it tries to get a high sum of squared activities in every layer — high enough that it can tell it's real data. In the negative phase, which is run offline — that is, during sleep — the network needs to generate its own data, and, given its own data as input, it wants to have low activity in every layer. So the network has to learn a generative model, and what it's trying to do is discriminate between real data and fake data produced by its generative model. Obviously, if it can't discriminate at all, then the derivatives it gets for real data and the derivatives it gets for fake data will be equal and opposite, so it won't learn anything — learning will have finished when you can't tell the difference between what it generates and real data.
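Putting the two phases together for a single layer: the update is entirely local — raise the sum of squared activities on real data, lower it on generated data, with no gradients passing between layers. A toy reconstruction of that idea under my own sign conventions and made-up data (not Hinton's code; the real algorithm has further details not covered in this conversation):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(scale=0.5, size=(10, 8))   # one layer's incoming weights
lr = 0.01

def layer(x):
    return np.maximum(0.0, x @ W)         # ReLU activities of the layer

def local_update(x, positive):
    """Layer-local forward-forward step: move the sum of squared activities
    up for positive (real) data, down for negative (generated) data."""
    global W
    h = layer(x)
    grad = x[:, None] * (2.0 * h)[None, :]  # d(sum h^2)/dW; zero where ReLU is off
    W = W + (lr * grad if positive else -lr * grad)

real = rng.normal(loc=1.0, size=10)             # stand-in for real data
fake = rng.normal(loc=0.0, scale=0.3, size=10)  # stand-in for generated data

good_real_before = float(np.sum(layer(real) ** 2))
for _ in range(20):
    local_update(real, positive=True)     # wake: push goodness up on real data
    local_update(fake, positive=False)    # sleep: push goodness down on fake data
good_real_after = float(np.sum(layer(real) ** 2))
```

When the generated data becomes indistinguishable from real data, the two updates are equal and opposite and cancel — which is exactly the "learning has finished" condition Jeff describes.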

play08:37

This is very like — again, if you know about generative adversarial networks — a GAN, except that the discriminative net that's trying to tell the difference between real and fake, and the generative model that's trying to generate fake data, use the same hidden units, and so they use the same hidden representations. That overcomes a lot of the problems that a GAN has. On the other hand, because it's not doing back propagation to learn the generative model, it's harder to learn a good generative model. That's a rough overview of the algorithm.

play09:12

Let me ask a couple of questions on the wake and sleep cycle. Are you cycling quickly between them?

play09:22

Okay, so most of the preliminary research I would do cycles quickly between them, because that's the obvious thing to do. And later on I discovered — well, I've known for some time that with contrastive learning you can separate the phases — and later on I discovered it worked pretty well to separate the phases. In recent experiments I've done with predicting characters, you can have it predict about a quarter of a million characters. So it's running on real data, trying to predict the next character, making predictions. It's running with mini-batches, so after making quite a large number of predictions it updates the weights, and then it sees more positive examples and updates the weights again. In all those phases it's just trying to get higher activity in the hidden layers — but only if it hasn't already got high activity. You can predict a quarter of a million characters like that in the positive phase, and then switch to the negative phase, where the network's generating its own string of characters, and you're now trying to get low activity in the hidden layers for the characters it's predicting — it's looking at a little window of characters. Then you run for a quarter of a million characters like that. And it doesn't actually have to be the same number any more: with Boltzmann machines it's very important to have the same number of things in the positive phase and the negative phase, but with this it isn't. The most remarkable thing is that up to a few hundred thousand predictions, it works almost as well if you separate the phases as opposed to interleaving them, and that's quite surprising.

play11:08

In human learning — certainly there's the wake-sleep cycle for complicated concepts that you're learning — but there's learning going on all the time that doesn't require a sleep phase.

play11:21

Well, there is in this too. If you're just running on positive examples, it's changing the weights for all the examples where it's not completely obvious that this is positive data, so it does a lot of learning in the positive phase. But if you go on too long, it fails catastrophically. And people seem to be the same: if you deprive people of sleep for a week, they go completely psychotic and get hallucinations, and they may never recover.

play11:55

Can you explain — I think one thing that people, non-practitioners, are having trouble understanding is the concept of negative data. I've seen a few articles where they just put it in quotation marks out of your paper, which indicates that they don't understand it.

play12:15

Okay. What I mean by negative data is data that you give to the system when it's running in the negative phase — that is, when it's trying to get low activity in all the hidden layers. There are many ways of generating negative data. In the end you'd like the model itself to generate the negative data. This is just like it was in Boltzmann machines: the data that the model itself generates is negative data, and real data is what you're trying to model. Once you've got a really good model, the negative data looks just like the real data, so no learning takes place. But negative data doesn't have to be produced by the model. So, for example, you can train it to do supervised learning by inputting both an image and the label — so now the label's part of the input, not part of the output. What you're asking it to do is this: when I input an image with the correct label, that's going to be the positive data, and you want it to have high activity; when I input an image with an incorrect label, which I just put in by hand, that's negative data. Now, it works best if you get the model to predict the label and you put in the best of the model's predictions that is not correct, because then you're giving it, as negative data, the mistakes it's most likely to make. But you can put in negative data by hand and it works fine.
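The supervised scheme Jeff describes — label as part of the input, wrong labels as negative data — might be set up like this (an illustrative sketch; the helper `make_input`, which just concatenates a one-hot label onto a flattened image, is my own hypothetical construction):

```python
import numpy as np

def make_input(image, label, num_classes=10):
    """Concatenate a one-hot label onto the flattened image, so the label
    is part of the input to the network, not part of the output."""
    one_hot = np.zeros(num_classes)
    one_hot[label] = 1.0
    return np.concatenate([one_hot, image])

rng = np.random.default_rng(0)
image = rng.random(784)                 # a stand-in 28x28 image, flattened
correct = 3

positive = make_input(image, correct)   # real pairing: high activity wanted
wrong = (correct + 1) % 10              # any incorrect label, chosen by hand
negative = make_input(image, wrong)     # negative data: low activity wanted
```

Per the conversation, hand-picked wrong labels work fine, but the best negative data uses the model's own most confident incorrect prediction, since that targets the mistakes it is most likely to make.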

play13:52

And the reconciliation then, at the end — is it as in Boltzmann machines, where you're subtracting the negative data from the positive data?

play14:06

In Boltzmann machines, what you do is give it positive data — real data — and you let it settle to equilibrium, which you don't have to do with the forward-forward algorithm (well, not exactly, anyway). Once it's settled to equilibrium, you measure the pairwise statistics — that is, how often two units that are connected are on together. Then in the negative phase you do the same thing, but you just let the model settle while producing data itself, and you measure the same statistics. You take the difference of those pairwise statistics, and that is the correct learning signal for a Boltzmann machine. But the problem is you have to let the model settle, and there just isn't time for that. Also, you have to have all sorts of other conditions — like the connections have to be symmetric, and there's no evidence that connections in the brain are symmetric.
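The Boltzmann-machine learning signal Jeff is describing is the classic difference of pairwise co-activation statistics between the two phases (standard notation, not from the talk):

```latex
\Delta w_{ij} \;\propto\; \langle s_i s_j \rangle_{\text{data}} \;-\; \langle s_i s_j \rangle_{\text{model}}
```

Here \(\langle s_i s_j \rangle\) is how often connected units \(i\) and \(j\) are on together, measured after settling to equilibrium — once clamped on real data (positive phase), and once while the model generates data itself (negative phase). The need to settle to equilibrium, and the requirement of symmetric weights \(w_{ij} = w_{ji}\), are exactly the conditions Jeff objects to.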

play15:01

Can you give a concrete example of positive and negative data in a very simple learning exercise? You were working on digits.

play15:16

In this example — I think it's clearest if you're predicting a string of characters — for the positive data you'd see a little window of characters, and you'd have some hidden layers. Because that's a positive window of characters, you try and make the activity high in all the hidden layers. But also, from the activity in those hidden layers, you try to predict the next character. That's a very simple generative model. But notice the generative model isn't having to learn its own representations: the representations are learned just to make positive strings of characters give you high activity in all the hidden layers. That's the objective of the learning — the objective isn't to predict the next character. But having done that learning, you've got the right representations for these strings of characters, these windows of characters, and you also learn to predict the next character. That's what you're doing in the positive phase: seeing windows of characters, you're changing the weights so that all the hidden layers have high activity for those windows of characters, but you're also changing top-down weights that are trying to predict the next character from the activity in the hidden layers — that's what's sometimes called a linear classifier.

play16:33

So that's the positive phase. In the negative phase, as input you use characters that have been predicted already. You've got this window, and you're going along and just predicting the next character, and then moving the window along one to include the next character you predicted and to drop off the oldest character — you just keep going like that. For each of those frames, you try and get low activity in the hidden layers, because it's negative data. And I think you can see that if your predictions were perfect and you start from a real string, then what's happening in the negative phase will be exactly like what's happening in the positive phase, and so the two will cancel out. But if there's a difference, then you'll be learning to make things more like the positive phase and less like the negative phase, and so it'll get better and better at predicting.
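The data flow Jeff sketches might look like this in outline — a sliding window of real characters for the positive phase, and the network extending its own string for the negative phase. This is a toy paraphrase with untrained random weights, not his Matlab code; `hidden` and `predict_next` are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = list("abcd")
W = rng.normal(scale=0.1, size=(len(VOCAB) * 5, 16))  # weights into one hidden layer
C = rng.normal(scale=0.1, size=(16, len(VOCAB)))      # top-down "linear classifier"

def encode(window):                        # 5-character window -> one-hot vector
    v = np.zeros((5, len(VOCAB)))
    for i, ch in enumerate(window):
        v[i, VOCAB.index(ch)] = 1.0
    return v.ravel()

def hidden(window):                        # hidden-layer activity for a window
    return np.maximum(0.0, encode(window) @ W)

def predict_next(window):                  # predict the next character from activity
    return VOCAB[int(np.argmax(hidden(window) @ C))]

text = "abcdabcdabcd"
# Positive phase: slide a window over real text; raise hidden activity for each
# window, and train the classifier to predict the character that actually follows.
positive_pairs = [(text[i:i + 5], text[i + 5]) for i in range(len(text) - 5)]

# Negative phase: let the network extend its own string, window by window,
# and lower hidden activity for the windows it generates itself.
window = text[:5]
generated = []
for _ in range(8):
    nxt = predict_next(window)
    generated.append(nxt)
    window = window[1:] + nxt              # drop oldest character, append prediction
```

If the predictions were perfect, `generated` would continue the real text and the two phases would cancel exactly, as Jeff notes.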

play17:35

As I understood back propagation: on static data there are inputs, there's an output, you calculate the error, then you run backwards through the network and correct the weights, and then do it again. And that's not a good model for the brain, because there's no evidence of information flowing backward through the neurons.

play18:05

That's not exactly right. There's no good evidence of derivative information flowing backwards — of these error gradients flowing backwards. Obviously the brain has top-down connections. If you look at the perceptual system, there's a kind of forward direction that goes from the thalamus up to inferotemporal cortex, where you recognize things — the thalamus is sort of where the input comes in from the eyes — and there are connections in the backward direction. But the connections in the backward direction don't look at all like what you'd need for back propagation. For example, between two cortical areas, the connections coming back don't go to the same cells that the connections going forward come from.

play18:51

It's not reciprocal in that sense.

play18:53

Yeah — there's a loop between the cortical areas, but information in one cortical area goes through about six different neurons before it gets back to where it started. So it's a loop; it's not like a mirrored system.

play19:08

Okay, but my question is: you talk about turning the static image into a boring video, and that allows you to have top-down effects.

play19:18

That's right, yeah. So you have to think of there being a forward direction, which is going from lower layers to higher layers, and then, orthogonal to that, there's the time dimension. So if I have a video — even if it's a video of just a single thing that stays still — I can be going up and down through the layers as I go forwards in time, and that's what's allowing you to have top-down effects.

play19:45

Okay, I understood that. Each layer can receive inputs from a higher layer in the previous time step.

play19:51

Exactly, yeah. So what a layer is doing is receiving input from higher layers and lower layers at the previous time step, and from itself at the previous time step.
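So at each time step, a layer's new activity depends on the layer below, the layer above, and itself — all taken from the previous time step. Written as an equation (my notation, not from the talk):

```latex
h_{\ell}(t) \;=\; f\!\big( W^{\text{up}}_{\ell}\, h_{\ell-1}(t-1) \;+\; W^{\text{down}}_{\ell}\, h_{\ell+1}(t-1) \;+\; W^{\text{rec}}_{\ell}\, h_{\ell}(t-1) \big)
```

Here \(h_{\ell}(t)\) is layer \(\ell\)'s activity at time \(t\), \(f\) the neuron's non-linearity, and the three weight matrices carry the bottom-up, top-down, and recurrent contributions. Because the top-down term arrives one step later, top-down effects operate through time rather than through a backward pass.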

play20:04

and if you've got static input

play20:07

that whole process over time looks like

play20:10

a network settling down

play20:12

that's a bit more like a Baltimore

play20:13

machine settling down

play20:16

and the idea is that

play20:18

the time that you're using for that is

play20:20

the same as the time you're using for

play20:22

posting video

play20:24

and because of that

play20:25

if I give you fast input that's changing

play20:28

too fast you can never settle down to

play20:31

interpret it

play20:32

so I discovered this nice phenomenon if

play20:34

you take a new regularly shaped object

play20:37

like a potato for example a nice

play20:40

irregularly shaped potato

play20:42

and you throw it up in the air rotating

play20:44

slowly at one or two revolutions per

play20:46

second

play20:47

you cannot see what shape it is you just

play20:49

can't see the shape of it

play20:51

you don't have time to settle on a 3D

play20:53

interpretation

play20:55

because it's the very same

play20:57

time steps that you're using for posting

play21:00

videos you're using for settling with a

play21:02

static image

play21:03

and what I found fascinating about and

play21:06

maybe this is something that that is

play21:09

already

play21:10

in the literature but this idea of going

play21:13

up and down in the layers

play21:15

As you move through time

play21:17

but it's that's always been in recurrent

play21:21

Nets so to begin with recurrentness we

play21:23

just have one hidden layer so typical

play21:26

lstms and so on would have one hidden

play21:29

there and then Alex Graves

play21:31

the idea of having multiple hidden

play21:33

layers and showed that it was a winner

play21:35

so that idea has been around but it's

play21:38

always been paired with back propagation

play21:39

as the learning algorithm and in that

play21:41

case it was back propagation through

play21:42

time which was completely unrealistic

play21:45

but

play21:46

and the Brain real life is not static so

play21:50

you're not perceiving in a truly static

play21:53

fashion how much of this grew out of

play21:56

Sinclair's contrast of learning or end

play21:58

grads activity differences

play22:02

a couple of years ago I got very excited

play22:04

because I was trying to make a more

play22:07

biologically plausible version of things

play22:09

like Sim clear there's a whole bunch of

play22:11

things like simple it simply wasn't the

play22:12

first of them

play22:14

in fact it's something a bit like

play22:15

simpler that Sue Becker and I published

play22:17

in about 19 1992 in nature

play22:21

but we didn't use negative examples we

play22:24

tried to analytically compute the

play22:25

negative phase and that was

play22:27

a mistake it just would

play22:30

never work

play22:31

um

play22:32

once you start using negative examples

play22:35

then you get things like SimCLR

play22:38

and I discovered that you could separate

play22:40

the phases which they didn't and that got

play22:43

me very excited a few years ago because

play22:45

it seemed like I finally had an explanation

play22:47

for what sleep was for

play22:50

one big difference is

play22:52

SimCLR is taking two different patches

play22:55

from the same image

play22:57

and if they're from the same image it's

play22:58

trying to make them have a similar

play22:59

representation if they're from different

play23:01

images it's trying to make them have

play23:03

different representations sufficiently

play23:05

different once they're different it

play23:06

doesn't try and make them more different

play23:10

and

play23:15

when you think how to say this

play23:19

simply involves looking at two

play23:22

representations and seeing how similar

play23:24

they are

play23:26

and that's one way to measure agreement

play23:30

and in fact if you think about the

play23:32

squared difference between two vectors

play23:35

that decomposes into three terms

play23:38

there's a term to do with the square of the

play23:40

first vector

play23:42

there's something to do with the square

play23:43

of the second vector

play23:45

and then there's the

play23:47

scalar product of the two vectors

play23:50

and the scalar product of the two

play23:52

vectors is the only kind is the only

play23:53

interactive term

play23:56

and so it turns out that

play24:00

squared difference is very like a scalar

play24:03

product

play24:05

a big squared difference

play24:07

is like a small scalar product
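The decomposition described here is easy to check numerically: the squared difference between two vectors splits into a term for each vector's own square plus the scalar product, which is the only interactive term. A quick sketch (my own illustration, not code from the interview):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=5)
b = rng.normal(size=5)

# ||a - b||^2 = ||a||^2 + ||b||^2 - 2 (a . b)
sq_diff = np.sum((a - b) ** 2)
decomposed = np.sum(a ** 2) + np.sum(b ** 2) - 2.0 * np.dot(a, b)

# The scalar product is the only term involving both vectors, so for vectors
# of fixed length a big squared difference means a small scalar product.
print(np.isclose(sq_diff, decomposed))
```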

play24:13

now there's a different way to measure

play24:15

agreement

play24:16

which is to take the things you'd like

play24:18

to agree and feed them into one set of

play24:21

neurons

play24:22

and now if two sources coming into that

play24:26

set of neurons agree

play24:28

you'll get high activity in those

play24:29

neurons it's like positive interference

play24:31

between light waves

play24:33

and if they disagree you'll get low

play24:35

activity

play24:37

and if you measure agreement just by the

play24:41

activity in a layer of neurons

play24:43

you're measuring an agreement between

play24:44

the inputs then you don't have to have

play24:47

two things you can have as many things

play24:49

as you like you don't have to

play24:51

divide the input into two patches and

play24:53

say to the representation of the two

play24:54

patches agree you can just say I've got

play24:57

a hidden layer does this hidden layer

play24:58

get highly active

play25:02

and it seems to me that's a better way

play25:04

to measure agreement it's easier for the

play25:05

brain to do
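A minimal sketch of this way of measuring agreement (my toy example, with made-up weights and inputs, not Hinton's code): several input sources all feed one set of neurons, and agreement is read off as the layer's total activity rather than as a similarity between two separate representations.

```python
import numpy as np

# One set of neurons that several input sources all feed into.
W = np.array([[ 1.0,  0.5],
              [-0.5,  1.0],
              [ 1.0,  1.0]])

def layer_activity(inputs):
    """Total squared ReLU activity: high when inputs reinforce each other."""
    h = np.maximum(0.0, W @ np.sum(inputs, axis=0))
    return float(np.sum(h ** 2))

x = np.array([1.0, 2.0])
print(layer_activity([x, x, x]))   # three agreeing sources -> high activity
print(layer_activity([x, -x]))    # disagreeing sources cancel -> zero activity
```

Note that nothing restricts this to two patches: any number of inputs can be checked for mutual agreement at once.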

play25:07

and it's particularly interesting if you

play25:11

have spiking neurons

play25:12

because what I'm using at present

play25:14

doesn't use spiking neurons it just says

play25:18

a hidden layer is really asking

play25:21

are my inputs agreeing with each other

play25:23

in which case I'll be highly active or

play25:24

are they disagree in which case I won't

play25:27

but if the inputs arrive at specific

play25:29

times very precise times like spikes do

play25:32

then you can ask not just are the

play25:35

neurons being stimulated

play25:39

but are they being stimulated at exactly

play25:41

the same time

play25:43

and that's a much sharper way to measure

play25:45

agreement so spiking neurons seem

play25:47

particularly good for measuring

play25:49

agreement which is what I need
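A toy illustration of why spike timing gives a sharper agreement signal than firing rate alone (my example, not from the talk): two inputs can have identical rates yet drive a coincidence detector very differently depending on whether their spikes line up.

```python
import numpy as np

t = np.arange(100)
a = (t % 10 == 0).astype(int)                 # input A: fires every 10 steps
b_aligned = (t % 10 == 0).astype(int)         # same rate, same times as A
b_shifted = ((t + 5) % 10 == 0).astype(int)   # same rate, shifted by 5 steps

# A coincidence detector responds only when both inputs spike together.
coincidences_aligned = int(np.sum(a * b_aligned))
coincidences_shifted = int(np.sum(a * b_shifted))
print(coincidences_aligned, coincidences_shifted)  # rates equal, agreement not
```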

play25:52

that's the objective function to get

play25:54

agreement in the positive phase is not

play25:56

in the negative phase

play25:58

and

play26:00

I'm thinking about ways of trying to

play26:02

employ spiking neurons to make this

play26:05

work better but that's one big

play26:07

difference from simpler that you're not

play26:09

taking two things and saying do they

play26:11

agree you're just taking all the inputs

play26:13

coming into a layer and saying do all

play26:14

those inputs agree

play26:17

when you talk about the activity that's

play26:21

similar to what you were doing with

play26:22

NGRADs where

play26:24

you're comparing top-down predictions

play26:26

and bottom-up predictions okay okay okay

play26:30

this when you do the recurrent version

play26:32

of the forward algorithm

play26:35

at each time step

play26:37

neurons in a layer are getting top-down

play26:39

input and bottom-up input right

play26:41

and

play26:43

they'd like them to agree

play26:46

and if your objective function is to

play26:48

have high activity

play26:50

they'd like to make things highly active

play26:52

there's another version of the forward

play26:53

algorithm where the objective is to have

play26:55

low activity

play26:56

and then you want the top down to cancel

play26:58

out the bottom up

play27:00

and then it looks much more like

play27:01

predictive coding it's not quite the

play27:03

same but it's very similar but let's

play27:05

stick with the version where you're

play27:06

going for high activity you want the top

play27:08

down and bottom up to agree and give you

play27:10

high activity

play27:12

but notice that

play27:15

it's not like the top down is a

play27:16

derivative

play27:18

so in attempts to

play27:22

implement backprop in neural nets

play27:25

you try and have top down things which

play27:28

are like derivatives

play27:30

and bottom-up things which are like

play27:31

activities

play27:33

and you try and use temporal differences

play27:35

to give you the derivatives

play27:37

and that's somewhat different

play27:41

here everything's activities you're

play27:43

never propagating derivatives

play27:47

and this algorithm also

play27:52

does away with the idea of dynamic

play27:54

routing that you talked about with the

play27:57

stacked capsule autoencoders yeah yes so

play28:00

with capsules I moved on from the

play28:04

dynamic routing to having what are

play28:07

called Universal capsules

play28:10

capsule would be a small collection of

play28:13

neurons

play28:14

and in the original capsules models that

play28:17

collection of neurons would only be able

play28:19

to represent one type of thing like a

play28:20

nose and a different kind of capsule

play28:22

would represent a mouth

play28:24

in Universal capsules what you'd have is

play28:27

that each capsule

play28:30

could represent any type of thing so it

play28:32

would have different activity patterns

play28:34

to represent the different kinds of

play28:35

things that might be there the capsule

play28:37

would be dedicated to a location in the

play28:39

image so a capsule will be representing

play28:41

what kind of thing you have at that

play28:43

location at a particular level of

play28:46

the part-whole hierarchy

play28:48

so it might be representing that at

play28:50

the part level you have a nose

play28:53

um and then at a higher level you'd have

play28:55

other capsules that are representing

play28:56

other at the object level you have a

play28:59

face or something

play29:01

but when you get rid of the dedication

play29:04

of a bunch of neurons to a particular

play29:06

type of thing you don't need to do

play29:07

routing anymore

play29:09

and in the forward fold algorithm

play29:12

I'm not doing routine and one of the

play29:14

diagrams in the forward-forward paper

play29:16

is actually taken from my paper on

play29:19

pothole hierarchies my last paper on

play29:21

capsule models

play29:22

so I had a system called GLOM an

play29:24

imaginary system and the problem with it

play29:26

was I never had a plausible learning

play29:28

algorithm for it and the forward-forward algorithm is a

play29:30

plausible learning algorithm for GLOM it's

play29:32

something that's neurally reasonable

play29:35

what was fascinating to me at least

play29:37

about capsules is that they captured the

play29:41

3D nature of reality right lots of

play29:44

neural Nets are now doing that

play29:46

so NeRF models neural radiance field

play29:49

models

play29:50

now giving you very good 3D models in

play29:54

neural Nets so you can see something

play29:57

from a few different viewpoints

play30:00

and then

play30:01

produce an image of what it would look

play30:03

like from a new viewpoint

play30:05

that's very good for example making

play30:08

smooth videos

play30:09

from frames that are taken at quite long

play30:13

time intervals but in the forward

play30:16

forward algorithm what's your intuition

play30:20

that that this is the if indeed

play30:23

everything works out that this is a

play30:26

model for information processing in the

play30:29

cerebral cortex

play30:31

and that

play30:32

perception of depth and the 3D nature of

play30:37

reality would emerge

play30:40

yeah

play30:43

yeah in particular if I'm showing you a

play30:46

video

play30:47

and the Viewpoint is changing during the

play30:49

video

play30:51

then

play30:54

what you'd want is that the hidden

play30:56

layers should represent 3D structure

play30:59

that's all pie in the sky at present we haven't

play31:02

yet reached that stage but yeah but with

play31:04

capsules because I think you you

play31:08

referred to pixels having depth

play31:12

so that if one object moved in front of

play31:16

another the system understood that the

play31:18

that it was behind

play31:20

the thing in front of it

play31:23

do you capture that with forward-forward

play31:27

you would want it to learn to deal with

play31:29

that yes yeah I wouldn't wire that in

play31:33

but it's an obvious feature video that

play31:36

it should learn about with babies

play31:40

they learn in just a few days to get

play31:44

structure from motion that is if I take

play31:47

a static scene

play31:49

and I move the Observer

play31:51

or if I take keep the Observer

play31:53

stationary

play31:55

and the experiments were done with a

play31:57

piece of paper folded into a w

play32:01

and if you see it the wrong way around

play32:02

it looks weird

play32:06

and so

play32:08

experiments done by Elizabeth Spelke and

play32:10

other people use the idea that

play32:13

you can tell a lot about the perception

play32:16

of a baby by seeing what they're

play32:18

interested in because they're interested

play32:20

in things that look odd and so they'll

play32:21

pay more attention to things that look

play32:23

odd and within a few days

play32:27

they learn to deal with how 3D structure

play32:31

ought to be related to motion and if you

play32:33

make it related wrong they think it's

play32:35

weird

play32:37

so they learn that very fast whereas it

play32:39

takes them like at least six months I

play32:41

think to learn to do stereo

play32:43

to get it from the two eyes it's just

play32:46

much easier to get from video than from

play32:48

stereo but from evolutionary point of

play32:50

view if something's really easy to learn

play32:52

there's not much Point wiring it in

play32:54

you've been working in Matlab famously

play32:57

now

play32:58

on toy problems are you starting to

play33:02

scale are you still refining

play33:05

I'm doing a bit of scaling I'm using a

play33:07

GPU to make these go a bit faster but

play33:10

I'm still at the stage where there's

play33:12

very basic properties of the algorithm

play33:13

I'm exploring in particular how to

play33:17

generate negative data effectively from

play33:19

the model

play33:20

and until I've got the sort of basic

play33:22

stuff working nicely

play33:23

I think it's silly to scale it up as

play33:26

soon as you scale it up it's slower to

play33:29

investigate changes in the basic

play33:31

algorithm and I'm still at the stage

play33:33

where there's lots and lots of different

play33:34

things I want to investigate for example

play33:37

here's just one little thing that I

play33:39

haven't had time to investigate yet you

play33:41

can use

play33:42

as your objective function to have high

play33:44

activity

play33:45

in the positive phase and low activity

play33:47

in the negative phase
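The objective just described — high activity on positive data, low activity on negative data — can be sketched as a single forward-forward layer with a purely local update. This is my own reconstruction from the description, not Hinton's code; the layer sizes, learning rate, and toy inputs are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(8, 4))
lr = 0.03

def forward(x):
    return np.maximum(0.0, W @ x)        # ReLU activities of the layer

def goodness(h):
    return float(np.sum(h ** 2))         # sum of squared activities

def update(x, positive):
    """Raise goodness on positive data, lower it on negative data."""
    global W
    h = forward(x)
    sign = 1.0 if positive else -1.0
    W += sign * lr * 2.0 * np.outer(h, x)  # local gradient of the goodness

x_pos = np.array([1.0, 1.0, 0.0, 0.0])   # stands in for real data
x_neg = np.array([0.0, 0.0, 1.0, -1.0])  # stands in for negative data

g0_pos, g0_neg = goodness(forward(x_pos)), goodness(forward(x_neg))
for _ in range(50):
    update(x_pos, positive=True)         # positive phase
    update(x_neg, positive=False)        # negative phase
print(goodness(forward(x_pos)) >= g0_pos, goodness(forward(x_neg)) <= g0_neg)
```

No error signal ever crosses the layer boundary: each layer only needs its own inputs and activities, which is the point of the contrast with backpropagation.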

play33:49

and if you do that it'll find nice

play33:52

features in the hidden units

play33:54

or you can have as your objective

play33:56

function to have low activity in the

play33:57

positive phase

play33:59

if you do that it'll find nice

play34:01

constraints

play34:02

if you think about what physicists do

play34:05

they try and understand nature

play34:08

by finding apparently different things

play34:10

that add up to zero

play34:12

another way of saying is that they're

play34:14

equal and opposite but

play34:16

if you take force and you subtract mass

play34:19

times acceleration you get zero

play34:21

but that's a constraint

play34:23

okay

play34:25

so if you have two sorts of information

play34:26

one of which is force and the other

play34:28

which is mass times acceleration

play34:31

you'd like to

play34:33

have hidden units that see both those

play34:36

inputs and that say zero

play34:40

no activity

play34:42

and then when they see things that don't

play34:45

fit the physics

play34:47

they'll have high activity they'll be

play34:48

the negative data

play34:50

so that's called a constraint

play34:52

and so if you make your objective

play34:54

function B have low activity for real

play34:55

things and high activity for

play34:57

things that aren't real you'll find

play34:59

constraints in the data as opposed to

play35:01

features

play35:03

so features are things that have high

play35:05

variance and constraints are things that

play35:06

have low variance

play35:07

a feature is something that's got higher

play35:09

variance than it should have and a constraint

play35:11

has lower variance than it should now

play35:12

there's no reason why you shouldn't

play35:14

have two types of neurons one's looking

play35:17

for features and one's looking for

play35:18

constraints

play35:19

and we know with just linear models

play35:22

that

play35:23

a method like principal components

play35:25

analysis

play35:26

looks for the directions in the space at

play35:29

the highest variance they're like

play35:31

features

play35:32

and it's very stable

play35:33

there's other methods like minor

play35:35

components analysis that look for

play35:37

directions in the space that have the

play35:38

lowest variance they're looking for

play35:40

constraints

play35:41

they're less numerically stable
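The linear case mentioned here is simple to demonstrate (my example): on data lying near a line, the principal component points along the high-variance direction (a feature) and the minor component along the low-variance direction (a constraint the data almost always satisfies).

```python
import numpy as np

rng = np.random.default_rng(0)
# 2-D data lying near the line y = x: high variance along (1,1), low along (1,-1)
t = rng.normal(size=500)
X = np.stack([t, t + 0.05 * rng.normal(size=500)], axis=1)

cov = np.cov(X.T)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order

minor = eigvecs[:, 0]        # constraint direction, roughly (1,-1)/sqrt(2)
principal = eigvecs[:, -1]   # feature direction, roughly (1,1)/sqrt(2)
print(np.var(X @ minor), np.var(X @ principal))  # low vs high variance
```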

play35:45

but we know that it pays to have both

play35:50

and so that for example is a direction

play35:52

that might make things work better but

play35:54

there's lots

play35:56

there's about 20 things like that I need

play35:57

to investigate

play35:59

and my feeling is until I've got a good

play36:03

recipe for whether you should use

play36:05

features or constraints or both

play36:09

what's the most effective way to

play36:10

generate negative data and so on

play36:12

it's premature to investigate really big

play36:15

systems

play36:16

with regard to really big systems one of

play36:18

the things you talk about is the need

play36:21

for a new kind of computer and I've seen

play36:25

confusion about this too in the Press

play36:28

I've seen people talk about how you talk

play36:31

about getting rid of the von Neumann architecture

play36:34

yeah you obviously want computers where

play36:37

the hardware and software are separate

play36:39

yeah and you want them to do things like

play36:41

keep track of your bank account

play36:45

this is for things that where we want

play36:48

computers to be like people to process

play36:51

natural language to process vision all

play36:55

those things that

play36:56

some years ago Bill Gates said computers

play36:58

couldn't do like they're blind and deaf

play37:01

they're not blind and deaf anymore but

play37:04

for processing natural language or doing

play37:06

motor control or doing Common Sense

play37:09

reasoning

play37:11

we probably want a different kind of

play37:14

computer if we want to do a very low

play37:15

energy

play37:16

we need to make much better use of all

play37:18

the properties of the hardware your

play37:20

interest is understanding the brain well

play37:23

I have a side interest in getting low

play37:25

energy computation going and the point

play37:27

about the forward forward is it works

play37:28

when you don't have a good model of the

play37:30

hardware so if for example I take a

play37:34

neural net and I insert a black box so I

play37:36

have a layer that's just a black box I

play37:38

have no idea how it works

play37:40

it does stochastic things

play37:44

I don't know what's going on

play37:46

the question is can the whole system

play37:48

learn with that black box in there

play37:50

and it has absolutely no problem you've

play37:53

done something different because the

play37:54

black box is changing what happens on

play37:55

the forward pass

play37:57

but the point is it's changing into

play37:58

exactly the same way for both forward

play38:00

passes so it all cancels out

play38:02

whereas in back propagation you're

play38:04

completely sunk with this black box the

play38:05

best you can do is try and learn a

play38:07

differentiable model of the black box

play38:09

and that's not going to be very good if

play38:10

the black box is wandering in its

play38:12

Behavior

play38:13

so the forward algorithm doesn't need to

play38:17

have a perfect model of a forward system

play38:18

it needs to have a good enough model of

play38:20

what one neuron is doing so that it can

play38:23

change the incoming weights of that

play38:25

neuron to make it more active or less

play38:26

active but that's all it needs it

play38:29

doesn't need to be able to

play38:31

invert the forward pass
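The black-box point can be sketched in code (assumptions and numbers mine, not Hinton's): each forward-forward layer updates itself from its own inputs and activities, so an opaque, non-differentiable stage between layers does not stop the surrounding layers from learning, whereas backprop would need that stage's derivatives.

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(x):
    # Unknown hardware: we can run it, but we cannot differentiate through it.
    return np.sign(x) * np.sqrt(np.abs(x))

W1 = rng.normal(scale=0.1, size=(6, 4))
W2 = rng.normal(scale=0.1, size=(6, 6))
lr = 0.02

def goodness(h):
    return float(np.sum(h ** 2))

def local_update(W, x, positive):
    # Each layer only needs its own input x and activity h -- a local rule.
    h = np.maximum(0.0, W @ x)
    sign = 1.0 if positive else -1.0
    return W + sign * lr * 2.0 * np.outer(h, x)

x_pos = np.array([1.0, 1.0, 0.0, 0.0])
x_neg = np.array([0.0, 0.0, 1.0, -1.0])

g0 = goodness(np.maximum(0.0, W1 @ x_pos))
for _ in range(30):
    for x, positive in [(x_pos, True), (x_neg, False)]:
        h1 = np.maximum(0.0, W1 @ x)
        W1 = local_update(W1, x, positive)
        z = black_box(h1)                     # opaque stage between the layers
        W2 = local_update(W2, z, positive)    # learning continues past the box
print(goodness(np.maximum(0.0, W1 @ x_pos)) >= g0)
```

Because the black box behaves identically on the positive and negative passes, its unknown behavior cancels out of the contrast between the two phases.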

play38:33

and you're not talking about replacing

play38:36

back propagation which has obviously had

play38:39

enormous success there's plenty of

play38:41

compute plenty of power then

play38:44

back crop is fine but and this is

play38:47

speculative I understand where you are

play38:49

in the research but can you imagine if

play38:52

you had low power

play38:53

computer

play38:55

architecture that could handle forward-forward

play38:59

algorithms and you scale them imagine

play39:02

that it would be great I've actually

play39:04

been talking to someone called Jack

play39:05

Kendall who works for a company called

play39:07

rain who is very insightful about what

play39:10

you can do with analog Hardware using

play39:13

properties of the circuits using the

play39:15

properties of the electrical

play39:16

circuits natural properties of

play39:18

electrical circuits

play39:20

um initially

play39:21

that was very interesting for doing a

play39:24

form of Boltzmann machine learning

play39:27

but it's also going to be very

play39:28

interesting for the forward algorithm so

play39:30

I can imagine it's scaling up very well

play39:32

but there's a lot of work to be done to

play39:34

make that happen

play39:35

and if it did scale up very well to the

play39:38

degree that large language models have

play39:41

been successful do you think that its

play39:44

abilities would Eclipse those of models

play39:50

based on back propagation I'm not at all

play39:53

sure I think they may not so I think

play39:56

back propagation might be a better

play39:57

algorithm in the sense that a given

play39:59

number of connections you can get more

play40:02

knowledge into those connections using

play40:03

back propagation than you can with the

play40:05

forward-forward algorithm

play40:07

so the networks for forward-forward work

play40:10

better if they're somewhat bigger than

play40:12

the best size networks for back

play40:13

propagation

play40:14

it's not good at squeezing a lot of

play40:16

information into a few connections

play40:18

back propagation will squeeze lots of

play40:20

information into a few connections if

play40:22

you force it to

play40:24

it is much more happy not having to do

play40:26

that but it'll do it if you force it to

play40:28

and the forward-forward algorithm isn't good at

play40:30

that

play40:31

so if you take these large language

play40:32

models

play40:34

so take something with a trillion

play40:36

connections

play40:38

which is about the largest language

play40:39

model that kind of size

play40:43

that's about a cubic centimeter of

play40:45

Cortex

play40:46

and our cortex is like we got a thousand

play40:49

times that much cortex

play40:52

so these large language models that

play40:54

actually know a lot more facts than you

play40:56

or I do

play40:57

because they've read everything on the

play40:59

web not everything but an awful lot yeah

play41:03

the sense in which they know them is a

play41:05

bit dodgy but

play41:07

if you had a sort of general knowledge

play41:09

quiz

play41:10

I think GPT-3 even would beat me at a

play41:14

general knowledge quiz

play41:16

there'd be all sorts of people it knows

play41:18

about and when they were born and what

play41:19

they did but I don't know about it and

play41:22

it all fits in a cubic centimeter of cortex

play41:24

if you measure by connections

play41:26

so it's got much more knowledge than me

play41:28

in much less brain

play41:31

so I think backprop is much better at

play41:33

squeezing information

play41:35

but that's not the brain's main problem

play41:38

for brains we've got plenty of

play41:40

synapses the question is how do you

play41:42

effectively get information into them

play41:45

how do you make good use of experience

play41:48

David Chalmers talked about the

play41:52

possibility of Consciousness and you're

play41:54

certainly interested in

play41:57

the possibility if you understand how

play42:00

the brain works and you can

play42:02

replicate it this kind of a model let's

play42:05

imagine that it scales beautifully

play42:09

do you see the potential for reasoning

play42:12

and

play42:13

oh I see the potential for reasoning

play42:15

sure

play42:16

but Consciousness is a different kind of

play42:18

question so I think people

play42:22

I'm amazed that anybody thinks they

play42:25

understand what they're talking about

play42:26

when they talk about consciousness

play42:28

they talk about it as if we can define it

play42:30

and it's really a jumble of a whole

play42:33

bunch of different concepts yeah and

play42:35

they're all mixed together into this

play42:37

attempt to explain a really complicated

play42:40

mechanism in terms of an essence

play42:43

yeah so we've seen that before like 100

play42:46

years ago if you asked philosophers what

play42:49

makes something alive or even if you ask

play42:51

biologists what makes something alive

play42:53

they say Well it has vital force but if

play42:56

you say what is vital force and can we

play42:58

make machines have vital force

play43:00

they can't really Define vital force

play43:02

other than saying is what makes people

play43:03

alive and as soon as you start

play43:05

understanding biochemistry

play43:08

you give up on the notion of vital force

play43:10

you understand about biochemical

play43:13

processes that are stable and things

play43:14

breaking down and

play43:17

so it's not that we cease to have vital

play43:20

force we've got as much vital force as

play43:22

we had before it's just that it's not a

play43:24

useful concept because in an attempt to

play43:27

explain something complicated in terms

play43:29

of some simple essence

play43:33

so

play43:34

another model like that is

play43:37

so sports cars have oomph and some have

play43:40

a lot of it

play43:42

like an Aston Martin with big noisy

play43:44

exhausts and lots of acceleration and

play43:46

bucket seats

play43:48

has lots of oomph

play43:50

and

play43:52

oomph is an intuitive concept you can

play43:55

ask doesn't an Aston Martin have more

play43:57

oomph than my Toyota Corolla and it

play43:59

definitely has a lot more oomph so we

play44:01

really need to find out what oomph is

play44:03

because umph is what's it what it's all

play44:04

about if you're interested in cars or

play44:07

fast cars anyway

play44:09

but the concept of oomph it's a perfectly

play44:11

good concept but it doesn't really

play44:13

explain much but if you want to know why

play44:16

is it that when I press the accelerator

play44:17

it goes very fast the concept of oomph

play44:19

isn't going to help you you need to get

play44:21

into the mechanics of it

play44:23

how it actually works

play44:25

and that that's a good analogy because

play44:28

what I was going to say is it doesn't

play44:30

really matter what Consciousness is it

play44:32

matters whether

play44:35

we as humans perceive something as

play44:38

having Consciousness and I think there's

play44:40

a lot to I think there's a lot to be

play44:42

said to that yes yeah so if this

play44:44

forward-forward large model scaled with

play44:48

relatively low power consumption

play44:51

if it can reason there'll always be

play44:54

philosophers that say yeah but it's not

play44:56

conscious

play44:57

but it doesn't really matter if you

play44:59

can't tell the difference it matters to

play45:01

the philosophers I think it would be

play45:04

nice to show them the way out of

play45:06

their trap they make for themselves

play45:07

which is I think most people have a

play45:10

radical misunderstanding of how terms

play45:13

about perception and experience and

play45:16

sensation and feelings actually work

play45:17

and how the language works

play45:20

if for example I say I'm seeing a pink

play45:22

elephant notice the words pink and

play45:25

elephant refer to things in the world

play45:28

so what's actually happening is I'd like

play45:30

to tell you what's going on inside my

play45:31

head yeah

play45:34

but telling you what the neurons are

play45:36

doing won't do you much good

play45:37

particularly since all our brains are

play45:39

wired slightly differently it's just no

play45:41

use to you to tell you what the neurons

play45:42

are doing

play45:44

but I can tell you that whatever it is

play45:47

my neurons are doing it's the kind of

play45:49

thing that's normally caused by Pink

play45:51

Elephant being out there if I was doing

play45:53

veridical perception the cause of my

play45:56

brain state would be a pink elephant I

play45:58

can tell you that and that doesn't mean

play46:00

a pink elephant exists in some spooky

play46:04

thing inside my head or it's just a

play46:06

mental thing what it really tells you is

play46:08

I'm giving you a counterfactual I'm

play46:11

saying the world doesn't really contain

play46:13

a pink elephant but if it did contain a

play46:15

pink elephant

play46:16

that would explain my brain state

play46:19

that plus normal perceptual causation

play46:21

will explain my brain state

play46:23

so when I say I'm having the experience

play46:26

of a pink elephant the word experience

play46:31

many people think experience refers to

play46:33

some funny internal goings on it's an

play46:35

experience it's some internal no what

play46:38

I'm denoting when I use the word

play46:39

experience is that it's not real

play46:44

I'm giving you a hypothetical

play46:46

statement but if this hypothetical thing

play46:48

were out there in the world that would

play46:51

explain this brain State and so I'm

play46:53

giving you insight into my brain state

play46:54

by talking about a hypothetical world

play46:57

what's not real about experience is that

play46:59

it's a hypothetical I'm giving them it's

play47:01

not that it lives in some other Spooky

play47:03

World

play47:04

and it's the same for feelings

play47:06

if I say I feel like hitting you

play47:10

what I'm doing is I'm giving you

play47:14

a sense of what's going on in my head

play47:17

via what it would normally cause so in

play47:21

perception it's the world causing a

play45:23

perceptual state with feelings it's the

play47:26

internal State causing an action and I'm

play47:28

giving you insight into my internal

play47:30

state by telling you what kind of action

play47:33

it would cause now

play47:35

I might feel like hitting you or anybody

play47:37

else or kicking the cat or whatever in

play47:40

which case I instead of giving you any

play47:42

one of those actions I just use a term

play47:44

like angry

play47:46

but really that shorthand for all those

play47:48

angry actions

play47:50

so I'm giving you

play47:52

I'm giving you a way of seeing what's

play47:54

going on in my head via describing

play47:56

actions I might do but they're just

play47:58

hypothetical actions

play48:00

and that's what the word feel means when

play48:02

I say I feel

play48:04

typically if I say I feel and then say I

play48:08

feel like blah

play48:10

it's not that there's some special

play48:11

internal Essence that's feeling and

play48:13

computers don't have it computers are

play48:15

just transistors they don't have feeling

play48:17

you have to have a soul to have feeling

play48:19

or something

play48:20

no I'm describing my internal State via

play48:23

the actions it would cause if I were to

play48:25

disinhibit it

play48:27

from another human's point of view if

play48:29

you were a machine and you were saying

play48:33

things like that I would perceive it

play48:35

as you having feelings

play48:38

right so let's take the perception cases

play48:40

it's slightly simpler I think suppose we

play48:42

make a

play48:43

big complicated neural network that can

play48:45

do perception and can also produce

play48:47

language we have those now yeah

play48:51

and so you can show them an image and

play48:53

they can give you a description of what's

play48:55

there

play48:56

and suppose we now take one of those

play48:58

networks and we say

play49:01

I want you to just imagine something

play49:04

and okay so it imagine something

play49:07

and then it tells you what it's

play49:09

imagining so it says I'm experiencing a

play49:11

pink elephant

play49:13

that's experiencing the Pink Elephant

play49:15

just as much as a person is when they

play49:16

say they experience a pink elephant

play49:17

it's got an internal perceptual state

play49:19

that would normally be caused by a pink

play49:21

elephant but in this case it's not

play49:22

caused by a pink elephant and so it uses

play49:24

the word experience to denote that there

play49:27

you go I think it's got just as much

play49:30

perceptual Sensations as we have

play49:32

Although at the current state, large language models don't exhibit that kind of cohesive internal logic, you know. But they will. They will? You think they will? Oh yeah, yeah. I don't think consciousness is, well, people treat it like it's like the sound barrier, that you're either below the speed of sound or you're above the speed of sound, you've either got a model that hasn't yet got consciousness or you've got there. It's not like that at all.
I think a lot of people were impressed by you talking about using Matlab. I'm not sure impressed is the right word, they were interested, they were surprised. But what is your day-to-day work like? You have other responsibilities, but do you spend more time on conceptualizing, and that could happen while taking a walk or taking a shower, or do you spend more time on experimenting, like on Matlab, or do you spend more time on running large experiments?
Okay, it varies a lot over time. So I'll often spend a long time, like when I wrote that paper about GLOM, I spent a long time thinking about how to organize a perceptual system that was more neurally realistic and could deal with part-whole hierarchies without having to do dynamic setting up of connections. And so I spent many months just thinking about how to do that and writing a paper about that. I spent a lot of time trying to think about more biologically plausible learning algorithms, yes, and then programming little systems in Matlab and discovering why they don't work. So the point about most original ideas is they're wrong.
And Matlab's very convenient for quickly showing that they're wrong on very small toy problems, like recognizing handwritten digits. I'm very familiar with that task, so I can very quickly test out an idea to see if it works. I've probably got thousands of programs on my computer that didn't work well, that I programmed in an afternoon, and an afternoon was sufficient to decide, okay, that's probably not going to work. You never know for sure, because there might be some little trick you didn't think of. And then there will be periods when I think I've got onto something that does work, and I'll spend several weeks programming and running things to see if it works.
Yeah, I've been doing that recently with the forward-forward. Let me say why I use Matlab. I learned lots of languages when I was young. I learned POP-2, which was an Edinburgh language, UCSD Pascal, a Lisp, Common Lisp, Scheme, all sorts of Lisps, and vanilla Matlab, which is ugly in some ways, but if you're dealing with vectors and matrices it's what you want, it makes it convenient, and I became fluent in Matlab. And I should have learned Python, and I should have learned all sorts of other things, but when you're old you're much slower at learning languages, and I'd learned plenty of them. And I figured, since I'm fluent in Matlab and I can test out little ideas in Matlab, and other people can then test out running their own big systems, I would just stick with testing out things on Matlab.
There's a lot of history that shaped me, but it's also very convenient.

And you talk a lot about
convenient and you talk a lot about

play53:06

learning in toddlers And

play53:10

is that knowledge base something you

play53:13

accumulated

play53:15

years ago or are you continuing to read

play53:19

and talk to people in different fields I

play53:23

talk to a lot of people and I learned

play53:25

most things from talking to people I'm

play53:27

not very good at reading it I read very

play53:29

slowly and when I come to equations they

play53:31

slow me up a lot so I've learned most of

play53:34

what I know from talking to people

play53:36

and I'm lucky it's only got lots of good

play53:39

people to talk to like I talked to Terry

play53:40

sonoski and he tells me about all sorts

play53:42

of Neuroscience things I talked to Josh

play53:44

Tenenbaum when he tells me about all

play53:46

sorts of cognitive science things

play53:48

I talked to James Howell and he tells me

play53:50

lots of kind of science psychology

play53:51

things

play53:52

so I get most of my knowledge just from

play53:55

talking to people

play53:56

Yann LeCun, you mentioned him. Yeah, he corrected my pronunciation of his name: LeCun. Why did you reference him in that talk? Oh, because for many years he was pushing convolutional neural networks. Oh, okay. And the vision community said, okay, they're fine for little things like handwritten digits, but they'll never work for real images. And there was a famous paper submitted to a conference where he and his co-workers actually did better than any other system on a particular benchmark. I think it was segmenting pedestrians, but I'm not quite sure, it was something like that. And the paper got rejected even though it had the best results.
And one of the referees, the reason they were rejecting the paper, said it was because the system learned everything, so it taught us nothing about vision. And this is a wonderful example of a paradigm. The paradigm for computer vision was: you study the task that has to be performed, the computation that has to be performed, you figure out an algorithm that will do that computation, and then you figure out how to implement it efficiently. And so the knowledge is all explicit. The knowledge it's using to do the vision is explicit, you have to sort it out mathematically and then implement it, and it's sitting there in the program. And they just assumed that's the way computer vision is going to work. And because computer vision has to work that way, if someone comes along and just learns everything, they're no use to you, because they haven't said what the knowledge is, what the heuristic is that you're using. And so it's, okay, maybe it works, but that's just good luck; in the end we're bound to work better than that, because we're using real knowledge, shouldn't we understand what's going on? So they completely failed to get the main message, which was that it learned everything.
Not quite everything, because you're wiring in convolution. But the machine learning community respected him because he's obviously a smart guy, but they thought he was on completely the wrong path, and they dismissed his work for years and years. And then when Fei-Fei Li and her collaborators produced the ImageNet competition, finally we had a big enough dataset to show that neural networks would really work well. And Yann actually tried to get several different students to make a serious attempt at doing ImageNet with convolutional nets, but he couldn't find a student who was interested in doing it. At the same time, Ilya became very interested in doing it, and I was interested in doing it, and Alex Krizhevsky was a superb programmer who put a lot of hard work into making it work really well. So it was very unfortunate for Yann that it wasn't his group that finally convinced the computer vision community: actually, this stuff works much better than what you're doing.
You've now put this paper out. Are you hoping to ignite sort of an army of people trying it? Are you going to put some simple Matlab code out there too? Yeah, because there's a bunch of little things you have to do, otherwise it won't work, and the code needs to be there. It's more picky than backprop. With back propagation, you just show people the equations and anybody can go and implement it, and it doesn't need a lot of tricks to work quite well. To work really well it needs lots of tricks, but it'll work quite well, it's fine. With the forward-forward, you need a few tricks for it to work at all. The tricks are quite reasonable tricks, but once you put them in there, then it works. And I want to put that Matlab code out there so other people can get it to work. But I didn't want to put my very primitive Matlab code out there, because it's disgusting.
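For readers who want a feel for the kind of thing that code would do before Hinton's version is released, here is a toy, layer-local sketch of the idea the episode summary describes: each layer is trained so that its "goodness" (the sum of squared activities) is high on real, positive data and low on negative data, with no backward pass through the rest of the network. This is a hedged illustration, not Hinton's Matlab code; the layer sizes, learning rate, threshold, and synthetic data blobs are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

class FFLayer:
    """One forward-forward layer, trained with a purely local objective.

    Goodness = sum of squared ReLU activities; we push it above a
    threshold on positive data and below it on negative data.
    """

    def __init__(self, n_in, n_out, lr=0.1, threshold=2.0):
        self.W = rng.normal(0.0, 0.1, (n_in, n_out))
        self.b = np.zeros(n_out)
        self.lr, self.threshold = lr, threshold

    def forward(self, x):
        # Normalize the input so only the direction of activity is passed
        # on, forcing the layer to find features rather than copy magnitude.
        x = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
        return np.maximum(0.0, x @ self.W + self.b)

    def train_step(self, x_pos, x_neg):
        # sign = +1 pushes goodness up (positive phase), -1 down (negative).
        for x, sign in ((x_pos, +1.0), (x_neg, -1.0)):
            xn = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
            h = np.maximum(0.0, xn @ self.W + self.b)
            goodness = (h ** 2).sum(axis=1)
            # Probability that the layer calls this sample "real",
            # a logistic function of (goodness - threshold).
            p = 1.0 / (1.0 + np.exp(-sign * (goodness - self.threshold)))
            # Gradient ascent on log p, local to this layer only.
            grad_h = (sign * (1.0 - p))[:, None] * 2.0 * h
            self.W += self.lr * (xn.T @ grad_h) / len(x)
            self.b += self.lr * grad_h.mean(axis=0)

# Toy "real" data (one blob) versus hand-made negative data (another blob).
x_pos = rng.normal(+1.0, 0.3, (256, 8))
x_neg = rng.normal(-1.0, 0.3, (256, 8))

layer = FFLayer(8, 16)
for _ in range(300):
    layer.train_step(x_pos, x_neg)

g_pos = (layer.forward(x_pos) ** 2).sum(axis=1).mean()
g_neg = (layer.forward(x_neg) ** 2).sum(axis=1).mean()
print(f"mean goodness: pos={g_pos:.2f} neg={g_neg:.2f}")
```

After training, mean goodness on the positive blob ends up well above the negative blob's, which is the only signal a deeper layer (or a final classifier) would need; a multi-layer version just stacks these layers, each trained with the same local rule on the normalized output of the one below.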

Thank you.
That's it for this week's podcast. I want to thank Jeff for his time. I also want to thank ClearML for their support. We're looking for more sponsors, so if you are interested in supporting the podcast, please email me at craig, c-r-a-i-g, at eye-on.ai, that's e-y-e hyphen o-n dot a-i. As always, you can find a transcript of this episode on our website, eye-on.ai. I encourage you to read the transcript if you're serious about understanding the forward-forward algorithm. In the meantime, remember: the Singularity may not be near, but AI is about to change your world, so pay attention.