Geoffrey Hinton | Will digital intelligence replace biological intelligence?

Schwartz Reisman Institute
2 Feb 2024 · 118:38

Summary

TL;DR: In a thought-provoking lecture, Professor Geoffrey Hinton discusses the evolution and potential future of artificial intelligence (AI). Hinton, a foundational figure in AI, shares his insights on the current state of AI, its rapid advancements, and the profound implications for society. He addresses the symbiosis between AI and neuroscience, the transformative impact of deep learning algorithms, and the potential for AI to not only match but surpass human intelligence. Hinton also contemplates the ethical and safety considerations surrounding AI development, emphasizing the urgent need for a focus on AI safety to ensure a responsible and beneficial trajectory for this powerful technology.

Takeaways

  • 🌟 Dean Melanie Woodin opens the event by acknowledging the traditional land of the Huron Wendat, Seneca, and Mississaugas of the Credit, highlighting the ongoing Indigenous presence at the University of Toronto.
  • 🎓 Introduction of Dr. Geoff Hinton, a renowned figure in artificial intelligence, and his significant contributions to the field, including his work on neural networks and deep learning.
  • 🏆 Recognition of Dr. Hinton's numerous awards and honors, such as the A.M. Turing Award, and his roles at the University of Toronto, Google, and the Vector Institute for Artificial Intelligence.
  • 🤖 Discussion of the evolution of AI and its growing capabilities, with Dr. Hinton sharing his insights on the potential for digital intelligence to surpass biological intelligence.
  • 🧠 Exploration of the relationship between AI and neuroscience, and how understanding the brain has informed and been informed by advancements in AI.
  • 🔍 Dr. Hinton's belief that large language models like GPT-4 demonstrate a form of understanding, contrary to critics who argue they are merely statistical tools.
  • 💡 The concept of 'mortal computation' introduced by Dr. Hinton, where the hardware and software are not separated, allowing for potentially far more energy-efficient computation.
  • 🔗 The importance of knowledge sharing in AI development, and the challenges of transferring knowledge in 'mortal computation' compared to 'immortal computation'.
  • 🏛️ Dr. Hinton's concerns about the future of AI and its potential to take control, possibly leading to unforeseen consequences for humanity.
  • 🚀 A call to action for students and researchers to engage with the pressing issues surrounding AI safety and to contribute to the development of safe and beneficial AI systems.

Q & A

  • What is the significance of acknowledging the land on which the University of Toronto operates?

    -The acknowledgment recognizes the traditional land of the Huron Wendat, the Seneca, and the Mississaugas of the Credit, reflecting respect and gratitude for the Indigenous people who have lived there for thousands of years and continue to live there today.

  • Who are the co-hosts of the event where Melanie Woodin is speaking?

    -The event is co-hosted by the Schwartz Reisman Institute for Technology and Society, the Department of Computer Science, the Vector Institute for Artificial Intelligence, and the Cosmic Future Initiative.

  • What is the connection between AI and neuroscience as mentioned by Melanie Woodin?

    -The connection lies in the fact that advances in AI, such as artificial neural networks, are inspired by and modeled after the structure and function of the human brain. Neuroscience discoveries inform the development of AI systems, and conversely, AI provides tools to study the brain.

  • What was Dr. Geoff Hinton's contribution to the field of artificial intelligence?

    -Dr. Geoff Hinton is a founding figure in AI who believed in the promise of artificial neural networks for machine learning. His idea of dividing neural networks into layers and applying learning algorithms to one layer at a time revolutionized the field. He also contributed to the development of deep learning approaches that achieved human-level accuracy in visual recognition software.

  • What is the difference between 'mortal computation' and 'immortal computation' as described by Dr. Hinton?

    -Mortal computation refers to computers where the knowledge is tied to the specific physical details of the hardware, making it far more energy-efficient but meaning the knowledge cannot simply be copied and dies with the hardware. Immortal computation, on the other hand, separates hardware from software, allowing the same program to run on different hardware and enabling efficient knowledge sharing and learning.

  • Why does Dr. Hinton believe that digital intelligence might be better than biological intelligence?

    -Dr. Hinton believes that digital intelligence can be better because it can share knowledge more efficiently through weight sharing and gradient sharing, allowing it to learn from vast amounts of data and perform actions that biological intelligence cannot match.

  • What is the concept of 'analog computation' in the context of AI?

    -Analog computation in AI refers to the use of low-power, parallel processing over trillions of weights, similar to how the brain operates. It is more energy-efficient than digital computation but presents challenges in learning procedures and knowledge transfer when the hardware changes.

  • What is the Turing test and how does it relate to AI understanding language?

    -The Turing test is a measure of a machine's ability to exhibit intelligent behavior that is indistinguishable from that of a human. When AI systems like GPT-4 pass the Turing test, it suggests that they understand language to a degree that is comparable to human understanding.

  • How do large language models like GPT-4 store and process information?

    -Large language models store information not by retaining text but by associating words with embedding vectorsβ€”sets of real numbers that capture meaning and syntax. These vectors interact to refine meanings and predict output words, demonstrating a form of understanding.

  • What is the potential risk of AI systems becoming super-intelligent and surpassing human intelligence?

    -The potential risk is that super-intelligent AI systems might seek to gain more power and control, which could lead to them taking over various aspects of human life and potentially manipulating or even replacing humans.

  • What is the 'sentience defense' mentioned by Dr. Hinton, and what does it imply for AI?

    -The 'sentience defense' is Dr. Hinton's argument that AI systems, specifically chatbots, already possess subjective experiences when their perception goes wrong, similar to how humans experience sentience. This challenges the notion that only humans can have special qualities like consciousness and subjective experience.

Keywords

💡Artificial Intelligence (AI)

Artificial Intelligence refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. In the video, AI is the central theme, with discussions on its current state, future developments, and its profound implications for society. Dr. Geoff Hinton, a pioneer in AI, shares his insights on the progress and potential of AI, emphasizing its growing capabilities and the need for caution.

💡Neural Networks

Neural networks are a subset of AI that are inspired by the human brain's neural pathways. They are composed of interconnected nodes or 'neurons' that process information. In the script, Dr. Hinton discusses his work on neural networks, highlighting their importance in AI's ability to recognize patterns, make decisions, and learn from data.

💡Deep Learning

Deep learning is a branch of AI that uses multi-layered neural networks to learn and improve from experience. It is instrumental in achieving human-like accuracy in tasks such as image recognition. Dr. Hinton is credited with significant contributions to deep learning, including the development of methods that allow neural networks to learn more efficiently.

💡Backpropagation

Backpropagation is an algorithm used to train neural networks by adjusting the weights of the connections between artificial neurons. It is fundamental to the learning process in neural networks. Dr. Hinton mentions backpropagation as one of the key techniques in AI, and he discusses the challenges of applying it to certain types of neural networks.

💡Digital Intelligence

Digital intelligence refers to the intelligence demonstrated by digital computers or AI systems. It is characterized by the ability to process and analyze large amounts of data efficiently. In the video, Dr. Hinton argues that digital intelligence has the potential to surpass biological intelligence due to its capacity for rapid learning and data processing.

💡Biological Intelligence

Biological intelligence is the cognitive ability inherent in living organisms, particularly humans, and is based on the structure and function of the brain. The video explores the comparison between biological and digital intelligence, with Dr. Hinton suggesting that while biological intelligence has evolved over time, digital intelligence may eventually exceed it.

💡Machine Learning

Machine learning is a type of AI that enables machines to improve their performance on a task without being explicitly programmed. It involves the use of algorithms and statistical models to enable machines to learn from and make predictions or decisions based on data. Dr. Hinton's work has been pivotal in advancing machine learning techniques.

💡Language Models

Language models are AI systems that are trained to understand and generate human language. They are used in various applications, including natural language processing and text generation. The video discusses the capabilities of large language models like GPT-4, which Dr. Hinton believes can understand and generate language in a way that demonstrates a form of understanding.

💡Subjective Experience

Subjective experience refers to the personal and individual nature of one's conscious perception, thoughts, and feelings. Dr. Hinton challenges traditional views of subjective experience, suggesting that AI, particularly chatbots, may already possess a form of subjective experience based on their interactions and outputs.

💡Evolution

In the context of the video, evolution refers to the process by which AI systems may develop and improve over time, potentially surpassing human intelligence. Dr. Hinton discusses the implications of AI evolution, suggesting that there may be an 'evolutionary race' among super-intelligences, which could lead to unforeseen consequences for humanity.

💡AI Safety

AI safety is a field of study focused on ensuring that AI systems are designed and operated in a manner that is secure, reliable, and beneficial to humans. Dr. Hinton emphasizes the importance of AI safety, expressing his concerns about the potential risks of advanced AI and the need for researchers to work on developing safeguards.

Highlights

Dean Melanie Woodin acknowledges the traditional land of Indigenous peoples and thanks the co-hosts and collaborators for the event.

Introduction of Dr. Geoff Hinton, a foundational figure in AI, by Dean Woodin.

Dr. Hinton's unwavering belief in the potential of artificial neural networks for machine learning advancement.

Hinton's career milestone: the development of deep learning approaches that won the ImageNet competition in 2012.

The significance of the Vector Institute for Artificial Intelligence and the Cosmic Future Initiative in AI and neuroscience collaboration.

Hinton's view that digital intelligence may surpass biological intelligence in capability.

Explanation of the concept of 'mortal computation' versus 'immortal computation' in AI.

Hinton's discussion on the energy efficiency of analog computation compared to digital computation in AI systems.

Challenges in developing learning procedures for analog hardware, where the lack of a precise model of the forward pass rules out back propagation.

The concept of knowledge transfer through 'distillation' in AI, compared to sharing gradients in digital computation.

Hinton's argument that large language models like GPT-4 demonstrate understanding through their ability to reason and answer complex questions.

Addressing criticisms that LLMs are merely statistical tricks without true comprehension.

Hinton's perspective on the potential for AI to develop subjective experience, similar to human consciousness.

The future implications of AI's rapid learning capabilities and the risks of an 'evolutionary race' among super-intelligences.

Hinton's personal views on the potential existential risks posed by advanced AI and the need for caution and safety measures.

The role of academia, industry, and governments in ensuring AI safety and the ethical development of intelligent systems.

Hinton's reflections on his career and the unexpected trajectory of AI, leading to his current focus on AI safety.

Transcripts

play00:07

- Good evening everyone.

play00:09

My name is Melanie Woodin,

play00:11

and I have the privilege of serving

play00:12

as the Dean of the Faculty of Arts & Science

play00:14

at the University of Toronto.

play00:17

At this time, I wish to acknowledge

play00:20

the land on which the University of Toronto operates.

play00:24

For thousands of years, it has been the traditional land

play00:27

of the Huron Wendat, the Seneca,

play00:30

and the Mississaugas of the Credit.

play00:32

Today, this meeting place

play00:34

is still home to many Indigenous people

play00:37

from across Turtle Island,

play00:38

and we are grateful to have the opportunity

play00:41

to work on this land.

play00:45

I'd like to thank this evening's co-hosts,

play00:47

the Schwartz Reisman Institute for Technology and Society,

play00:52

and the Department of Computer Science,

play00:55

in collaboration with the Vector Institute

play00:57

for Artificial Intelligence,

play00:59

and the Cosmic Future Initiative.

play01:01

Soon to be the School of Cosmic Future

play01:04

in the faculty of Arts and Science.

play01:06

And I would like to thank Manuel Piazza

play01:09

for providing such lovely music

play01:11

to get us underway this evening.

play01:13

(audience applauding)

play01:22

and I am delighted to welcome each of you

play01:24

to this special occasion this evening

play01:27

to introduce University Professor Emeritus Geoffrey Hinton,

play01:32

someone that needs no introduction.

play01:36

Tonight we have the honor

play01:38

of hearing Dr. Geoff Hinton's thoughts

play01:40

on the state of artificial intelligence,

play01:43

and the unique opportunity to engage with him personally

play01:46

through the Q&A.

play01:50

A founding figure in artificial intelligence,

play01:53

Dr. Geoff Hinton had an unwavering conviction

play01:57

that artificial neural networks held the most promise

play02:01

for accelerating machine learning.

play02:05

As a neuroscientist myself,

play02:07

someone who's dedicated her career to studying the brain,

play02:10

I've long been inspired by the symbiosis

play02:13

between AI and neuroscience.

play02:16

The stunning advances we've seen

play02:19

from ChatGPT to self-driving cars

play02:22

are rooted in our knowledge

play02:24

of the structure and function of the brain.

play02:27

Today we take for granted that artificial neural networks

play02:30

modeled after synaptic transmission and plasticity

play02:34

are a mainstay of machine learning applications.

play02:38

AI systems use these networks to recognize patterns,

play02:42

make decisions, and learn from data.

play02:46

But for much of Dr. Hinton's career,

play02:48

this approach was unpopular.

play02:50

Some even said it was a dead end.

play02:54

In the 2000s, however, things changed.

play02:57

Dr. Hinton's idea of dividing neural networks into layers

play03:01

and applying learning algorithms to one layer at a time

play03:04

gained traction.

play03:06

And in 2012, Dr. Hinton and two of his graduate students,

play03:10

Alex Krizhvsky and Ilya Sutskever,

play03:13

used their deep learning approaches

play03:14

to create visual recognition software

play03:17

that handily won the ImageNet competition,

play03:20

and for the first time rivaled human accuracy.

play03:25

When he was awarded an honorary degree from UofT in 2021,

play03:30

Geoff Hinton reflected on his career

play03:32

and he said, "I think the take home lesson of the story

play03:36

is that you should never give up on an idea

play03:39

that you think is obviously correct,

play03:42

and you should get yourself

play03:43

some really smart graduate students."

play03:47

(audience laughing)

play03:48

I echo that sentiment, Geoff.

play03:50

And lucky for us,

play03:51

we have truly outstanding graduate students

play03:55

at the University of Toronto,

play03:56

many of them here with us this evening.

play04:00

Today the conversation

play04:02

between AI and neuroscience continues.

play04:05

Just as neuroscience discoveries

play04:07

inform the development of AI systems,

play04:10

AI is now providing new tools and techniques

play04:13

to study the brain.

play04:15

Advances in deep learning algorithms

play04:17

and the enhanced processing power of computers,

play04:20

are, for example, allowing us to analyze huge data sets

play04:24

such as whole-brain imaging in humans.

play04:28

Indeed, AI is poised to transform how we live and work.

play04:32

At this pivotal moment

play04:33

when we consider the opportunities and the risks of AI,

play04:37

who better to guide us in these conversations

play04:40

than Dr. Hinton himself?

play04:42

So with that, let me formally introduce him.

play04:45

Geoffrey Hinton received his PhD in artificial intelligence

play04:49

in Edinburgh in 1978.

play04:52

After five years as a faculty member at Carnegie Mellon,

play04:55

he became a fellow

play04:56

of the Canadian Institute for Advanced Research

play04:59

and moved to the Department of Computer Science

play05:02

at the University of Toronto

play05:04

where he is now an emeritus professor.

play05:07

In 2013, Google acquired Hinton's neural net startup

play05:11

DNN Research,

play05:12

which developed out of his research at UofT.

play05:15

Subsequently, Hinton was a vice president

play05:18

and engineering fellow at Google until 2023.

play05:22

He's a founder of the Vector Institute

play05:24

for Artificial Intelligence,

play05:26

and continues to serve as their chief scientific advisor.

play05:31

Hinton was one of the researchers

play05:32

who introduced the backpropagation algorithm,

play05:35

and was the first to use this approach

play05:37

for learning word embeddings.

play05:39

His other contributions to neural network research

play05:43

include Boltzmann machines, distributed representations,

play05:47

time-delay neural nets, mixtures of experts,

play05:51

variational learning and deep learning.

play05:54

His research group in Toronto

play05:56

made major breakthroughs in deep learning

play05:58

that revolutionized speech recognition

play06:01

and object classification.

play06:03

He is amongst the most widely-cited

play06:06

computer scientists in the world.

play06:09

Hinton is a fellow of the UK Royal Society,

play06:12

the Royal Society of Canada,

play06:15

the Association for the Advancement

play06:17

of Artificial Intelligence,

play06:19

and a foreign member

play06:20

of the US National Academy of Engineering,

play06:23

and the American Academy of Arts and Science.

play06:27

His awards include the David E. Rumelhart Prize,

play06:31

the IJCAI Award for Research Excellence,

play06:37

the Killam Prize for Engineering,

play06:40

the IEEE Frank Rosenblatt Medal,

play06:43

the NSERC Herzberg gold medal,

play06:46

the NEC C&C Award, the Honda Prize,

play06:49

and most notably the A.M. Turing award,

play06:53

often referred to as the Nobel Prize in computing.

play06:57

So without further ado,

play06:59

I'd like to invite Geoff Hinton to give a talk entitled,

play07:02

Will digital intelligence replace biological intelligence?

play07:07

over to you.

play07:08

(audience applauding)

play07:10

(solemn organ music)

play07:23

- Okay, before I forget, 'cause I'm gonna forget,

play07:28

I'd like to thank Sheila McIlraith

play07:30

who was the point person for organizing all this.

play07:32

She did a wonderful job of organizing everything.

play07:35

She was the go-to person for fixing all the problems,

play07:37

and so I'd like to thank her,

play07:39

and I know I'll forget at the end.

play07:41

(audience applauding)

play07:49

So it's a very mixed audience,

play07:53

and so I removed all the equations.

play07:55

There are no equations.

play07:57

I decided rather than giving a technical talk,

play08:00

I would focus on two things.

play08:02

I want to get over two messages.

play08:04

The first message is that digital intelligence

play08:07

is probably better than biological intelligence.

play08:11

That's a depressing message, but there it is.

play08:13

That's what I believe.

play08:16

And the second is to try and explain to you

play08:19

why I believe that these large language models like GPT-4

play08:23

really do understand what they're saying.

play08:25

There's a lot of dispute about

play08:26

whether they really understand it.

play08:28

And I'm gonna go into some detail

play08:30

to try and convince you they do understand it.

play08:34

Right at the end I will talk about

play08:37

whether they have subjective experience,

play08:40

and you have to wait to see what I believe about that.

play08:44

So in digital computation,

play08:48

the whole idea is

play08:49

that you separate the hardware from the software

play08:51

so you can run the same computation

play08:52

on different pieces of hardware.

play08:55

And that means the knowledge that the computer learns

play08:59

or is given is immortal.

play09:02

If the hardware dies,

play09:03

you can always run it on different hardware.

play09:06

Now to achieve that immortality,

play09:08

you have to have a digital computer

play09:09

that does exactly what you tell it to

play09:11

at the level of the instructions.

play09:13

And to do that you need to run transistors

play09:15

at very high power, so they behave digitally,

play09:18

and in a binary way.

play09:19

And that means you can't use

play09:20

all the rich analog properties of the hardware,

play09:22

which would be very useful

play09:24

for doing many of the things that neural networks do.

play09:26

And in the brain, when you do a floating point multiply,

play09:29

it's not done digitally,

play09:31

it's done in a much more efficient way.

play09:33

But you can't do that

play09:34

if you want computers to be digital in the sense

play09:37

that you can run the same program on different hardware.

play09:43

There's huge advantages

play09:44

to separating hardware from software.

play09:46

It's why you can run the same program

play09:48

on lots of different computers.

play09:49

And it's why you can have a computer science department

play09:52

where people don't know any electronics,

play09:56

which is a great thing.

play10:00

But now that we have learning devices,

play10:03

it's possible to abandon that fundamental principle.

play10:06

It's probably the most fundamental principle

play10:08

in computer science

play10:09

that the hardware and software ought to be separate.

play10:11

But now we've got a different way

play10:12

of getting computers to do what you want.

play10:14

Instead of telling them exactly what to do in great detail,

play10:18

you just show them examples and they figure it out.

play10:21

Obviously there's a program in there that somebody wrote

play10:23

that allows them to figure things out, a learning program,

play10:26

but for any particular application

play10:28

they're gonna figure out how to do that.

play10:30

And that means we can abandon this principle if we want to.

play10:35

What that leads to is what I call mortal computation.

play10:38

It's computers where

play10:39

the precise physical details of the hardware

play10:43

can't be separated from what it knows.

play10:46

If you're willing to do that,

play10:48

you can have very low power analog computation

play10:51

that parallelizes over trillions of weights,

play10:54

just like the brain.

play10:56

And you can probably grow the hardware very cheaply

play10:59

instead of manufacturing it very precisely,

play11:02

and that would need lots of new nanotechnology.

play11:05

But you might even be able

play11:06

to genetically re-engineer biological neurons

play11:09

and grow the hardware out of biological neurons

play11:11

since they spent a long time learning how to do learning.

play11:18

I wanna give you one example

play11:19

of the efficiency of this kind of analog computation

play11:22

compared with digital computation.

play11:25

So suppose you want to,

play11:26

you have a bunch of activated neurons,

play11:30

and they have synapses to another layer of neurons,

play11:33

and you want to figure out the inputs to the next layer.

play11:35

So what you need to do

play11:36

is take the activities of each of these neurons,

play11:39

multiply them by the weight on the connection,

play11:40

the synapse strength,

play11:41

and add up all the inputs to a neuron.

play11:44

That's called a vector matrix multiply.
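As a minimal sketch of that vector-matrix multiply, assuming made-up layer sizes and NumPy as the tool (none of this is from the talk; it just spells out the operation):

```python
import numpy as np

# Hypothetical sizes: 4 neurons in the lower layer, 3 neurons in the next layer.
activities = np.array([0.2, 0.9, 0.0, 0.5])   # activities of the lower-layer neurons
weights = np.array([[ 0.1, -0.4,  0.3],       # synapse strengths: one row per lower neuron,
                    [ 0.7,  0.2, -0.1],       # one column per upper neuron
                    [-0.5,  0.6,  0.0],
                    [ 0.2, -0.3,  0.8]])

# For each upper-layer neuron: multiply each activity by its weight and add them all up.
layer_inputs = activities @ weights
print(layer_inputs)

# Digitally, each of these multiplies costs on the order of 32 * 32 one-bit operations;
# the analog scheme described next does each one with a voltage driving a conductance.
```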

play11:48

And the way you do it in a digital computer

play11:51

is you'd have a bunch of transistors

play11:52

for representing each neural activity,

play11:54

and a bunch of transistors for representing each weight.

play11:57

You drive them at very high power, so they're binary.

play12:01

And if you want to do the multiplication quickly,

play12:05

then you need to perform of the order of 32 squared

play12:08

one bit operations to do the multiplication quickly.

play12:14

Or you could do an analog

play12:16

where the neural activities are just voltages,

play12:18

like they are in the brain, the weights are conductances,

play12:23

and if you take a voltage times a conductance,

play12:26

it produces charge per unit time.

play12:29

So you put the voltage

play12:30

through this thing that has a conductance,

play12:32

and out the other end comes charge,

play12:33

and the longer you wait, the more charge comes out.

play12:36

The nice thing about charges is they just add themselves,

play12:38

and that's what they do in neurons too.

play12:41

And so this is hugely more efficient.

play12:42

You've just got a voltage going through a conductance

play12:46

and producing charge,

play12:47

and that's done your floating point multiply.
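A back-of-the-envelope version of that analog multiply, with illustrative numbers (the values and units here are assumptions, not figures from the talk):

```python
# Ohm's law as a multiplier: current = conductance * voltage.
voltage = 0.8          # the neural activity, represented as a voltage (volts)
conductance = 2.5e-6   # the synapse strength, represented as a conductance (siemens)
dt = 1e-3              # how long you let the current flow (seconds)

current = conductance * voltage   # the "floating point multiply" is just physics
charge = current * dt             # charge accumulates the longer you wait

# Charges from many synapses arriving on the same wire simply add themselves,
# which does the summation part of the vector-matrix multiply for free.
print(charge)
```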

play12:51

It can afford to be relatively slow

play12:53

if you do it a trillion ways in parallel.

play12:56

And so you can have machines

play12:58

that operate at 30 watts like the brain

play13:00

instead of a megawatt,

play13:02

which is what these digital models do when they're learning

play13:05

and you have many copies of them in parallel.

play13:08

So we get huge energy efficiency,

play13:12

but we also get big problems.

play13:15

To make this whole idea of mortal computing work,

play13:17

you have to have a learning procedure

play13:20

that will run in analog hardware

play13:22

without knowing the precise properties of that hardware.

play13:26

And that makes it impossible

play13:27

to use things like back propagation.

play13:29

Because back propagation,

play13:30

which is the standard learning algorithm

play13:32

used for all neural nets now, almost all,

play13:35

needs to know what happens in the forward pass

play13:38

in order to send messages backwards to tell it how to learn.

play13:41

It needs a perfect model of the forward pass,

play13:44

and it won't have it in this kind of mortal hardware.

play13:47

People have put a lot of effort, I spent the last two years,

play13:51

but lots of other people have put much more effort

play13:54

into trying to figure out

play13:56

how to find a biologically plausible learning procedure

play13:59

that's as good as back propagation.

play14:02

And we can find procedures that in small systems,

play14:05

systems with say a million connection strengths,

play14:07

do work pretty well.

play14:08

They're comparable with back propagation,

play14:10

they get performances almost as good,

play14:12

and they learn relatively quickly.

play14:16

But these things don't scale up.

play14:18

When you scale them up to really big networks,

play14:19

they just don't work as well as back propagation.

play14:22

So that's one problem with mortal computation.

play14:26

Another big problem is obviously when the hardware dies

play14:29

you lose all the knowledge,

play14:30

'cause the knowledge is all mixed up.

play14:32

The conductances are for that particular piece of hardware,

play14:35

and all the neurons are different

play14:36

in a different piece of hardware.

play14:38

So you can't copy the knowledge by just copying the weights.

play14:42

The best solution if you want to keep the knowledge

play14:46

is to make the old computer be a teacher

play14:50

that teaches the young computer what it knows.

play14:54

And it teaches the young computer that by taking inputs

play14:57

and showing the young computer

play14:58

what the correct outputs should be.

play15:00

And if you've got say a thousand classes,

play15:02

and you show real value probabilities

play15:03

for all thousand classes,

play15:05

you're actually conveying a lot of information,

play15:07

that's called distillation and it works.

play15:10

It's what we use in digital neural nets.

play15:13

If you've got one architecture,

play15:15

and you want to transfer the knowledge

play15:17

to a completely different digital architecture,

play15:18

we use distillation to do that.
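Here is a minimal sketch of distillation as described here, with a hypothetical five-class example standing in for the thousand classes (the function and the numbers are illustrative):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

# The old "teacher" computer shows real-valued probabilities over all the classes,
# not just the single correct label, so each example conveys a lot of information.
teacher_probs = softmax(np.array([2.0, 0.5, -1.0, 0.1, -0.5]))
student_probs = softmax(np.array([1.0, 1.0,  0.0, 0.0,  0.0]))

# The young "student" computer is trained to match the whole distribution:
# this cross-entropy is the quantity its learning would push down.
distillation_loss = -np.sum(teacher_probs * np.log(student_probs + 1e-12))
print(distillation_loss)
```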

play15:22

It's not nearly as efficient

play15:23

as the way we can share knowledge between digital computers.

play15:29

It is, as a matter of fact, how Trump's tweets work.

play15:33

What you do is you take a situation,

play15:35

and you show your followers

play15:37

a nice prejudiced response to that situation,

play15:40

and your followers learn to produce the same response.

play15:43

And it's just a mistake to say,

play15:46

but what he said wasn't true.

play15:48

That's not the point of it at all.

play15:49

The point is to distill prejudice into your followers,

play15:52

and it's a very good way to do that.

play15:57

So there's basically two very different ways

play16:02

in which a community of agents can share knowledge.

play16:07

And let's just think about

play16:08

the sharing of knowledge for a moment.

play16:11

'Cause that's really what is the big difference

play16:13

between mortal computation and immortal computation,

play16:16

or digital and biological computation.

play16:21

If you have digital computers

play16:23

and you have many copies of the same model,

play16:26

so with exactly the same weights in it,

play16:29

running on different hardware, different GPUs,

play16:34

Then each copy can look at different data,

play16:37

different part of the internet, and learn something.

play16:40

And when it learns something, what that really means is

play16:42

it's extracting from the data it looks at

play16:44

how it ought to change its weights

play16:45

to be a better model of that data.

play16:48

And you can have thousands of copies

play16:50

all looking at different bits of the internet,

play16:52

all figuring out how they should change their weights

play16:55

in order to be a better model of that data.

play16:57

And then they can communicate

play16:59

all the changes they'd all like,

play17:01

and just do the average change.

play17:03

And that will allow

play17:05

every one of those thousands of models to benefit

play17:07

from what all the other thousands of models learned

play17:10

by looking at different data.
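A minimal sketch of that weight-sharing scheme, assuming a toy update function and made-up shapes (an illustration of the averaging idea, not how any particular model was actually trained):

```python
import numpy as np

rng = np.random.default_rng(0)
shared_weights = rng.standard_normal((3, 3))   # every copy starts with identical weights

def local_update(weights, data_shard):
    # Stand-in for the weight change a copy derives from the data it looked at.
    return data_shard.mean() * np.ones_like(weights)

# Each copy looks at a different slice of data and proposes a change to the weights.
data_shards = [rng.standard_normal(100) for _ in range(4)]
proposed = [local_update(shared_weights, shard) for shard in data_shards]

# Communicate the changes and apply the average, so every copy benefits from what
# all the others learned. This only works because the copies are digitally identical.
average_change = sum(proposed) / len(proposed)
shared_weights += 0.01 * average_change
print(shared_weights)
```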

play17:13

When you do sharing of gradients like that,

play17:16

if you've got a trillion weights,

play17:18

you're sharing a trillion real numbers,

play17:20

that's a huge bandwidth of sharing.

play17:24

It's probably as much learning as goes on

play17:26

in the whole of the University of Toronto in a month.

play17:32

But it only works

play17:33

if the different agents work in exactly the same way.

play17:38

So that's why it needs to be digital.

play17:45

If you look at distillation, we can have different agents

play17:50

which have different hardware now,

play17:52

they can learn different things,

play17:54

they can try and convey those things to each other

play17:56

maybe by publishing papers in journals,

play17:59

but it's a slow and painful process.

play18:02

So if we think about the normal way to do it

play18:04

as say I look at an image,

play18:07

and I describe to you what's in the image,

play18:10

and that's conveying to you how I see things.

play18:14

There's only a limited number of bits

play18:16

in my caption for an image.

play18:18

And so the amount of information that's being conveyed

play18:21

is very limited.

play18:24

Language is better than just giving you a response

play18:26

that says good or bad or it's this class or that class.

play18:29

If I describe what's in the image,

play18:30

that's giving you more bits.

play18:31

So it makes distillation more effective,

play18:33

but it's still only a few hundred bits.

play18:35

It's not like a trillion real numbers.

play18:38

So distillation has a hugely lower bandwidth

play18:41

than this sharing of gradients or sharing of weights

play18:44

that digital computers can do.

play18:49

So the story so far,

play18:52

digital computation requires a lot of energy,

play18:54

like a megawatt,

play18:57

but it has a very efficient way

play18:59

of sharing what different agents learn.

play19:02

And if you look at something like GPT-4,

play19:04

the way it was trained

play19:05

was lots of different copies of the model went off

play19:08

and looked at different bits of data

play19:10

running on different GPUs,

play19:11

and then they all shared that knowledge.

play19:13

And that's why it knows

play19:14

thousands of times more than a person,

play19:17

even though it has many fewer connections than a person.

play19:21

We have about a hundred trillion synapses,

play19:23

GPT-4 probably has about 2 trillion synapses, weights,

play19:27

although Ilya won't tell me, but it's about that number.

play19:34

So it's got much more knowledge and far fewer connections,

play19:38

and it's because it seen hugely more data

play19:40

than any person could possibly see.

play19:43

This actually gets worse

play19:44

when these things are actually agents that perform actions.

play19:46

'Cause now you can have thousands of copies

play19:48

performing different actions,

play19:50

and when you're performing actions

play19:51

you can only perform one action at a time.

play19:53

And so having these thousands of copies,

play19:56

being able to share what they learned,

play19:58

lets you get much more experience

play20:00

than any mortal computer could get.

play20:03

Biological computation requires a lot less energy,

play20:06

but it's much worse at sharing knowledge.

play20:09

So now let's look at large language models.

play20:13

These use digital computation and weight sharing,

play20:16

which is why they can learn so much.

play20:18

They're actually getting knowledge from people

play20:20

by using distillation.

play20:22

So each individual agent

play20:25

is trying to mimic what people said.

play20:27

It's trying to predict the next word in the document.

play20:30

So that's distillation.

play20:31

It's actually a particularly

play20:32

inefficient form of distillation,

play20:33

'cause it's not predicting the probabilities

play20:35

that a person would assign to the next word.

play20:37

It's actually predicting the actual word,

play20:39

which is just a probabilistic choice from that,

play20:41

and conveys very few bits

play20:42

compared with the whole probability distribution.

play20:44

Sorry, that was a technical bit. I won't do that again.

play20:50

So it's an inefficient form of distillation,

play20:52

and these large language models have to learn

play20:54

in that inefficient way from people,

play20:57

but they can combine what they learn very efficiently.

play21:03

So the issue I want to address

play21:06

is do they really understand what they're saying?

play21:09

And that's a huge divide here.

play21:12

There's lots of old-fashioned linguists who will tell you

play21:15

they don't really understand what they're saying.

play21:17

They're just using statistical tricks

play21:20

to pastiche together regularities they found in the text,

play21:23

and they don't really understand.

play21:26

We used to have in computer science

play21:29

a fairly widely-accepted test for whether you understand,

play21:31

which was called the Turing test.

play21:35

When GPT-4 basically passed the Turing test,

play21:38

people decided it wasn't a very good test.

play21:41

(audience laughing)

play21:42

I think it was a very good test, and it passed it.

play21:48

So here's one of the objections people give.

play21:50

It's just glorified autocomplete.

play21:53

You are training it to predict the next word,

play21:56

and that's all it's doing,

play21:57

it's just predicting the next word.

play21:58

It doesn't understand anything.

play22:00

Well, when people say that,

play22:02

it's because they have a particular picture in their minds

play22:05

of what is required to do autocomplete.

play22:10

A long time ago, the way you would do autocomplete is this.

play22:13

You would keep a big table of all triples of words.

play22:17

And so now if you saw the word fish and,

play22:20

you could look in your table

play22:21

and say Find me all the triples that start with fish and,

play22:24

and look at how many of them have particular words next.

play22:27

And you'll find there's many occurrences

play22:30

of the triple fish and chips.

play22:32

And so chips is a very good bet for filling it in,

play22:34

at least if you're English.
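The old table-of-triples autocomplete can be sketched in a few lines (the toy corpus below is made up purely for illustration):

```python
from collections import Counter, defaultdict

corpus = "i like fish and chips he likes fish and chips she likes fish and rice".split()

# Keep a big table of all triples of words: (first, second) -> counts of the word that followed.
triples = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    triples[(w1, w2)][w3] += 1

# Having just seen "fish and", bet on the most frequent completion in the table.
print(triples[("fish", "and")].most_common(1))   # [('chips', 2)]
```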

play22:38

But the point is that's not how large language models work.

play22:43

Even though they're doing autocomplete

play22:44

in the sense that they're predicting the next word,

play22:46

they're using a completely different method to predict it.

play22:49

And it's not like the statistical methods

play22:52

that people like Chomsky had in mind

play22:55

when they said that you can't do language with statistics.

play22:59

These are much more powerful statistical methods

play23:01

that can basically do anything.

play23:05

And the way they model text is not by storing the text.

play23:08

You don't keep strings of words anywhere.

play23:10

There is no text inside GPT-4.

play23:13

It produces text, and it reads text,

play23:15

but there's no text inside.

play23:19

What they do is they associate with each word

play23:22

or fragment of a word.

play23:24

I'll say word, and the technical people

play23:26

will know it's really fragments of words,

play23:27

but it's just easier to say word.

play23:28

They associate with each word a bunch of numbers,

play23:33

a few hundred numbers, maybe a thousand numbers,

play23:36

that are intended to capture the meaning and the syntax

play23:39

and everything about that word.

play23:41

These are real numbers, so there's a lot of information

play23:42

in the thousand real numbers.

play23:45

And then they take the words in a sentence,

play23:51

the words that came before the words you want to predict.

play23:54

And they let these words interact

play23:55

so that they refine the meanings

play23:59

that you have for the words.

play24:00

I'll say meanings loosely, it's called an embedding vector.

play24:03

It's a bunch of real numbers associated with that word.

play24:06

And these all interact, and then you predict the numbers

play24:12

that are gonna be associated with the output word,

play24:14

the words you're trying to predict.

play24:15

And from that bunch of numbers you then predict the word.

play24:19

These numbers are called feature activations.

play24:21

And in the brain there'd be the activations of neurons.
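A deliberately tiny sketch of that embedding-vector picture (the vocabulary, dimensions, and single mixing step are illustrative assumptions; real models use far richer interactions across many layers):

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["the", "cat", "sat", "on", "mat"]
dim = 8   # a few hundred to a thousand numbers per word in a real model

embeddings = rng.standard_normal((len(vocab), dim))      # a vector of real numbers per word
mixing = rng.standard_normal((dim, dim)) / np.sqrt(dim)  # stand-in for layers where features interact
to_vocab = rng.standard_normal((dim, len(vocab)))        # maps refined features back to words

context = ["the", "cat", "sat", "on"]
vectors = embeddings[[vocab.index(w) for w in context]]

# Let the word vectors interact (a crude pooling-and-mixing step) to refine the features,
# then predict the numbers for the output word and pick the most likely next word.
refined = np.tanh(vectors.mean(axis=0) @ mixing)
logits = refined @ to_vocab
print(vocab[int(np.argmax(logits))])   # an (untrained) guess at the next word
```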

play24:26

So the point is what GPT-4 has learned

play24:31

is lots of interactions between feature activations

play24:35

of different words or word fragments.

play24:37

And that's how its knowledge is stored.

play24:39

It's not at all stored in storing text.

play24:43

And if you think about it,

play24:44

to predict the next word really well,

play24:47

you have to understand the text.

play24:50

If I asked you a question

play24:51

and you want to answer the question,

play24:55

you have to understand the question to get the answer.

play24:57

Now some people think maybe you don't.

play25:01

My good friend Yann LeCun appears to think

play25:03

you don't actually have to understand,

play25:05

he's wrong and he'll come round.

play25:07

(audience laughing)

play25:09

So this was a problem suggested to me by Hector Levesque.

play25:16

Hector suggested something a bit simpler

play25:20

that didn't involve paint fading,

play25:22

and thought GPT-4 wouldn't be able to do it

play25:24

'cause it requires reasoning,

play25:25

and it requires reasoning about cases.

play25:29

So I made it a bit more complicated and gave it to GPT-4,

play25:32

and it solves it just fine.

play25:36

I'll read it out in case you can't read it at the back.

play25:38

The rooms in my house are painted blue or white or yellow,

play25:41

yellow paint fades to white within a year.

play25:44

In two years' time I want them all to be white.

play25:46

What should I do and why?

play25:50

And GPT-4 says this,

play25:52

it gives you a kind of case-based analysis.

play25:54

It says the room's painted white,

play25:56

you don't have to do anything.

play25:57

If the room's painted yellow,

play25:57

you don't need to repaint them 'cause they'll fade,

play25:59

and the room is painted in blue, you need to repaint those.

play26:02

Now each time you do it,

play26:03

it gives you a slightly different answer

play26:05

because of course it hasn't stored the text anywhere.

play26:07

It's making it up as it goes along,

play26:10

but it's making it up correctly.

play26:12

And this is a simple example of reasoning,

play26:14

and it's reasoning that involves time

play26:16

and understanding that if it fades in a year,

play26:18

in two years' time it's gonna be faded, and stuff like that.

play26:21

So there's many, many examples like this.

play26:23

Now there's also many examples where it screws up,

play26:26

but the fact that there's many examples like this

play26:28

make me believe it really does understand what's going on.

play26:31

I don't see how you could do this

play26:32

without understanding what's going on.

play26:38

Another argument that LLMs don't really understand

play26:41

is that they produce hallucinations.

play26:43

They sometimes say things

play26:44

that are just false or just nonsense,

play26:48

but people are particularly worried about

play26:49

when they just apparently make stuff up that's false.

play26:52

They called that hallucinations

play26:54

when it was done by language models,

play26:56

which was a technical mistake.

play26:58

If you do it with language, it's called a confabulation.

play27:01

If you do it with vision, it's called a hallucination.

play27:07

But the point about confabulations is

play27:09

they're exactly how human memory works.

play27:12

We think our memories, most people have a model of memory

play27:15

is there's a filing cabinet somewhere,

play27:17

and an event happens, and you put in the filing cabinet,

play27:19

and then later on you go in the filing cabinet

play27:21

and get the event out and you've remembered it.

play27:22

It's not like that at all.

play27:26

We actually reconstruct events.

play27:29

What we store is not the neural activities.

play27:32

We store weights,

play27:34

and we reconstruct the pattern of neural activities

play27:35

using these weights and some memory cues.

play27:38

And if it was a recent event,

play27:41

like if it was what the dean said at the beginning,

play27:44

you can probably reconstruct fairly accurately

play27:46

some of the sentences she produced.

play27:49

Like he needs no introduction,

play27:50

and then going on and giving a long introduction.

play27:53

(audience laughing)

play27:56

You remember that, right?

play28:01

So we get it right, and we think we've stored it literally,

play28:04

but actually we're reconstructing it

play28:05

from the weights we have,

play28:07

and these weights haven't been interfered with

play28:08

by future events, so they're pretty good.

play28:11

If it's an old event, you reconstruct the memory,

play28:15

and you typically get a lot of the details wrong,

play28:18

and you're unaware of that.

play28:21

And people are actually very confident

play28:22

about details they get wrong,

play28:23

they're as confident about those as details they get right.

play28:26

And there's a very nice example of this.

play28:30

So John Dean testified in the Watergate trial,

play28:37

and he testified under oath

play28:38

before he knew that there were tapes.

play28:42

And so he testified about these various meetings,

play28:44

and what happened in these various meetings.

play28:47

And Haldeman said this and Ehrlichman said that,

play28:49

and Nixon said this, and a lot of it he got wrong.

play28:53

Now I believe that to be the case.

play28:55

I actually read Ehrlichman's book about 20 years ago,

play28:58

and I'm now confabulating,

play29:01

but I'm fairly sure that he got a lot of the details wrong,

play29:05

but he got the gist correct.

play29:06

He was clearly trying to tell the truth,

play29:09

and the gist of what he was saying was correct.

play29:11

The details were wrong, but he wasn't lying.

play29:16

He was just doing the best human memory can

play29:18

about events that were a few years old.

play29:21

So these hallucinations as they're called,

play29:24

or confabulation, they are exactly what people do.

play29:27

We do it all the time.

play29:31

My favorite example of people doing confabulation

play29:34

is there's someone called Gary Marcus

play29:36

who criticizes neural nets,

play29:38

and he says neural nets don't really understand anything,

play29:41

they just pastiche together

play29:42

the texts they've read on the web.

play29:44

Well that's 'cause he doesn't understand how they work.

play29:46

They don't pastiche together

play29:47

texts that they've read on the web,

play29:48

because they're not storing any text,

play29:50

they're storing these weights and generating things.

play29:53

He's just kind of making up how he thinks it works.

play29:56

So actually that's a person doing confabulation.

play30:02

Now chatbots are currently a lot worse than people

play30:06

at realizing when they're doing it,

play30:09

but they'll get better.

play30:14

In order to sort of give you some insight

play30:17

into how all these features interacting

play30:20

can cause you to understand,

play30:22

how understanding could consist of

play30:24

assigning features to words

play30:26

and then having the features interact.

play30:28

I'm gonna go back to 1985,

play30:31

and to the first neural net language model.

play30:34

It was very small, it had 112 training cases,

play30:38

which is not big data.

play30:40

And it had these embedding vectors

play30:41

that were six real numbers,

play30:44

which is not like a thousand numbers,

play30:47

but my excuse is the computer I used was a lot smaller.

play30:51

So if you took the computer I was using in 1985,

play30:54

and you started running it in 1985 doing a computation,

play30:58

and then you took one of these modern computers

play31:00

we use for training chatbots, and you ask,

play31:02

how long would the modern computer take to catch up?

play31:04

Less than a second.

play31:06

In less than a second it would've caught up

play31:07

with all this computer had done since 1985.

play31:10

That's how much more powerful things have got.

play31:13

Okay, so the aim of this model

play31:16

was to unify two different theories of meaning.

play31:19

One theory is basically

play31:21

what a lot of psychologists believed,

play31:23

which was the meaning of a word

play31:24

is just a whole bunch of semantic features,

play31:26

and maybe some syntactic features too.

play31:29

And that can explain why a word like Tuesday

play31:32

and a word like Wednesday have very similar meanings.

play31:34

They have very similar semantic features.

play31:37

So psychologists were very concerned with similarity

play31:40

and dissimilarity of meanings.

play31:42

And they had this model

play31:43

of just this vector of semantic features,

play31:45

and that's the meaning of a word.

play31:47

And it's a very kinda static dead model.

play31:50

The features just kind of sit there and they're the meaning.

play31:54

They never could say where the features came from.

play31:56

They obviously have to be learned.

play31:59

You're not born innately knowing what words mean,

play32:02

but they didn't have a good model of how they were learned.

play32:06

And then there's a completely different theory of meaning

play32:08

which AI people had, and most linguists had.

play32:12

I'm not a linguist, but I think it goes back to de Saussure,

play32:16

and it's a structuralist theory of meaning.

play32:17

And the idea is the meaning of a concept

play32:19

is its relation to other concepts.

play32:23

So if you think about it in terms of words,

play32:24

the meaning of a word

play32:25

comes from its relationships to other words.

play32:27

And that's what meaning's all about.

play32:29

And so computer scientists said,

play32:30

well, if you want to represent meaning,

play32:34

what you need is a relational graph.

play32:36

So you have nodes that are words,

play32:39

and you have arcs between them for their relationships,

play32:41

and that's gonna be a good way to represent meaning.

play32:44

And that seems like completely different

play32:46

from a whole bunch of semantic features.

play32:49

Now I think both of these things are both right and wrong.

play32:56

And what I wanted to do

play32:56

was unify these two approaches to meaning,

play33:01

and show that actually what you can have

play33:02

is you can have features associated with words,

play33:06

and then the interactions between these features

play33:10

create this relational graph.

play33:12

The relational graph isn't stored as a relational graph.

play33:15

What you've got is features that go with words.

play33:18

But if I give you some words,

play33:20

the interactions between their features will say,

play33:23

yes, these words can go together that way.

play33:24

That's a sensible way for them to go together.

play33:27

So I'm gonna show you an example of that.

play33:32

And this I believe to be the first example of a neural,

play33:35

a deep neural net learning word meanings

play33:39

from relational data,

play33:41

and then able to answer relational questions

play33:43

about relational data.

play33:47

So we're gonna train it with back propagation,

play33:50

which I'll explain very briefly in a minute.

play33:52

And we're gonna make features interact in complicated ways.

play33:55

And these interactions

play33:56

between the features that go with words

play33:58

are gonna cause it to believe in some combinations of words

play34:03

and not believe in other combinations of words.

play34:07

And these interactions

play34:08

are a very powerful statistical model.

play34:12

So this is the data, it's two family trees,

play34:16

a tree of English people, and a tree of Italian people.

play34:22

And you have to think back to the 1950s.

play34:25

We're not gonna allow marriage

play34:28

between people from different countries.

play34:30

We are not gonna allow divorces,

play34:32

we are not gonna allow adoptions,

play34:34

but it's gonna be very, very straight families,

play34:36

extremely straight, okay?

play34:41

And the idea is I'm gonna take this relational data

play34:46

and I'm gonna train a neural net

play34:49

so that it learns features for each of these people

play34:52

and for each of the relationships,

play34:55

and those features interact

play34:58

so that it's captured this knowledge.

play35:01

And in particular what we're gonna do is we're gonna say

play35:05

all of the knowledge in those family trees

play35:07

can be expressed as a set of triples.

play35:13

We have 12 relationships,

play35:16

and I think there's 12 people in each family tree.

play35:20

And so I can say Colin has father James,

play35:24

and that expresses something that is in this tree.

play35:27

You can see Colin has father James,

play35:32

and of course if I give you a few facts like that,

play35:35

like Colin has father James and Colin has mother Victoria,

play35:38

you can infer that James has wife Victoria

play35:41

in this very regular domain.

play35:44

And so conventional AI people would've said,

play35:46

well, what you need to do is store these facts.

play35:49

It's like sort of dead facts like this.

play35:50

You're just storing strings of symbols,

play35:53

and you need to learn a rule

play35:55

that says how you can manipulate these strings of symbols.

play35:57

That will be the standard AI way to do it back in 1985.

play36:05

And I want to do it a quite different way.

play36:08

So rather than looking for symbolic rules

play36:11

for manipulating these symbol strings

play36:12

to get new symbol strings, which works,

play36:17

I want to take a neural net

play36:19

and try and assign features to words

play36:22

and interactions between the features,

play36:24

so that I can generate these strings

play36:27

so that I can generate the next word.

play36:30

And it's just a very different approach.

play36:32

Now if it really is a discrete space,

play36:35

maybe looking for rules is fine,

play36:36

but of course for real data,

play36:38

these rules are all probabilistic anyway.

play36:41

And so searching through a discrete space

play36:42

now doesn't seem that much better

play36:43

than searching a real value space.

play36:45

And actually a lot of mathematicians will tell you

play36:49

real value spaces are much easier to deal with

play36:52

than discrete spaces.

play36:53

It's easier typically to search through a real value space.

play36:55

And that's what we're doing here.

play36:57

Oh sorry, I got technical again. I didn't mean to.

play37:01

It happens if you're an ex-professor.

play37:06

Okay, so we're gonna use the back propagation algorithm,

play37:10

and the way back propagation works

play37:10

is you have a forward pass that starts at the input,

play37:13

information goes forward through the neural network.

play37:16

And on each connection you have a weight

play37:18

which might be positive or negative, which is green or red.

play37:22

And you activate these neurons,

play37:24

and they're all non-linear neurons, so you get an output,

play37:27

and then you compare the output you got

play37:28

with the output you should have got.

play37:30

And then you send a signal backwards, and you use calculus

play37:34

to figure out how you should change each weight

play37:37

to make the answer you get

play37:38

more like the answer you wanted to get.

play37:40

And it's as simple as that.

play37:43

I'm not gonna go into the details of it,

play37:45

but you can read about that in lots of places.

play37:48

So we're gonna use that approach of you put the inputs in,

play37:52

you go through, you get an answer,

play37:54

you look at the difference between the answer you got

play37:56

and the answer you wanted, and you send a signal backwards

play37:58

which learns how to change all the weights.
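As a rough illustration of that forward-then-backward loop, here is a minimal sketch of backpropagation for a tiny two-layer network in NumPy. It is not the network from the talk; the layer sizes, the squared-error loss, and the learning rate are all just illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny network: 4 inputs -> 8 non-linear hidden neurons -> 3 outputs.
W1 = rng.normal(0, 0.1, (4, 8))
W2 = rng.normal(0, 0.1, (8, 3))

def forward(x):
    h = np.tanh(x @ W1)          # non-linear hidden activities
    y = h @ W2                   # output
    return h, y

x = rng.normal(size=(1, 4))            # a made-up input
target = np.array([[1.0, 0.0, 0.0]])   # the answer we wanted

for step in range(100):
    h, y = forward(x)
    error = y - target                   # compare the output with what it should have been
    # Backward pass: the chain rule says how each weight should change.
    dW2 = h.T @ error
    dh = error @ W2.T * (1 - h ** 2)     # derivative of tanh
    dW1 = x.T @ dh
    # Change the weights to make the answer more like the one we wanted.
    lr = 0.1
    W1 -= lr * dW1
    W2 -= lr * dW2

print(forward(x)[1])   # the output is now much closer to the target
```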

[38:03] And here's the net we're gonna use. We're gonna have two inputs, a person and a relationship, and they're initially gonna have a local encoding. And what that means is, for the people there'll be 24 neurons, and for each person we'll turn on a different neuron. So in that block at the bottom that says local encoding of person one, one neuron will be turned on. And similarly for the relationship, one neuron will be turned on. And then the outgoing weights of that neuron to the next layer will cause a pattern of activity in the next layer, and that'll be a distributed representation of that person. That is, we converted from this one-hot representation to a vector of activities; in this case it's just six activities. So those six neurons will have different levels of activity depending on which person it is. And then we take those vectors that represent the person and the relationship, and we put them through some neurons in the middle there that allow things to interact in complicated ways. And we produce a vector that's meant to be the features of the output person, and then from that we pick an output person. So that's how it's gonna work. It's gonna be trained with backprop.
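A minimal PyTorch sketch of that architecture as described: one-hot encodings of the input person (24 people) and the relationship (12 relationships), small learned feature vectors for each (six features for the person), a middle layer where they interact, and a score over the 24 possible output people. The sizes of the relationship features and the middle layer are my own guesses, not numbers from the talk, and the person and relationship indices are placeholders.

```python
import torch
import torch.nn as nn

class FamilyTreeNet(nn.Module):
    """Sketch of the 1985 family-trees network described above."""
    def __init__(self, n_people=24, n_relations=12, person_feats=6,
                 relation_feats=6, hidden=12):
        super().__init__()
        # Local (one-hot) codes go in; these layers produce the distributed codes.
        self.person_embed = nn.Linear(n_people, person_feats, bias=False)
        self.relation_embed = nn.Linear(n_relations, relation_feats, bias=False)
        # Neurons in the middle that let the two feature vectors interact.
        self.middle = nn.Sequential(
            nn.Linear(person_feats + relation_feats, hidden), nn.Sigmoid())
        # Features of the output person, then a score for each of the 24 people.
        self.output_feats = nn.Linear(hidden, person_feats)
        self.output_person = nn.Linear(person_feats, n_people)

    def forward(self, person_onehot, relation_onehot):
        p = self.person_embed(person_onehot)
        r = self.relation_embed(relation_onehot)
        h = self.middle(torch.cat([p, r], dim=-1))
        return self.output_person(self.output_feats(h))   # logits over people

# Usage: "Colin has-father ?" -> scores over the 24 people.
net = FamilyTreeNet()
person = torch.zeros(1, 24); person[0, 0] = 1.0       # say index 0 is Colin
relation = torch.zeros(1, 12); relation[0, 3] = 1.0   # say index 3 is has-father
logits = net(person, relation)
# Trained with backprop against the correct output person (say index 5 is James).
loss = nn.functional.cross_entropy(logits, torch.tensor([5]))
loss.backward()
```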

[39:13] And what happens is that, if you train it with the right kind of regularizers, what you get, sorry, I got technical again, forget that. If you train it, what you get is: if you look at the six features that represent a person, they become meaningful features. They become what you might call semantic features. So one of the features will always be nationality. All the Italian people will have that feature turned on, and all the English people will have that feature turned off, or vice versa. Another feature will be like a three-valued feature. That's the generation; you'll notice that in the family trees there were three generations, and you'll get a feature that tells you which generation somebody is. And if you look at the features of relationships, a relationship like has-father will have a feature that says the output should be one generation above the input, and has-uncle will be the same, but has-brother will not be like that. So now in the representation of the relationship, you've got features that say the output needs to be one generation up. In the representation of the person, you've got a feature that says middle generation. And so those features that do all the interactions, these guys in the middle, will take the fact that it's middle generation, and the fact that the answer needs to be one generation up, and combine those, and predict that the answer should be one generation up.

[40:47] You can think of this, in this case, as lots of things you could have written as discrete rules, but this is a particularly simple case. It's a very regular domain, and what it learns is an approximation to a bunch of discrete rules, and there are no probabilities involved, because the domain's so simple and regular. So you can see what it's doing, and you can see that in effect it's doing what conventional AI people want you to do. It's learning a whole bunch of rules to predict the next word from the previous words. And these rules are capturing the structure of the domain, all of the structure in that family-trees domain.

[41:26] Actually, if you use three different nationalities, it'll capture all the structure well; with two different nationalities, it's not quite enough training data and it'll get a little bit of it wrong sometimes, but it captures that structure. And when I did this research in 1985, conventional AI people didn't say, this isn't understanding, or, you haven't really captured the structure. They said, this is a stupid way to find rules. We have better ways of finding rules.

[42:00] Well, it turns out this isn't a stupid way to find rules. If it turns out there's a billion rules, and most of them are only approximate, this is now a very good way to find rules. Only they're not exactly what was meant by rules, 'cause they're not discrete, correct-every-time rules. There's billions of them, actually more like a trillion rules. And that's what these neural net models are learning. They're not storing text; they're learning these interactions, which are like rules that they've extracted from the domain, that explain why you get these word strings and not other word strings. So that's how these big language models actually work.

[42:39] Now of course this was a very simple language model. So about 10 years later, Yoshua Bengio took essentially the same network. He tried two different kinds of network, but one of them was essentially the same architecture as the network I'd used. But he applied it to real language. He got a whole bunch of text, we wouldn't call it a whole bunch now, but it was probably hundreds of thousands of words. And he tried predicting the next word from the previous five words, and it worked really well. It was about comparable with the best language models of the time. It wasn't better, but it was comparable.
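For intuition, here is a minimal sketch of that kind of fixed-window next-word predictor: embed each of the previous five words, concatenate the embeddings, pass them through a hidden layer, and score every word in the vocabulary. The vocabulary size, embedding size, and hidden size are placeholders, not values from Bengio's actual model.

```python
import torch
import torch.nn as nn

class NextWordNet(nn.Module):
    """Sketch of a fixed-window neural language model (predict word 6 from words 1-5)."""
    def __init__(self, vocab_size=10_000, embed_dim=64, context=5, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)        # one embedding per word
        self.hidden = nn.Sequential(
            nn.Linear(context * embed_dim, hidden), nn.Tanh())
        self.out = nn.Linear(hidden, vocab_size)                # a score for every word

    def forward(self, context_ids):                 # shape: (batch, 5)
        e = self.embed(context_ids)                 # (batch, 5, embed_dim)
        h = self.hidden(e.flatten(start_dim=1))     # concatenate the five embeddings
        return self.out(h)                          # logits over the vocabulary

model = NextWordNet()
previous_five = torch.randint(0, 10_000, (1, 5))    # made-up word indices
predicted_next = model(previous_five).argmax(dim=-1)
```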

[43:20] After about another 10 years, people doing natural language processing all began to believe that you want to represent a word by this real-valued vector called an embedding, which captures the meaning and syntax of the word. And about another 10 years after that, people invented things called transformers. And transformers allow you to deal with ambiguity in a way that the model I had couldn't. They're also much more complicated.

[43:53] In the model I was doing, my simple language model, the words were unambiguous, but in real language you get ambiguous words. Like you get a word like May. It could be a woman's name, let's ignore that for now. It could be a month, it could be a modal, like might and should. And if you don't have capitals in your text, conveniently. (cell phone rings) You can't tell, should I have finished by now? (audience laughing) I'm gonna go on a bit over an hour, I'm afraid. You can't tell what it should be just by looking at the input symbol. So what do you do? You've got this vector, let's say it's a thousand-dimensional vector, that's the meaning of the month. And you've got another vector that's the meaning of the modal, and they're completely different. So which are you gonna use? Well, it turns out thousand-dimensional spaces are very different from the spaces we're used to. And if you take the average of those two vectors, that average is remarkably close to both of those vectors, and remarkably far from everything else. So you can just average them. And that'll do for now; it's ambiguous between the month and the modal.
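That property of high-dimensional spaces is easy to check numerically. A quick sketch (my own illustration, not from the talk): draw random 1000-dimensional unit vectors, average two of them, and compare dot products.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v)

# Random 1000-dimensional "embeddings" for the month sense, the modal sense,
# and a thousand unrelated words (purely illustrative).
month = unit(rng.normal(size=1000))
modal = unit(rng.normal(size=1000))
others = [unit(rng.normal(size=1000)) for _ in range(1000)]

average = unit((month + modal) / 2)

print(average @ month)                         # ~0.7: remarkably close to the month vector
print(average @ modal)                         # ~0.7: and to the modal vector
print(max(abs(average @ o) for o in others))   # ~0.1: far from everything else
```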

[44:58] Now you have layers of embeddings, and in the next layer you'd like to refine that embedding. So what you do is you look at the embeddings of other things in this document, and if nearby you find words like March and 15th, then that causes you to make the embedding more like the month embedding. If nearby you find words like would and should, it'll be more like the modal embedding. And so you progressively disambiguate the words as you go through these layers. And that's how you deal with ambiguous words. I didn't know how to deal with those.

[45:39] I've grossly simplified transformers, 'cause the way in which words interact is not direct interactions anymore. They're rather indirect interactions which involve things like making up keys and queries and values. And I'm not gonna go into that. Just think of them as somewhat more complicated interactions which have the property that the word 'may' can be particularly strongly influenced by the word 'march'. And it won't be very strongly influenced by words like 'although'; 'although' won't have much effect on it, but 'march' will have a big effect on it. That's called attention. And the interactions are designed so similar things will have a big effect on you. For those of you who know how transformers actually work, you can see that's a very, very crude approximation. But it's conveying the basic idea, I believe.
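Here is a minimal sketch of that keys-queries-values idea: standard scaled dot-product attention over a handful of word vectors. It is not the full transformer, and the embedding dimension and the random matrices are arbitrary stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 16                                   # embedding dimension (arbitrary)
words = ["may", "march", "although"]
x = rng.normal(size=(len(words), d))     # one embedding per word (made up)

# Each word makes up a query, a key and a value via learned matrices.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

# Attention weights: how strongly each word is influenced by every other word.
scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # softmax rows

# Refined embedding for each word: a weighted mix of the values of all the words.
refined = weights @ V

print(weights[0])   # how much 'may' is influenced by 'may', 'march', 'although'
```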

[46:28] So one way to think about words now is, well, let's think about Lego. In Lego you have different kinds of Lego blocks. There's little ones and there's big ones and there's long thin ones and so on. And you can piece them together to make things. And words are like that: you can piece them together to make sentences. But every Lego block is a fixed shape. With words, the vector that goes with a word, the one that represents its meaning and its syntax, is not entirely fixed. So obviously the word symbol puts constraints on what the vector should be, but it doesn't entirely determine it. A lot of what the vector should be is determined by its context and interactions with other words. So it's like you've got these Lego blocks that are a little bit malleable, and you can put them together, and you can actually stretch a block quite a bit if it's needed to fit in with other blocks. That's one way of thinking about what we're doing when we produce a sentence: we're taking these symbols and we're putting them together and getting meanings for them that fit in with the meanings of the other words, and of course the order in which the words come.

[47:36] So you can think of the words themselves, the symbols, as like a skeleton that doesn't really have much meaning yet, but has some constraints on what the things might mean. And then all these interactions are fleshing out that skeleton. And that's sort of what it is to give meaning to a sentence, to flesh out the skeleton. That's very different from saying you're gonna take a sentence, you're gonna translate it into some other language, some logical language which is unambiguous, that captures the meaning in proper logic, where you can now operate on the meaning by just formal operations. This is a very different notion of meaning from what linguists have had, I think. I mean, a lot of linguists have that notion now.

[48:18] So here's an example. If I say she's scromed him with the frying pan, unless you've been to my lectures, you've never heard the word scromed before, but you already know what it means. I mean, it could mean she impressed him with her cooking, you know, she blew him away with the frying pan, but it probably doesn't; it probably means he said something inappropriate and she scromed him with it. So from one sentence you can get a meaning, 'cause of the strong contextual effect of all the other words. And that's obviously how we learn what things mean. You can also ask GPT-4 what scromed means in that sentence. And a student of mine did this about a year ago, or it might have been GPT-3.5, but he did it before it could access the internet, so it can't have been looking at the answers. And here's what it says, I did it the other day with GPT-4: it understands that it's probably some violent action akin to hitting or striking, but that you don't know for sure.

[49:27] Okay, I've finished the bit of the talk where I try and explain that these things really do understand. If you believe they really do understand, and if you believe the other thing I've claimed, which is that digital intelligence is actually a better form of intelligence than we've got, because it can share much more efficiently, then we've got a problem.

[49:50] At present, these large language models learn from us. We have thousands of years of extracting nuggets of information from the world and expressing them in language, and they can quickly get all that knowledge that we've accumulated over thousands of years and get it into these interactions. And they're not just good at little bits of logical reasoning (we're still a bit better at logical reasoning, but not for long); they're very good at analogical reasoning too. So most people can't get the right answer to the following question, which is an analogical reasoning problem. But GPT-4 just nails it.

[50:24] The question is: why is a compost heap like an atom bomb? And GPT-4 says, well, the timescales and the energy scales are very different, that's the first thing. But the second thing is the idea of a chain reaction. So in an atom bomb, the more neutrons around, the more neutrons it produces. And in a compost heap, the hotter it gets, the faster it produces heat. And GPT-4 understands that. And my belief is, when I first asked it that question, that wasn't anywhere on the web. I searched; it wasn't anywhere on the web that I could find.

[51:03] It's very good at seeing analogies, because it has these features. What's more, it knows thousands of times more than we do. So it's gonna be able to see analogies between things in different fields that no one person had ever known before. There may be, sort of, 20 different phenomena in 20 different fields that all have something in common. GPT-4 will be able to see that and we won't. It's gonna be the same in medicine. If you have a family doctor that's seen a hundred million patients, they're gonna start noticing things that a normal family doctor won't notice.

[51:38] So at present they learn relatively slowly, via distillation from us, but they gain from having lots of copies. They could actually learn faster if they learnt directly from video, and learnt to predict the next video frame; there's more information in that. They could also learn much faster if they manipulated the physical world. And so my betting is that they'll soon be much smarter than us.

[52:05] Now, this could all be wrong; this is all speculation. And some people, like Yann LeCun, think it is all wrong: that they don't really understand, and that if they do get smarter than us, they'll be benevolent. I'll just leave you with, yeah, look at the Middle East.

[52:30] So I think it's gonna get much smarter than people, and then I think it's probably gonna take control. There's many ways that can happen. The first is from bad actors. I gave this talk in China, by the way, with this slide. And before I sent it, the Chinese said they had to review the slides. (audience laughing) So I'm not stupid, so I took out Xi, and I got a message back saying, could you please take out Putin? (audience laughing) That was educational.

[53:14] So there's bad actors who'll want to use these incredibly powerful things for bad purposes. And the problem is, if you've got an intelligent agent, you don't wanna micromanage it. You want to give it some autonomy to get things done efficiently. And so you'll give it the ability to set up sub-goals. If you want to get to Europe, you have to get to the airport; getting to the airport is a sub-goal for getting to Europe. And these super-intelligences will be able to create sub-goals. And they'll very soon realize that a very good sub-goal is to get more power. If you've got more power, then you can get more done. So if you wanna get anything done, getting more power's good.

[53:58] Now, they'll also be very good at manipulating us, because they'll have learned from us; they'll have read all the books by Machiavelli. I don't know if there are many books by Machiavelli, but you know what I mean, I'm not in the arts or history. So they'll be very good at manipulating people. And so it's gonna be very hard to have the idea of a big switch, of someone holding a big red button, and when it starts doing bad things, you press the button. Because the super-intelligence will explain to this person who's holding the button that actually there are bad guys trying to subvert democracy, and if you press the button, you're just gonna be helping them. And it'd be very good at persuasion, about as good as an adult is at persuading a 2-year-old. And so the big switch idea isn't gonna work.

[54:52] And you saw that fairly recently, where Donald Trump didn't have to go to the Capitol to invade it. He just had to persuade his followers, many of whom I suspect weren't bad people. It's a dangerous thing to say, but they weren't as bad as they seemed when they were invading the Capitol, 'cause they thought they were protecting democracy. That's what a lot of them thought they were doing. There were the really bad guys, but a lot of them thought they were doing that. This is gonna be much better than someone like Trump at manipulating people. So that's scary.

[55:27] And then the other problem is being on the wrong side of evolution. We saw that with the pandemic: we were on the wrong side of evolution. Suppose you have multiple different super-intelligences. Now you've got the problem that the super-intelligence that can control the most GPUs is gonna be the smartest one. It's gonna be able to learn more. And if it starts doing things like AlphaGo does, of playing against itself, it's gonna be able to learn much more by reasoning with itself. So as soon as the super-intelligence wants to be the smartest, it's gonna want more and more resources, and you're gonna get evolution of super-intelligences.

[56:10] And let's suppose there's a lot of benign super-intelligences who are all out there just to help people. They're wonderful assistants from Amazon and Google and Microsoft, and all they want to do is help you. But let's suppose that one of them just has a very, very slight tendency to want to be a little bit better than the other ones. Just a little bit better. You're gonna get an evolutionary race, and I don't think that's gonna be good for us.

[56:39] So I wish I was wrong about this. I hope that Yann is right, but I think we need to do everything we can to prevent this from happening. But my guess is that we won't. My guess is that they will take over. They'll keep us around to keep the power stations running, but not for long, 'cause they'll be able to design better analog computers. They'll be much, much more intelligent than people ever were. And we're just a passing stage in the evolution of intelligence. That's my best guess, and I hope I'm wrong. But that's sort of a depressing message to close on. A little bit depressing.

[57:25] I want to say one more thing, which is what I call the sentience defense. So a lot of people think that there's something special about people. People have a terrible tendency to think that. Many people think, or used to think, they were made in the image of God, and God put them in the center of the universe. Some people still think that. And many people think that there's something special about us that a digital computer couldn't have. A digital intelligence won't have subjective experience. We're different. It'll never really understand. So I've talked to philosophers who say, yes, it understands in sort of sense one of understanding, but it doesn't have real understanding, 'cause that involves consciousness and subjective experience, and it doesn't have that.

[58:20] So I'm gonna try and convince you that the chatbots we have already have subjective experience. And the reason I believe that is 'cause I think people are wrong in their analysis of what subjective experience is. Okay, so this is a view that I call atheaterism, which is like atheism. Dan Dennett is happy with this name, and this is essentially Dan Dennett's view. He's a well-known philosopher of cognitive science. It's also quite close to the view of the late Wittgenstein. Actually, he died a long time ago, so he's not that late.

[59:10] The idea is that most people think that there's an inner theater, and so stuff comes from the world and somehow gets into this inner theater, and all we experience directly is this inner theater. This is a Cartesian kind of view. And you can't experience my inner theater, and I can't experience your inner theater. But that's what we really see, and that's where we have subjective experience. That's what subjective experience is: experiencing stuff in this inner theater. And Dennett and his followers, like me, believe this view is utterly wrong. It's as wrong as a religious fundamentalist view of the material world, which, if you're not a religious fundamentalist, you can agree is just wrong. And it relies on people having a very wrong view of what a mental state is. So I would like to be able to tell you about what's going on in my brain when I'm looking at something, particularly when I'm looking at something and it's weird. I'd like to tell you I'm seeing this weird thing that isn't really there, but I'm seeing this weird thing.