Stanford CS25: V2 | Introduction to Transformers w/ Andrej Karpathy

Stanford Online
19 May 2023 · 71:40

Summary

TLDR: This course introduces the Transformer, a deep learning model that has had a revolutionary impact across natural language processing, computer vision, reinforcement learning, and many other fields. Taught by instructors at Stanford University, it covers the fundamentals of Transformers, the self-attention mechanism, and how the architecture is applied across different research areas. It also explores where Transformers are headed, including applications in video understanding and generation and in finance and business, and how model performance can be improved by enhancing controllability and reducing computational complexity.

Takeaways

  • 📚 CS 25 Transformers United V2 is a deep learning course offered at Stanford University in winter 2023, focusing on transformers and their revolutionary impact on AI and other fields.
  • 🤖 One of the instructors currently leads AI at a robotics startup; his research interests include reinforcement learning, computer vision, and modeling.
  • 🎓 Another instructor is a computer science PhD student at Stanford working mainly on natural language processing and computer vision.
  • 🚀 Since being introduced by Vaswani et al. in 2017, transformers have been widely applied in natural language processing, computer vision, biology, robotics, and beyond.
  • 🌟 The core mechanism of the transformer is self-attention, which lets the model make better use of context when processing sequences.
  • 📈 From 2017 to 2023, the applications of transformers in AI have kept expanding, especially in generative models (such as GPT and DALL-E) and multimodal tasks.
  • 🔍 The course covers how transformers work, how they are applied beyond NLP, and emerging research directions in these areas.
  • 🧠 The instructors suggest that the success of transformers may hint at how the brain works, since the brain is highly homogeneous and uniform across the cortex.
  • 🔑 The course emphasizes the flexibility of transformers: they make it easy to fuse information from different sources (such as images, audio, and text) for joint processing.
  • 🌐 The instructors discuss future directions for transformers, including video understanding and generation, finance and business applications, and domain-specific models (such as DoctorGPT and LawyerGPT).
  • 💡 The instructors highlight key challenges for transformers, including better long-sequence modeling, lower computational complexity, greater controllability, and alignment with the human brain.

Q & A

  • At which school is the CS 25 Transformers United V2 course offered?

    - CS 25 Transformers United V2 is offered at Stanford University.

  • What does this course mainly cover?

    - The course mainly covers the Transformer deep learning model, its applications in natural language processing, computer vision, reinforcement learning, biology, robotics, and other fields, and how Transformers are used across different research areas.

  • Which paper first introduced the Transformer?

    - The Transformer was first introduced by Vaswani et al. in their 2017 paper.

  • What are some applications of Transformers outside natural language processing (NLP)?

    - Outside NLP, Transformers are applied in fields such as computer vision, reinforcement learning, biology, and robotics.

  • What problems do RNNs and LSTMs have with long sequences, according to the course?

    - RNNs and LSTMs cannot effectively encode long sequences or their context.

  • What advantages do Transformers have in handling context?

    - Transformers understand the context of text better and are more accurate at content- and context-based prediction.

  • What are Codex, GPT, and DALL-E, as mentioned in the course?

    - Codex, GPT, and DALL-E are examples of transformer models with important applications in generative modeling, such as code generation, text generation, and image generation.

  • How is ChatGPT trained, according to the course?

    - ChatGPT is trained with reinforcement learning from human feedback to improve its performance.

  • What are possible future directions for Transformers?

    - Possible future directions include video understanding and generation, finance and business applications, long-sequence modeling, multi-task and multi-input prediction, and domain-specific models.

  • Which properties of the Transformer make it so effective in AI, according to the course?

    - Transformers are effective because they are highly expressive in the forward pass, easy to optimize, and, as shallow-and-wide networks, map very efficiently onto parallel GPU hardware.

Outlines

00:00

🎓 Course Introduction and Background

This segment introduces the background and content of CS 25 Transformers United V2, offered at Stanford University in winter 2023. The course is not about robots that transform into cars, but about the Transformer deep learning model, which has revolutionized natural language processing, computer vision, reinforcement learning, and many other fields. A series of videos will show how experts in different fields apply Transformers in their research. The introduction also covers the instructors' backgrounds and research interests, the teaching goals, and an overview of the course content.

05:02

🤖 Robotics and Applications of Transformers

This part discusses applications of Transformers in robotics and other fields, including general-purpose robots and how Transformers can improve algorithms in reinforcement learning and computer vision. The speaker also shares a personal passion for robotics, research and publications in robotics and autonomous driving, and hobbies such as music, martial arts, and K-dramas.

10:04

🚀 Course Content and Future Outlook

This segment outlines the course content and a look ahead at future developments. It covers the fundamentals of Transformers, including the self-attention mechanism, and goes deeper into models such as BERT and GPT. It also offers predictions about future research directions, including video understanding and generation, finance and business applications, and the long-sequence modeling problems that remain to be solved. It further discusses improving model generalization and controllability, and bringing models closer to how the human brain works.

15:06

📚 Historical Review and the Origins of Transformers

This part reviews the history of machine learning and artificial intelligence, particularly the origin and development of Transformers: from early feature descriptors and support vector machines, to neural networks for image classification and machine translation, to the introduction and spread of Transformers. It discusses how Transformers simplified research across different fields and may even hint that we are converging on something like how the brain works.

20:07

🧠 The Evolution from Neural Networks to Transformers

This segment traces the evolution from early neural networks to Transformers: from the simple neural language model of 2003, to sequence-to-sequence models in 2014, to the 2017 paper Attention is All You Need. It explains how the attention mechanism was proposed and how the Transformer architecture took shape, and recounts an exchange with Dzmitry, the inventor of the attention mechanism, revealing the inspiration and history behind it.

25:09

🌟 The Core of Transformers: The Attention Mechanism

This part dives into how the attention mechanism works and why it matters. It explains multi-head attention and self-attention, and discusses how communication and information flow between nodes are realized inside a Transformer. It also covers why positional encoding is needed and how different data types (text, images, and so on) are handled. Finally, a minimal Transformer implementation, nanoGPT, is used to demonstrate the flexibility and power of Transformers.

30:15

📈 A Worked Example of Training and Text Generation

This segment uses a concrete example, the Tiny Shakespeare dataset, to show how a Transformer model generates text. It walks through data preprocessing, batching, model training, and generation, and explains how positional encoding and self-attention let the model understand and generate text sequences. It also mentions using special tokens and masking to control the generation process, and extending the model's context by feeding outputs back in as inputs.

35:15

🤔 Reflections on Transformers and Future Directions

This part reflects on the efficiency, expressiveness, and optimizability of Transformers: how they act as a general-purpose, efficient, optimizable computer, and how they are applied across fields. It also discusses future developments, including improving performance by increasing context length and using external memory, and offers some thoughts on potential improvements.

40:17

🎉 Course Wrap-up and Thanks

This segment wraps up the lecture, thanks the audience, and briefly looks ahead, mentioning the ongoing nanoGPT project and an interest in building more advanced language-model products. It closes by thanking the audience and inviting applause for the speaker.

Keywords

💡Transformers

Transformers are a class of deep learning models that have revolutionized natural language processing (NLP) as well as other fields such as computer vision and reinforcement learning. In the video, the instructors introduce the basic structure of Transformers and how they process data through the self-attention mechanism.

💡Self-Attention

Self-attention is the core of the Transformer model. It lets the model assign different weights to each element of the sequence it is processing, so that the model can capture complex relationships within the sequence.
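
In symbols, the scaled dot-product attention from "Attention is All You Need" computes, for queries Q, keys K, and values V (with key dimension d_k):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```

The division by the square root of d_k keeps the dot products at a reasonable scale before the softmax.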

💡BERT

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language representation model based on the Transformer architecture. By pre-training on large amounts of text it learns deep language representations, which can then be applied to a wide range of NLP tasks.

💡GPT

GPT (Generative Pre-trained Transformer) is a Transformer-based pre-trained language model used mainly for text generation. GPT is pre-trained on large amounts of text, learns the patterns of language, and can generate coherent text given a context.

💡Encoder-Decoder Architecture

The encoder-decoder architecture is a common model structure for sequence-to-sequence tasks such as machine translation. The encoder is responsible for understanding the input sequence, while the decoder generates the target sequence based on the encoder's output.

💡Positional Encoding

Positional encoding is a mechanism added to Transformer models to provide information about the position of each element in a sequence. Because Transformers by themselves have no built-in notion of order, positional encoding lets the model understand the relative positions of sequence elements.
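
For reference, the original paper uses fixed sinusoidal encodings that are simply added to the token embeddings (learned and rotary variants are also common); for position pos and embedding dimensions 2i and 2i+1:

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right),\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)
```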

💡Multi-Head Attention

Multi-head attention is a key feature of Transformer models that lets the model learn from several representation subspaces at the same time. In this way the model can attend to multiple different aspects of a sequence in parallel, improving its ability to capture complex relationships.

💡Residual Connection

A residual connection is a network design technique that adds a layer's input directly to the output of subsequent layers, mitigating the vanishing-gradient problem in deep networks. In Transformers, residual connections help gradients flow through the network so the model can be trained effectively.

💡Layer Normalization

Layer normalization is a technique for stabilizing neural network training. By normalizing the activations within each layer, it reduces internal covariate shift, which speeds up training and improves generalization.

💡Few-Shot Learning

Few-shot learning refers to a model's ability to learn from only a small number of examples. This matters for data-scarce tasks, because it lets a model make useful predictions even with limited data.

💡Meta-Learning

Meta-learning is the process by which a model learns how to learn more effectively. In Transformers, this kind of in-context adaptation lets the model quickly extract useful information from a new input sequence without any additional gradient-descent training.

Highlights

CS 25 Transformers United V2 was offered at Stanford University in winter 2023 and introduces the transformer deep learning model and its revolutionary applications in AI and beyond.

Starting from natural language processing, transformers have been applied to computer vision, reinforcement learning, biology, robotics, and many other fields.

The course covers the fundamentals of transformers, including the self-attention mechanism, and goes deeper into models such as BERT and GPT.

The 2017 paper "Attention is All You Need" marked the birth of the transformer, proposing an entirely new architecture built on the attention mechanism.

The arrival of transformers sparked a revolution in AI, changing research and applications across fields from NLP to computer vision.

Transformers excel at handling long sequences, solving problems that earlier models such as RNNs and LSTMs could not handle effectively.

The attention mechanism makes transformers strong at context prediction, for example finding the noun that a given word refers to.

2021 marked the start of the generative era, with models such as Codex, GPT, and DALL-E making major progress on generative tasks.

Future developments for transformers include video understanding and generation, finance and business applications, and tackling long-sequence modeling and multimodal tasks.

An important research direction is improving the controllability of transformers, for example by making model outputs more stable.

Another research direction is aligning transformers with how the human brain works, to improve understanding and reasoning.

The transformer architecture has proven remarkably resilient over the past five years; despite many attempts to improve on it, the basic architecture has remained essentially unchanged.

The flexibility of transformers makes it easy to fuse data from different modalities, for example mixing image, text, and audio inputs.

Part of the effectiveness of transformers comes from their efficient computation on GPUs, which makes it possible to process large datasets and train large models.

The attention mechanism can be viewed as message passing on a graph, where the connections between nodes are weighted by attention scores.

Self-attention and multi-head attention allow the model to search for and integrate information from different parts of the input in parallel.

The encoder-decoder architecture enables transformers to handle sequence-to-sequence tasks such as machine translation and text generation.

The cross-attention mechanism allows the decoder to use the information captured by the encoder while generating the output sequence.

Transcripts

play00:05

Hi, everyone.

play00:06

Welcome to CS 25 Transformers United V2.

play00:09

This was a course that was held at Stanford

play00:11

in the winter of 2023.

play00:13

This course is not about robots that

play00:14

can transform into cars as this picture might suggest.

play00:17

Rather, it's about deep learning models

play00:18

that have taken the world by storm

play00:21

and have revolutionized the field of AI and others.

play00:23

Starting from natural language processing,

play00:25

transformers have been applied all over,

play00:27

computer vision, reinforcement learning, biology, robotics,

play00:30

et cetera.

play00:31

We have an exciting set of videos lined up for you

play00:34

with some truly fascinating speakers, talks, presenting

play00:37

how they're applying transformers

play00:39

to the research in different fields and areas.

play00:44

We hope you'll enjoy and learn from these videos.

play00:47

So without any further ado, let's get started.

play00:52

This is a purely introductory lecture.

play00:54

And we'll go into the building blocks of transformers.

play00:58

So first, let's start with introducing the instructors.

play01:03

So for me, I'm currently on a temporary deferral from the PhD

play01:06

program, and I'm leading AI at a robotics startup, Collaborative

play01:09

Robotics, that are working on some general purpose robots,

play01:13

somewhat like [INAUDIBLE].

play01:14

And I'm very passionate about robotics and building FSG

play01:18

learning algorithms.

play01:19

My research interests are in reinforcement learning,

play01:21

computer vision, and modeling, and I

play01:23

have a bunch of publications in robotics,

play01:25

autonomous driving, and other areas.

play01:28

My undergrad was at Cornell.

play01:29

If someone is from Cornell, so nice to [INAUDIBLE]..

play01:33

So I'm Stephen, currently a first-year CS PhD here.

play01:37

Previously did my master's at CMU and undergrad at Waterloo.

play01:40

I'm mainly into NLP research, anything involving language

play01:43

and text, but more recently, I've

play01:45

been getting more into computer vision as well as [INAUDIBLE]

play01:48

And just some stuff I do for fun, a lot of music

play01:51

stuff, mainly piano.

play01:52

Some self-promo of what I post a lot on my Insta, YouTube,

play01:55

and TikTok, so if you guys want to check it out.

play01:58

My friends and I are also starting a Stanford piano club,

play02:01

so if anybody's interested, feel free to email

play02:04

or DM me for details.

play02:07

Other than that, martial arts, bodybuilding, and huge fan

play02:11

of k-dramas, anime, and occasional gamer.

play02:14

[LAUGHS]

play02:18

OK, cool.

play02:19

Yeah, so my name is Rylan.

play02:20

Instead of talking about myself, I just

play02:21

want to very briefly say that I'm super

play02:23

excited to take this class.

play02:24

I took it the last time-- sorry-- to teach this.

play02:26

Excuse me.

play02:26

I took it the last time it was offered.

play02:28

I had a bunch of fun.

play02:30

I thought we brought in a really great group of speakers

play02:32

last time.

play02:33

I'm super excited for this offering.

play02:35

And yeah, I'm thankful that you're all here,

play02:37

and I'm looking forward to a really fun quarter together.

play02:39

Thank you.

play02:39

Yeah, so fun fact, Rylan was the most outspoken student

play02:42

last year.

play02:43

And so if someone wants to become an instructor next year,

play02:45

you know what to do.

play02:46

[LAUGHTER]

play02:50

OK, cool.

play02:53

Let's see.

play02:54

OK, I think we have a few minutes.

play02:56

So what we hope you will learn in this class is, first of all,

play02:59

how do transformers work, how they

play03:02

are being applied, just beyond NLP,

play03:04

and nowadays, like they are pretty [INAUDIBLE]

play03:06

them everywhere in AI machine learning.

play03:10

And what are some new and interesting directions

play03:12

of research in these topics.

play03:17

Cool, so this class is just an introductory.

play03:19

So we're just talking about the basics of transformers,

play03:22

introducing them, talking about the self-attention mechanism

play03:24

on which they're founded.

play03:26

And we'll do a deep dive more on models like BERT

play03:30

to GPT, stuff like that.

play03:32

So with that, happy to get started.

play03:35

OK, so let me start with presenting the attention

play03:38

timeline.

play03:40

Attention all started with this one paper.

play03:43

[INAUDIBLE] by Vaswani et al in 2017.

play03:46

That was the beginning of transformers.

play03:49

Before that, we had the prehistoric era,

play03:51

where we had models like RNNs, LSTMs,

play03:55

and simple attention mechanisms that didn't work

play03:57

or [INAUDIBLE].

play03:59

Starting 2017, we saw this explosion of transformers

play04:02

into NLP, where people started using it for everything.

play04:07

I even heard this quote from Google.

play04:08

It's like our performance increased every time

play04:10

we [INAUDIBLE]

play04:11

[CHUCKLES]

play04:15

For the [INAUDIBLE] after 2018 to 2020,

play04:17

we saw this explosion of transformers

play04:18

into other fields like vision, a bunch of other stuff,

play04:23

and like biology as a whole.

play04:25

And in last year, 2021 was the start

play04:28

of the generative era, where we got a lot of generative modeling,

play04:31

starting with models like Codex, GPT, DALL-E,

play04:35

Stable Diffusion-- so a lot of things

play04:37

happening in generative modeling.

play04:40

And we started scaling up in AI.

play04:44

And now, the present.

play04:45

So this is 2022 and the startup in '23.

play04:49

And now we have models like ChatGPT, Whisper,

play04:53

a bunch of others.

play04:54

And we're scaling onwards without splitting up,

play04:57

so that's great.

play04:58

So that's the future.

play05:01

So going more into this, so once there were RNNs.

play05:06

So we had Seq2Seq models, LSTMs, GRU.

play05:10

What worked there was that they were good at encoding history,

play05:13

but what did not work was they didn't encode long sequences

play05:17

and they were very bad at encoding context.

play05:21

So consider this example.

play05:24

Consider trying to predict the last word in the text,

play05:27

"I grew up in France, dot, dot, dot.

play05:29

I speak fluent ____."

play05:31

Here, you need to understand the context for it

play05:33

to predict French, and attention mechanism

play05:36

is very good at that, whereas if they're just using LSTMs,

play05:39

it doesn't here work that well.

play05:42

Another thing transformers are good at,

play05:46

beyond content-based prediction, is context prediction--

play05:50

like finding attention maps.

play05:52

If I have something like a word like it,

play05:56

what noun does it correlate to?

play05:57

And we can compute a proper attention weighting

play06:01

on one of the possible activations.

play06:05

And this works better than existing mechanisms.

play06:10

OK, so where we were in 2021, we were on the verge of takeoff.

play06:16

We were starting to realize the potential of transformers

play06:18

in different fields.

play06:20

We solved a lot of long sequence problems

play06:23

like protein folding, AlphaFold, offline RL.

play06:28

We started to see few-shot, zero-shot generalization.

play06:31

We saw multimodal tasks and applications

play06:34

like generating images from language.

play06:36

So that was DALL-E. And it feels like [INAUDIBLE]..

play06:43

And this was also a talk on transformers

play06:45

that you can watch on YouTube.

play06:48

Yeah, cool.

play06:51

And this is where we were going from 2021 to 2022,

play06:55

which is we have gone from the version of [INAUDIBLE]

play06:58

And now, we are seeing unique applications

play07:00

in audio generation, art, music, storytelling.

play07:03

We are starting to see these new capabilities

play07:05

like commonsense, logical reasoning,

play07:08

mathematical reasoning.

play07:09

We are also able to now get human alignment

play07:12

and interaction.

play07:13

They're able to use reinforcement learning

play07:15

and human feedback.

play07:16

That's how ChatGPT is trained to perform really well.

play07:19

We have a lot of mechanisms for controlling

play07:21

toxicity bias and ethics now.

play07:24

And there are a lot of also, a lot

play07:26

of developments in other areas like diffusion models.

play07:30

Cool.

play07:33

So the future is a spaceship, and we are all

play07:35

excited about it.

play07:39

And there's a lot of more applications

play07:40

that we can enable, and it'll be great

play07:44

if you can see transformers also up there.

play07:47

One big example is video understanding and generation.

play07:49

That is something that everyone is interested in,

play07:51

and I'm hoping we'll see a lot of models

play07:53

in this area this year, also, finance, business.

play07:59

I'll be very excited to see GPT author a novel,

play08:02

but we need to solve very long sequence modeling.

play08:04

And most transformer models are still

play08:07

limited to 4,000 tokens or something like that.

play08:09

So we need to make them generalize much

play08:13

better on long sequences.

play08:17

We also want to have generalized agents

play08:19

that can do a lot of multitask, a multi-input predictions

play08:27

like Gato.

play08:28

And so I think we will see more of that, too.

play08:31

And finally, we also want domain specific models.

play08:37

So you might want a GPT model, let's

play08:39

put it like maybe your health.

play08:41

So that could be like a DoctorGPT model.

play08:43

You might have a LawyerGPT model that's

play08:45

trained on only law data.

play08:46

So currently, we have GPT models that are trained on everything.

play08:49

But we might start to see more niche models that

play08:51

are good at one task.

play08:53

And we could have a mixture of experts,

play08:55

so it's like, you can think this is a--

play08:57

how you'd normally consult an expert,

play08:58

you'll have expert AI models.

play09:00

And you can go to a different AI model for your different needs.

play09:05

There are still a lot of missing ingredients

play09:07

to make this all successful.

play09:10

The first of all is external memory.

play09:12

We are already starting to see this with the models

play09:15

like ChatGPT, where the interactions are short-lived.

play09:18

There's no long-term memory, and they

play09:20

don't have ability to remember or store

play09:23

conversations for long-term.

play09:25

And this is something you want to fix.

play09:29

Second is reducing the computation complexity.

play09:32

So attention mechanism is quadratic over the sequence

play09:36

length, which is slow.

play09:37

And we want to reduce it and make it faster.

play09:42

Another thing we want to do is we

play09:44

want to enhance the controllability of these models

play09:46

like a lot of these models can be stochastic.

play09:48

And we want to be able to control what sort of outputs

play09:51

we get from them.

play09:52

And you might have experienced the ChatGPT,

play09:54

if you just refresh, you get different output each time.

play09:56

But you might want to have a mechanism that controls

play09:59

what sort of things you get.

play10:01

And finally, we want to align our state of art language

play10:04

models with how the human brain works.

play10:06

And we are seeing the surge, but we still

play10:09

need more research on seeing how we can make them more aligned.

play10:12

Thank you.

play10:14

Great, hi.

play10:16

Yes, I'm excited to be here.

play10:18

I live very nearby, so I got the invites to come to class.

play10:21

And I was like, OK, I'll just walk over.

play10:23

But then I spent like 10 hours on the slides,

play10:25

so it wasn't as simple.

play10:28

So yeah, I'm going to talk about transformers.

play10:30

I'm going to skip the first two over there.

play10:32

I'm not going to talk about those.

play10:34

We'll talk about that one just to simplify the lecture

play10:36

since we don't have time.

play10:39

OK, so I wanted to provide a little bit of context

play10:41

on why does this transformers class even exist.

play10:44

So a little bit of historical context.

play10:45

I feel like Bilbo over there.

play10:47

I enjoy, like, telling you guys about this.

play10:50

I don't know if you guys saw Lord of the Rings.

play10:52

And basically, I joined AI in roughly 2012, the full course,

play10:56

so maybe a decade ago.

play10:58

And back then, you wouldn't even say

play10:59

that you joined AI by the way.

play11:00

That was like a dirty word.

play11:02

Now, it's OK to talk about, but back then, it

play11:04

was not even deep learning.

play11:05

It was machine learning.

play11:06

That was the term we would use if you were serious.

play11:08

But now, now, AI is OK to use, I think.

play11:11

So basically, do you even realize

play11:13

how lucky you are potentially entering

play11:15

this area in roughly 2023?

play11:17

So back then, in 2011 or so when I was working specifically

play11:20

on computer vision, your pipelines looked like this.

play11:25

So you wanted to classify some images,

play11:28

you would go to a paper, and I think this is representative.

play11:30

You would have three pages in the paper describing

play11:32

all kinds of a zoo, of kitchen sink,

play11:34

of different kinds of features, descriptors.

play11:36

And you would go to a poster session

play11:38

and in computer vision conference,

play11:40

and everyone would have their favorite feature descriptor

play11:41

that they're proposing.

play11:42

And it's totally ridiculous, and you

play11:44

would take notes on which one you should incorporate

play11:45

into your pipeline because you would extract all of them,

play11:48

and then you would put an SVM on top.

play11:49

So that's what you would do.

play11:51

So there's two pages.

play11:52

Make sure you get your sparse SIFT histograms,

play11:54

your SSIMs, your color histograms, textons,

play11:56

tiny images.

play11:57

And don't forget the geometry specific histograms.

play11:59

All of them have basically complicated code by themselves.

play12:02

So you're collecting code from everywhere and running it,

play12:04

and it was a total nightmare.

play12:06

So on top of that, it also didn't work.

play12:10

[LAUGHTER]

play12:11

So this would be, I think, it represents the prediction

play12:14

from that time.

play12:15

You would just get predictions like this once in a while,

play12:17

and you'd be like, you just shrug your shoulders

play12:19

like that just happens once in a while.

play12:20

Today, you would be looking for a bug.

play12:23

And worse than that, every single chunk of AI

play12:30

had their own completely separate vocabulary

play12:32

that they work with.

play12:33

So if you go to NLP papers, those papers

play12:36

would be completely different.

play12:38

So you're reading the NLP paper, and you're like,

play12:40

what is this part of speech tagging,

play12:42

morphological analysis, and syntactic parsing,

play12:44

co-reference resolution?

play12:46

What is MPBTKJ?

play12:48

And you're confused.

play12:49

So the vocabulary and everything was completely different.

play12:51

And you couldn't read papers, I would

play12:52

say, across different areas.

play12:55

So now, that changed a little bit

play12:56

starting 2012 when Alex Krizhevsky and colleagues basically

play13:02

demonstrated that if you scale a large neural network

play13:05

on large data set, you can get very strong performance.

play13:08

And so up till then, there was a lot of focus on algorithms.

play13:10

But this showed that actually neural nets scale very well.

play13:13

So you need to now worry about compute and data,

play13:15

and you can scale it up.

play13:16

It works pretty well.

play13:17

And then that recipe actually did copy paste

play13:19

across many areas of AI.

play13:21

So we start to see neural networks pop up everywhere

play13:23

since 2012.

play13:25

So we saw them in computer vision, and NLP, and speech,

play13:28

and translation in RL and so on.

play13:30

So everyone started to use the same kind of modeling

play13:32

toolkit, modeling framework.

play13:33

And now when you go to NLP, and you start reading papers there,

play13:36

in machine translation, for example,

play13:38

this is a sequence to sequence paper

play13:40

which we'll come back to in a bit.

play13:41

You start to read those papers, and you're like, OK,

play13:44

I can recognize these words.

play13:45

Like there's a neural network.

play13:46

There's some parameters.

play13:47

There's an optimizer, and it starts to read things

play13:50

that you know of.

play13:50

So that decreased tremendously the barrier to entry

play13:54

across the different areas.

play13:56

And then, I think, the big deal is

play13:57

that when the transformer came out in 2017,

play14:00

it's not even that just the tool kits and the neural networks

play14:02

were similar-- there's that literally the architectures

play14:05

converged to like one architecture that you

play14:07

copy paste across everything seemingly.

play14:10

So this was kind of an unassuming machine translation

play14:12

paper at the time, proposing the transformer architecture.

play14:15

But what we found since then is that you can just basically

play14:17

copy paste this architecture and use it everywhere.

play14:21

And what's changing is the details of the data,

play14:23

and the chunking of the data, and how you feed it in.

play14:26

And that's a caricature, but it's

play14:28

kind of like a correct first order statement.

play14:29

And so now, papers are even more similar looking

play14:32

because everyone's just using transformer.

play14:34

And so this convergence was remarkable to watch

play14:38

and unfolded over the last decade.

play14:40

And it's pretty crazy to me.

play14:42

What I find interesting is I think

play14:44

this is some kind of a hint that we're maybe converging

play14:46

to something that maybe the brain is doing

play14:48

because the brain is very homogeneous and uniform

play14:50

across the entire sheet of your cortex.

play14:52

And OK, maybe some of the details are changing,

play14:54

but those feel like hyperparameters

play14:56

like a transformer.

play14:57

But your auditory cortex and your visual cortex

play14:59

and everything else looks very similar.

play15:01

And so maybe we're converging to some kind

play15:02

of a uniform powerful learning algorithm here.

play15:06

Something like that, I think, is interesting and exciting.

play15:09

OK, so I want to talk about where the transformer came

play15:11

from briefly, historically.

play15:12

So I want to start in 2003.

play15:15

I like this paper quite a bit.

play15:17

It was the first popular application of neural networks

play15:21

to the problem of language modeling,

play15:22

so predicting in this case, the next word

play15:24

in the sequence, which allows you to build

play15:26

generative models over text.

play15:27

And in this case, they were using multi-layer perceptron,

play15:29

so very simple neural net.

play15:30

The neural nets took three words and predicted the probability

play15:33

distribution for the fourth word in a sequence.

play15:36

So this was well and good at this point.

play15:39

Now, over time, people started to apply this

play15:41

to machine translation.

play15:43

So that brings us to sequence to sequence paper

play15:45

from 2014 that was pretty influential,

play15:48

and the big problem here was OK, we

play15:49

don't just want to take three words and predict the fourth.

play15:52

We want to predict how to go from an English sentence

play15:55

to a French sentence.

play15:56

And the key problem was OK, you can

play15:58

have arbitrary number of words in English and arbitrary number

play16:00

of words in French, so how do you

play16:03

get an architecture that can process

play16:04

this variably sized input?

play16:06

And so here they used an LSTM, and there's basically

play16:10

two chunks of this, which are covered here, by this.

play16:16

But basically have an encoder LSTM on the left,

play16:19

and it just consumes one word at a time

play16:22

and builds up a context of what it has read.

play16:24

And then that acts as a conditioning vector

play16:26

to the decoder RNN or LSTM.

play16:29

That basically goes chunk, chunk,

play16:30

chunk for the next word in a sequence,

play16:32

translating the English to French or something like that.

play16:35

Now, the big problem with this, that people identified,

play16:37

I think, very quickly and tried to resolve

play16:40

is that there's what's called this encoder bottleneck.

play16:43

So this entire English sentence that we are trying to condition

play16:46

on is packed into a single vector

play16:48

that goes from the encoder to the decoder.

play16:50

And so this is just too much information

play16:52

to potentially maintain in a single vector,

play16:54

and that didn't seem correct.

play16:55

And so people were looking around for ways

play16:57

to alleviate this encoder bottleneck, as it

play17:00

was called at the time.

play17:02

And so that brings us to this paper,

play17:03

Neural Machine Translation by Jointly Learning

play17:05

to Align and Translate.

play17:07

And here, just quoting from the abstract, "in this paper,

play17:11

we conjectured that the use of a fixed length vector

play17:13

is a bottleneck in improving the performance

play17:15

of the basic encoder-decoder architecture

play17:17

and propose to extend this by allowing

play17:19

the model to automatically soft search

play17:21

for parts of the source sentence that are relevant to predicting

play17:24

a target word without having to form

play17:28

these parts or hard segments exclusively."

play17:30

So this was a way to look back to the words that

play17:34

are coming from the encoder.

play17:35

And it was achieved using this soft search.

play17:38

So as you are decoding in the words

play17:42

here, while you are decoding them,

play17:44

you are allowed to look back at the words

play17:45

at the encoder via this soft attention mechanism proposed

play17:49

in this paper.

play17:50

And so this paper, I think, is the first time that I saw,

play17:52

basically, attention.

play17:55

So your context vector that comes from the encoder

play17:58

is a weighted sum of the hidden states

play18:01

of the words in the encoding.

play18:05

And then the weights of this sum come

play18:07

from a softmax that is based on these compatibilities

play18:10

between the current state as you're decoding

play18:13

and the hidden states generated by the encoder.

play18:15

And so this is the first time that really you

play18:17

start to look at it, and this is the current modern equations

play18:22

of the attention.
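
For reference, the equations being described here, in the notation of the Bahdanau et al. (2014) paper, are:

```latex
c_i = \sum_{j=1}^{T_x} \alpha_{ij}\, h_j,\qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})},\qquad
e_{ij} = a(s_{i-1}, h_j)
```

where h_j are the encoder hidden states, s_{i-1} is the previous decoder state, a is a small learned scoring network, and c_i is the context vector used to predict the i-th target word.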

play18:23

And I think this was the first paper that I saw it in.

play18:25

It's the first time that there's a word

play18:27

attention used, as far as I know, to call this mechanism.

play18:32

So I actually tried to dig into the details of the history

play18:34

of the attention.

play18:35

So the first author here, Dzmitry, I

play18:38

had an email correspondence with him,

play18:40

and I basically sent him an email.

play18:41

I'm like, Dzmitry, this is really interesting.

play18:43

Just rumors have taken over.

play18:44

Where did you come up with the soft attention

play18:45

mechanism that ends up being the heart of the transformer?

play18:48

And to my surprise, he wrote me back this massive email, which

play18:52

was really fascinating.

play18:52

So this is an excerpt from that email.

play18:57

So basically, he talks about how he was looking for a way

play18:59

to avoid this bottleneck between the encoder and decoder.

play19:02

He had some ideas about cursors that

play19:04

traverse the sequences that didn't quite work out.

play19:06

And then here, "so one day, I had this thought

play19:08

that it would be nice to enable the decoder

play19:10

RNN to learn to search where to put the cursor in the source

play19:13

sequence.

play19:14

This was sort of inspired by translation exercises

play19:16

that learning English in my middle school involved.

play19:21

Your gaze shifts back and forth between source and target,

play19:23

sequence as you translate."

play19:24

So literally, I thought that this was kind of interesting,

play19:27

that he's not a native English speaker,

play19:28

and here, that gave him an edge in this machine translation

play19:31

that led to attention and then led to transformer.

play19:34

So that's really fascinating.

play19:37

"I expressed a soft search a softmax

play19:38

and then weighted averaging of the [INAUDIBLE] states.

play19:40

And basically, to my great excitement,

play19:43

this worked from the very first try."

play19:45

So really, I think, interesting piece of history.

play19:48

And as it later turned out that the name of RNN search

play19:51

was kind of lame, so the better name attention came

play19:54

from Yoshua on one of the final passes

play19:57

as they went over the paper.

play19:58

So maybe Attention is All You Need

play20:00

would have been called RNN Search is All You Need,

play20:03

but we have Yoshua Bengio to thank

play20:05

for a little bit of better name, I would say.

play20:07

So apparently, that's the history

play20:08

of this, which I thought was interesting.

play20:11

OK, so that brings us to 2017, which is Attention

play20:13

is All You Need.

play20:14

So this attention component, which

play20:16

in Dzmitry's paper was just one small segment,

play20:19

and there's all this bidirectional RNN, RNN

play20:21

and decoder, and this Attention is All You Need paper is saying,

play20:25

OK, you can actually delete everything.

play20:26

What's making this work very well

play20:28

is just attention by itself.

play20:29

And so delete everything, keep attention.

play20:32

And then what's remarkable about this paper actually is usually,

play20:35

you see papers that are very incremental.

play20:36

They add one thing, and they show that it's better.

play20:39

But I feel like Attention is All You

play20:41

Need was like a mix of multiple things at the same time.

play20:44

They were combined in a very unique way,

play20:46

and then also achieve a very good local minimum

play20:49

in the architecture space.

play20:50

And so to me, this is really a landmark paper

play20:52

that is quite remarkable and, I think,

play20:55

had quite a lot of work behind the scenes.

play20:58

So delete all the RNN, just keep attention.

play21:01

Because attention operates over sets--

play21:03

and I'm going to go to this in a second--

play21:05

you now need to positionally encode your inputs

play21:07

because attention doesn't have the notion of space by itself.

play21:14

I have to be very careful.

play21:17

They adopted this residual network structure

play21:19

from ResNets.

play21:21

They interspersed attention with multi-layer perceptrons.

play21:24

They used layer norms, which came from a different paper.

play21:27

They introduced the concept of multiple heads of attention

play21:29

that were applied in parallel.

play21:30

And they gave us, I think, like a fairly good set

play21:33

of hyperparameters that to this day are used.

play21:35

So the expansion factor in the multi-layer perceptron goes up

play21:39

by 4X--

play21:40

and we'll go into a bit more detail--

play21:41

and this 4X has stuck around.

play21:43

And I believe there's a number of papers

play21:44

that try to play with all kinds of little details

play21:47

of the transformer, and nothing sticks because this is actually

play21:50

quite good.

play21:51

The only thing to my knowledge that didn't stick

play21:54

was this reshuffling of the layer norms

play21:56

to go into the prenorm version where here you

play21:59

see the layer norms are after the multiheaded attention feed

play22:01

forward.

play22:02

They just put them before instead.

play22:04

So just reshuffling of layer norms, but otherwise,

play22:06

the GPTs and everything else that you're seeing today

play22:08

is basically the 2017 architecture from 5 years ago.

play22:11

And even though everyone is working on it,

play22:13

it's been proven remarkably resilient,

play22:15

which I think is real interesting.

play22:17

There are innovations that, I think,

play22:18

have been adopted also in positional encoding.

play22:21

It's more common to use different rotary and relative

play22:24

positional encoding and so on.

play22:25

So I think there have been changes, but for the most part,

play22:28

it's proven very resilient.

play22:31

So really quite an interesting paper.

play22:32

Now, I wanted to go into the attention mechanism.

play22:36

And I think, the way I interpret it is not similar to the ways

play22:43

that I've seen it presented before.

play22:44

So let me try a different way of how I see it.

play22:47

Basically, to me, attention is kind of like the communication

play22:49

phase of the transformer, and the transformer

play22:52

interweaves two phases of the communication phase, which

play22:55

is the multi-headed attention, and the computation

play22:57

stage, which is this multilayered perceptron

play23:00

or [INAUDIBLE].

play23:01

So in the communication phase, it's

play23:03

really just a data dependent message

play23:05

passing on directed graphs.

play23:07

And you can think of it as OK, forget everything

play23:09

with machine translation, everything.

play23:10

Let's just-- we have directed graphs.

play23:13

At each node, you are storing a vector.

play23:16

And then let me talk now about the communication

play23:18

phase of how these vectors talk to each other

play23:20

and this directed graph.

play23:21

And then the compute phase later is just

play23:23

a multi-layer perceptron, which then basically acts on every node

play23:27

individually.

play23:28

But how do these nodes talk to each other

play23:30

in this directed graph?

play23:32

So I wrote like some simple Python--

play23:36

I wrote this in Python basically to create

play23:39

one round of communication of using attention

play23:44

as the message passing scheme.

play23:46

So here, a node has this private data vector,

play23:51

as you can think of it as private information

play23:53

to this node.

play23:54

And then it can also emit a key, a query, and a value.

play23:57

And simply, that's done by linear transformation

play24:00

from this node.

play24:01

So the key is what are the things that I am--

play24:07

sorry.

play24:07

The query is what are the things that I'm looking for?

play24:10

The key is, what are the things that I have?

play24:12

And the value is what are the things that I will communicate?

play24:15

And so then when you have your graph that's

play24:16

made up of nodes and some random edges, when you actually

play24:19

have these nodes communicating, what's happening is

play24:21

you loop over all the nodes individually

play24:23

in some random order, and you're at some node,

play24:27

and you get the query vector q, which

play24:29

is, I'm a node in some graph, and this

play24:32

is what I'm looking for.

play24:33

And so that's just achieved via this linear transformation

play24:36

here.

play24:36

And then we look at all the inputs that point to this node,

play24:39

and then they broadcast what are the things that I have,

play24:42

which is their keys.

play24:44

So they broadcast the keys.

play24:45

I have the query, then those interact by dot product

play24:49

to get scores.

play24:51

So basically, simply by doing dot product,

play24:53

you get some unnormalized weighting

play24:55

of the interestingness of all of the information in the nodes

play24:59

that point to me and to the things I'm looking for.

play25:02

And then when you normalize that with softmax,

play25:03

so it just sums to 1, you basically just

play25:06

end up using those scores, which now sum to 1 in our probability

play25:09

distribution, and you do a weighted sum of the values

play25:13

to get your update.

play25:15

So I have a query.

play25:17

They have keys, dot products to get interestingness or like

play25:21

affinity, softmax to normalize it, and then

play25:24

weighted sum of those values flow to me and update me.

play25:27

And this is happening for each node individually.

play25:29

And then we update at the end.
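
For concreteness, here is a minimal NumPy sketch of one round of this message passing (an illustrative rewrite of the idea, not the exact Python shown in the lecture; all names are made up):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class Node:
    """A node in the directed graph, holding a private data vector."""
    def __init__(self, d):
        self.data = np.random.randn(d)
        self.wq = np.random.randn(d, d)   # what am I looking for?
        self.wk = np.random.randn(d, d)   # what do I have?
        self.wv = np.random.randn(d, d)   # what will I communicate?
    def query(self): return self.wq @ self.data
    def key(self):   return self.wk @ self.data
    def value(self): return self.wv @ self.data

def attention_round(nodes, edges):
    """edges[i] lists the nodes that point to node i (including i itself)."""
    updates = []
    for i, node in enumerate(nodes):
        q = node.query()
        keys = np.stack([nodes[j].key() for j in edges[i]])
        vals = np.stack([nodes[j].value() for j in edges[i]])
        scores = softmax(keys @ q)        # affinities, normalized to sum to 1
        updates.append(scores @ vals)     # weighted sum of incoming values
    for node, u in zip(nodes, updates):   # update every node at the end
        node.data = u

nodes = [Node(d=4) for _ in range(3)]
attention_round(nodes, edges={0: [0], 1: [0, 1], 2: [0, 1, 2]})  # causal-style connectivity
```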

play25:30

And so this kind of a message passing scheme

play25:32

is at the heart of the transformer.

play25:35

And it happens in the more vectorized batched way

play25:40

that is more confusing and is also interspersed with layer

play25:44

norms and things like that to make the training behave

play25:46

better.

play25:47

But that's roughly what's happening in the attention

play25:49

mechanism, I think, on a high level.

play25:53

So yeah, so in the communication phase of the transformer, then

play25:59

this message passing scheme happens

play26:00

in every head in parallel and then in every layer in series

play26:06

and with different weights each time.

play26:08

And that's it as far as the multi-headed attention goes.

play26:13

And so if you look at these encoder-decoder models,

play26:15

you can think of it then in terms of the connectivity

play26:18

of these nodes in the graph.

play26:19

You can think of it as like, OK, all these tokens that

play26:21

are in the encoder that we want to condition on,

play26:23

they are fully connected to each other.

play26:25

So when they communicate, they communicate fully

play26:28

when you calculate their features.

play26:30

But in the decoder, because we are

play26:32

trying to have a language model, we

play26:33

don't want to have communication for future tokens

play26:35

because they give away the answer at this step.

play26:38

So the tokens in the decoder are fully connected

play26:40

from all the encoder states, and then they

play26:43

are also fully connected from everything that is decoding.

play26:46

And so you end up with this triangular structure

play26:49

in the data graph.

play26:50

But that's the message passing scheme

play26:52

that this basically implements.

play26:54

And then you have to be also a little bit careful because

play26:57

in the cross attention here with the decoder,

play26:59

you consume the features from the top of the encoder.

play27:01

So think of it as in the encoder,

play27:03

all the nodes are looking at each other,

play27:05

all the tokens are looking at each other many, many times.

play27:08

And they really figure out what's in there,

play27:09

and then the decoder is looking only at the top nodes.

play27:14

So that's roughly the message passing scheme.

play27:16

I was going to go into more of an implementation

play27:18

of a transformer.

play27:19

I don't know if there's any questions about this.

play27:23

[INAUDIBLE] self-attention and multi-headed attention,

play27:26

but what is the advantage of [INAUDIBLE]??

play27:30

Yeah, so self-attention and multi-headed attention, so

play27:35

the multi-headed attention is just this attention scheme,

play27:38

but it's just applied multiple times in parallel.

play27:40

Multiple heads just means independent applications

play27:42

of the same attention.

play27:44

So this message passing scheme basically just

play27:47

happens in parallel multiple times

play27:49

with different weights for the query, key, and value.

play27:52

So you can almost look at it like in parallel, I'm

play27:55

looking for, I'm seeking different kinds of information

play27:57

from different nodes.

play27:59

And I'm collecting it all in the same node.

play28:01

It's all done in parallel.

play28:03

So heads is really just copy-paste in parallel.

play28:06

And layers are copy-paste but in series.

play28:12

Maybe that makes sense.

play28:15

And self-attention, when it's self-attention,

play28:18

what it's referring to is that the node here

play28:21

produces its own keys, queries, and values.

play28:23

So as I described it here, this is really self-attention

play28:25

because every one of these nodes produces

play28:27

a key query and a value from this individual node.

play28:30

When you have cross-attention, you have one cross-attention

play28:33

here, coming from the encoder.

play28:36

That just means that the queries are still

play28:38

produced from this node, but the keys and the values

play28:42

are produced as a function of nodes that

play28:44

are coming from the encoder.

play28:48

So I have my queries because I'm trying to decode some--

play28:52

the fifth word in the sequence.

play28:53

And I'm looking for certain things

play28:55

because I'm the fifth word.

play28:56

And then the keys and the values in terms

play28:58

of the source of information that could answer my queries

play29:01

can come from the previous nodes in the current decoding

play29:04

sequence or from the top of the encoder.

play29:06

So all the nodes that have already seen all

play29:09

of the encoding tokens many, many times cannot broadcast

play29:12

what they contain in terms of information.

play29:14

So I guess, to summarize, the self-attention is--

play29:18

sorry, cross-attention and self-attention

play29:20

only differ in where the keys and the values come from.

play29:24

Either the keys and values are produced from this node,

play29:28

or they are produced from some external source like an encoder

play29:31

and the nodes over there.

play29:33

But algorithmically, it's the same mathematical operations.
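
A minimal PyTorch sketch of that difference, with shapes and names assumed purely for illustration; the math is identical, only the source of the keys and values changes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C = 16                                            # feature size (assumed)
W_q, W_k, W_v = nn.Linear(C, C), nn.Linear(C, C), nn.Linear(C, C)

def attend(x_q, x_kv):
    q = W_q(x_q)                                  # queries from the tokens doing the asking
    k, v = W_k(x_kv), W_v(x_kv)                   # keys/values from the tokens being consulted
    att = (q @ k.transpose(-2, -1)) / (C ** 0.5)  # (causal masking omitted for brevity)
    return F.softmax(att, dim=-1) @ v             # weighted sum of values

x_enc = torch.randn(1, 10, C)                     # encoder token features
x_dec = torch.randn(1, 5, C)                      # decoder token features
self_out  = attend(x_dec, x_dec)                  # self-attention: k, v from the same tokens
cross_out = attend(x_dec, x_enc)                  # cross-attention: k, v from the encoder
```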

play29:39

Question.

play29:39

Yeah, OK.

play29:40

So two questions for you.

play29:41

First question is, in the message passing [INAUDIBLE]

play29:56

So think of-- so each one of these nodes is a token.

play30:04

I guess I don't have a very good picture of it

play30:06

in the transformer.

play30:06

But this node here could represent the third word

play30:14

in the output in the decoder, and in the beginning,

play30:19

it is just the embedding of the word.

play30:27

And then, OK, I have to think through this analogy

play30:30

a little bit more.

play30:31

I came up with it this morning.

play30:32

[LAUGHTER]

play30:34

[INAUDIBLE]

play30:39

What example of instantiation [INAUDIBLE] nodes

play30:45

as in in blocks were embedding?

play30:50

These nodes are basically the vectors.

play30:53

I'll go to an implementation.

play30:54

I'll go to the implementation, and then maybe I'll

play30:56

make the connections to the graph.

play30:58

So let me try to first go to-- let me now go to,

play31:01

with this intuition in mind, at least,

play31:03

to a nanoGPT, which is a concrete implementation

play31:05

of a transformer that is very minimal.

play31:06

So I worked on this over the last few days,

play31:08

and here it is reproducing GPT-2 on open web text.

play31:11

So it's a pretty serious implementation that reproduces

play31:14

GPT-2, I would say, and provided enough compute--

play31:17

This was one node of 8 GPUs for 38 hours or something

play31:21

like that, if I remember correctly.

play31:22

And it's very readable.

play31:23

It's 300 lines, so everyone can take a look at it.

play31:27

And yeah, let me basically briefly step through it.

play31:30

So let's try to have a decoder-only transformer.

play31:34

So what that means is that it's a language model.

play31:36

It tries to model the next word in the sequence

play31:39

or the next character in the sequence.

play31:41

So the data that we train on this

play31:43

is always some kind of text.

play31:44

So here's some fake Shakespeare.

play31:45

Sorry, this is real Shakespeare.

play31:47

We're going to produce fake Shakespeare.

play31:48

So this is called a Tiny Shakespeare

play31:50

dataset, which is one of my favorite toy datasets.

play31:52

You take all of Shakespeare, concatenate it,

play31:54

and it's 1 megabyte file, and then

play31:55

you can train language models on it

play31:56

and get infinite Shakespeare, if you like,

play31:58

which I think is kind of cool.

play31:59

So we have a text.

play32:00

The first thing we need to do is we

play32:02

need to convert it to a sequence of integers

play32:05

because transformers natively process--

play32:09

you can't plug text into transformer.

play32:10

You need to somehow encode it.

play32:11

So the way that encoding is done is

play32:13

we convert, for example, in the simplest case,

play32:15

every character gets an integer, and then instead of "hi

play32:18

there," we would have this sequence of integers.

play32:21

So then you can encode every single character as an integer

play32:25

and get a massive sequence of integers.

play32:27

You just concatenate it all into one

play32:29

large, long one-dimensional sequence.

play32:31

And then you can train on it.
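
A minimal sketch of this character-level encoding (illustrative; nanoGPT's actual data preparation differs in its details, and 'input.txt' is assumed to hold the concatenated text):

```python
text = open('input.txt').read()              # e.g. ~1 MB of concatenated Shakespeare
chars = sorted(set(text))                    # the vocabulary: every distinct character
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

encode = lambda s: [stoi[c] for c in s]      # string -> list of integers
decode = lambda ids: ''.join(itos[i] for i in ids)

data = encode(text)                          # one long 1-D sequence of integers
print(encode("hi there"))                    # a short sequence of integers, one per character
```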

play32:32

Now, here, we only have a single document.

play32:34

In some cases, if you have multiple independent documents,

play32:36

what people like to do is create special tokens,

play32:38

and they intersperse those documents

play32:40

with those special end of text tokens

play32:42

that they splice in between to create boundaries.

play32:46

But those boundaries actually don't have any modeling impact.

play32:50

It's just that the transformer is supposed

play32:52

to learn via backpropagation that the end of document

play32:55

sequence means that you should wipe the memory.

play33:00

OK, so then we produce batches.

play33:02

So these batches of data just mean

play33:04

that we go back to the one-dimensional sequence,

play33:06

and we take out chunks of this sequence.

play33:08

So say, if the block size is 8, then the block size indicates

play33:13

the maximum length of context that your transformer will

play33:17

process.

play33:18

So if our block size is 8, that means

play33:20

that we are going to have up to eight characters of context

play33:23

to predict the ninth character in a sequence.

play33:26

And the batch size indicates how many sequences in parallel

play33:29

we're going to process.

play33:30

And we want this to be as large as possible,

play33:31

so we're fully taking advantage of the GPU

play33:33

and the parallels [INAUDIBLE] So in this example,

play33:36

we're doing 4 by 8 batches.

play33:38

So every row here is an independent example

play33:41

and then every row here is a small chunk of the sequence

play33:47

that we're going to train on.

play33:48

And then we have both the inputs and the targets

play33:50

at every single point here.

play33:52

So to fully spell out what's contained in a single 4

play33:55

by 8 batch to the transformer--

play33:57

I sort of compact it here--

play33:59

so when the input is 47, by itself, the target is 58.

play34:04

And when the input is the sequence 47, 58,

play34:07

the target is one.

play34:08

And when it's 47, 58, 1, the target is 51 and so on.

play34:13

So actually, the single batch of examples that's 4 by 8

play34:15

actually has a ton of individual examples

play34:17

that we are expecting a transformer

play34:18

to learn on in parallel.

play34:21

And so you'll see that the batches are learned

play34:23

on completely independently, but the time dimension here along

play34:28

horizontally is also trained on in parallel.

play34:30

So your real batch size is more like B times T.

play34:34

And it's just that the context grows linearly

play34:37

for the predictions that you make along the T direction

play34:41

in the model.

play34:42

So this is all the examples that the model will learn from,

play34:45

this single batch.
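
A minimal sketch of how such a batch can be carved out of the one-dimensional token sequence, with targets simply being the inputs shifted by one (illustrative, following the idea rather than nanoGPT's exact code; names are made up):

```python
import torch

def get_batch(data, block_size=8, batch_size=4):
    ix = torch.randint(len(data) - block_size, (batch_size,))                  # random chunk starts
    x = torch.stack([data[i : i + block_size] for i in ix.tolist()])           # inputs, shape (B, T)
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix.tolist()])   # targets, shifted by one
    return x, y   # every one of the B*T positions is its own training example

data = torch.randint(0, 65, (1000,))   # stand-in for the encoded text sequence
xb, yb = get_batch(data)               # xb and yb both have shape (4, 8)
```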

play34:48

So now, this is the GPT class.

play34:52

And because this is a decoder-only model,

play34:55

so we're not going to have an encoder because there's no

play34:58

English we're translating from--

play34:59

we're not trying to condition in some other external

play35:02

information.

play35:02

We're just trying to produce a sequence of words that

play35:05

follow each other or likely to.

play35:08

So this is all PyTorch, and I'm going slightly faster

play35:10

because I'm assuming people have taken 231 or something

play35:12

along those lines.

play35:15

But here in the forward pass, we take these indices,

play35:19

and then we both encode the identity of the indices,

play35:24

just via an embedding lookup table.

play35:26

So every single integer, we index into a lookup table of

play35:31

vectors in this nn.Embedding, and pull out

play35:34

the word vector for that token.

play35:38

And then because the transformer by itself

play35:41

doesn't actually-- it processes sets natively.

play35:43

So we need to also positionally encode these vectors

play35:45

so that we basically have both the information

play35:47

about the token identity and its place in the sequence from 1

play35:51

to block size.

play35:53

Now, the information about what and where

play35:56

is combined additively, so the token embeddings

play35:58

and the positional embeddings are just added exactly as here.

play36:02

So then there's optional dropout,

play36:06

this x here basically just contains

play36:08

the set of words and their positions,

play36:14

and that feeds into the blocks of transformer.

play36:16

And we're going to look into what's block here.

play36:18

But for here, for now, this is just a series

play36:20

of blocks in a transformer.

play36:22

And then in the end, there's a layer norm,

play36:23

and then you're decoding the logits

play36:26

for the next word or next integer in a sequence,

play36:30

using the linear projection of the output of this transformer

play36:33

So lm_head here is short for language model head.

play36:36

It's just a linear function.

play36:38

So basically, positionally encode all the words,

play36:42

feed them into a sequence of blocks,

play36:45

and then apply a linear layer to get the probability

play36:47

distribution for the next character.

play36:50

And then if we have the targets, which

play36:51

we produced in the data loader--

play36:54

and you'll notice that the targets are just

play36:55

the inputs offset by one in time--

play36:59

then those targets feed into a cross entropy loss.

play37:01

So this is just a negative log likelihood

play37:03

typical classification loss.
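
A condensed sketch of the decoder-only forward pass just described (loosely in the spirit of nanoGPT, not its actual code; dropout, weight tying, and other details are omitted, and the block modules are sketched further below):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGPT(nn.Module):
    def __init__(self, vocab_size, block_size, n_embd, blocks):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)   # "what": token identity
        self.pos_emb = nn.Embedding(block_size, n_embd)   # "where": position in the block
        self.blocks = nn.ModuleList(blocks)               # communicate + compute stages
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)      # logits for the next token

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)         # "what" and "where", combined additively
        for block in self.blocks:
            x = block(x)
        logits = self.lm_head(self.ln_f(x))
        loss = None
        if targets is not None:                           # targets = inputs shifted by one
            loss = F.cross_entropy(logits.view(B * T, -1), targets.view(B * T))
        return logits, loss
```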

play37:04

So now let's drill into what's here in the blocks.

play37:08

So these blocks that are applied sequentially,

play37:11

there's, again, as I mentioned, this communicate

play37:13

phase and the compute phase.

play37:15

So in the communicate phase, all the nodes

play37:17

get to talk to each other, and so these nodes are basically,

play37:21

if our block size is 8, then we are

play37:23

going to have eight nodes in this graph.

play37:26

There's eight nodes in this graph.

play37:28

The first node is pointed to only by itself.

play37:30

The second node is pointed to by the first node and itself.

play37:33

The third node is pointed to by the first two nodes

play37:35

and itself, et cetera.

play37:36

So there's eight nodes here.

play37:38

So you apply-- there's a residual pathway and x.

play37:42

You take it out.

play37:43

You apply a layer norm, and then the self-attention

play37:45

so that these communicate, these eight nodes communicate.

play37:47

But you have to keep in mind that the batch is 4.

play37:50

So because batch is 4, this is also applied--

play37:54

so we have eight nodes communicating,

play37:55

but there's a batch of four of them individually communicating

play37:58

in one of those eight nodes.

play37:59

There's no crisscross across the batch dimension, of course.

play38:02

There's no batch anywhere luckily.

play38:04

And then once they've exchanged information,

play38:06

they are processed using the multi-layer perceptron.

play38:09

And that's the compute phase.

play38:12

And then also here we are missing the cross-attention

play38:18

because this is a decoder-only model.

play38:19

So all we have is this step here,

play38:21

the multi-headed attention, and that's

play38:22

this line, the communicate phase.

play38:24

And then we have the feed forward, which is the MLP,

play38:27

and that's the compute phase.

play38:29

I'll take questions a bit later.

play38:31

Then the MLP here is fairly straightforward.

play38:34

The MLP is just individual processing on each node,

play38:38

just transforming the feature representation at that node.

play38:41

So applying a two-layer neural net

play38:45

with a GELU nonlinearity, which is just

play38:47

think of it as a ReLU or something like that.

play38:49

It's just a nonlinearity.

play38:51

And then MLP is straightforward.

play38:53

I don't think there's anything too crazy there.
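
Putting the two phases together, a minimal sketch of one such block, with pre-norm residual connections around the attention "communicate" step and the MLP "compute" step (illustrative, loosely following nanoGPT):

```python
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),   # the 4x expansion factor mentioned earlier
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )
    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    def __init__(self, n_embd, attn):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(n_embd), nn.LayerNorm(n_embd)
        self.attn = attn                     # e.g. a causal self-attention module
        self.mlp = MLP(n_embd)
    def forward(self, x):
        x = x + self.attn(self.ln1(x))       # communicate: the nodes exchange information
        x = x + self.mlp(self.ln2(x))        # compute: each node is processed individually
        return x
```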

play38:55

And then this is the causal self-attention part,

play38:57

the communication phase.

play38:59

So this is like the meat of things

play39:01

and the most complicated part.

play39:03

It's only complicated because of the batching

play39:06

and the implementation detail of how you mask the connectivity

play39:10

in the graph so that you can't obtain

play39:13

any information from the future when

play39:15

you're predicting your token.

play39:16

Otherwise, it gives away the information.

play39:18

So if I'm the fifth token and if I'm the fifth position,

play39:23

then I'm getting the fourth token coming into the input,

play39:26

and I'm attending to the third, second, and first,

play39:29

and I'm trying to figure out what is the next token.

play39:32

Well then, in this batch, in the next element

play39:34

over in the time dimension, the answer is at the input.

play39:37

So I can't get any information from there.

play39:40

So that's why this is all tricky,

play39:41

but basically, in the forward pass,

play39:45

we are calculating the queries, keys, and values based on x.

play39:50

So these are the keys, queries, and values.

play39:52

Here, when I'm computing the attention,

play39:54

I have the queries matrix-multiplying the keys.

play39:58

So this is the dot product in parallel for all the queries

play40:00

and all the keys in all the heads.

play40:03

So I failed to mention that there's also

play40:06

the aspect of the heads, which is also done all in parallel

play40:08

here.

play40:09

So we have the batch dimension, the time dimension,

play40:10

and the head dimension, and you end up

play40:12

with five-dimensional tensors, and it's all really confusing.

play40:14

So I invite you to step through it later and convince yourself

play40:17

that this is actually doing the right thing.

play40:19

But basically, you have the batch dimension, the head

play40:21

dimension and the time dimension,

play40:23

and then you have features at them.

play40:25

And so this is evaluating for all the batch elements, for all

play40:28

the head elements, and all the time elements,

play40:31

the simple Python that I gave you earlier, which is query

play40:34

dot-product key.

play40:35

Then here, we do a masked_fill, and what this is doing

play40:38

is it's basically clamping the attention between the nodes

play40:44

that are not supposed to communicate to be negative

play40:46

infinity.

play40:47

And we're doing negative infinity

play40:48

because we're about to softmax, and so negative infinity will

play40:51

make the attention at those elements basically zero.

play40:54

And so here we are going to basically end up

play40:56

with the weights, the affinities between these nodes, optional

play41:03

dropout.

play41:03

And then here, attention matrix multiply v is basically

play41:08

the gathering of the information according to the affinities

play41:10

we calculated.

play41:11

And this is just a weighted sum of the values

play41:14

at all those nodes.

play41:15

So this matrix multiplies is doing that weighted sum.

play41:19

And then transpose contiguous view

play41:20

because it's all complicated and batched

play41:22

in five-dimensional tensors, but it's really not

play41:24

doing anything, optional drop out,

play41:26

and then a linear projection back to the residual pathway.

play41:30

So this is implementing the communication phase here.
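
Putting those steps together, a sketch of causal self-attention could look like this (assuming PyTorch; it follows the queries/keys/values, masking, softmax, and weighted-sum steps described above, but it is not the exact nanoGPT code):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd, n_head, block_size=8, dropout=0.0):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)  # queries, keys, values in one projection
        self.proj = nn.Linear(n_embd, n_embd)     # back to the residual pathway
        self.dropout = nn.Dropout(dropout)
        # lower-triangular mask: node t may only attend to nodes <= t
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape                                  # batch, time, channels
        q, k, v = self.qkv(x).split(C, dim=2)
        hs = C // self.n_head                              # head size
        # reshape to (B, n_head, T, hs) so all heads run in parallel
        q = q.view(B, T, self.n_head, hs).transpose(1, 2)
        k = k.view(B, T, self.n_head, hs).transpose(1, 2)
        v = v.view(B, T, self.n_head, hs).transpose(1, 2)
        # dot products of all queries with all keys, for every batch element and head
        att = (q @ k.transpose(-2, -1)) / math.sqrt(hs)    # (B, n_head, T, T)
        # clamp disallowed (future) connections to -inf so the softmax zeroes them out
        att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)                       # affinities between the nodes
        att = self.dropout(att)                            # optional dropout
        y = att @ v                                        # weighted sum of the values
        y = y.transpose(1, 2).contiguous().view(B, T, C)   # reassemble the heads
        return self.dropout(self.proj(y))                  # linear projection back to the residual pathway
```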

play41:34

Then you can train this transformer.

play41:37

And then you can generate infinite Shakespeare.

play41:41

And you will simply do this by--

play41:43

because our block size is 8, we start with some token--

play41:47

say, like I used in this case, you

play41:50

can use something like a newline as the start token.

play41:53

And then you communicate only to yourself

play41:55

because there's a single node, and you

play41:57

get the probability distribution for the first word

play41:59

in the sequence.

play42:00

And then you decode it for the first character

play42:03

in the sequence.

play42:04

You decode the character.

play42:05

And then you bring back the character,

play42:06

and you re-encode it as an integer.

play42:08

And now, you have the second thing.

play42:10

And so you get--

play42:12

OK, we're at the first position, and this

play42:14

is whatever integer it is, add the positional encodings,

play42:17

goes into the sequence, goes in the transformer,

play42:19

and again, this token now communicates

play42:21

with the first token and its own identity.

play42:26

And so you just keep plugging it back.

play42:28

And once you run out of the block size, which is eight,

play42:31

you have to start cropping, because you can never

play42:33

have a block size of more than eight in the way you've

play42:34

trained this transformer.

play42:35

So we have more and more context until eight.

play42:37

And then if you want to generate beyond eight,

play42:39

you have to start cropping because the transformer only

play42:41

works for eight elements in time dimension.
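
A sketch of that sampling loop, assuming a `model` that maps an integer tensor of shape (1, T) to logits of shape (1, T, vocab_size); the cropping to `block_size` is the key detail:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size=8):
    # idx is (1, T0), e.g. a single newline character used as the start token
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]         # crop: the model only works for block_size positions
        logits = model(idx_cond)                # (1, T, vocab_size)
        logits = logits[:, -1, :]               # distribution for the next character
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)  # plug it back in and continue
    return idx
```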

play42:43

And so all of these transformers in the [INAUDIBLE] setting

play42:47

have a finite block size or context length,

play42:50

and in typical models, this will be 1,024 tokens or 2,048

play42:54

tokens, something like that.

play42:56

But these tokens are usually like BPE tokens,

play42:58

or SentencePiece tokens, or WordPiece tokens.

play43:00

There's many different encodings.

play43:02

So it's not like that long.

play43:03

And so that's why, I think, [INAUDIBLE]..

play43:05

We really want to expand the context size,

play43:06

and it gets gnarly because the attention

play43:08

is quadratic in the [INAUDIBLE] case.

play43:11

Now, if you want to implement encoder instead of decoder

play43:16

attention, then all you have to do

play43:21

is this [INAUDIBLE],

play43:21

and you just delete that line.

play43:23

So if you don't mask the attention,

play43:25

then all the nodes communicate to each other,

play43:27

and everything is allowed, and information

play43:29

flows between all the nodes.

play43:31

So if you want to have the encoder here, just delete.

play43:35

All the encoder blocks will use attention

play43:38

where this line is deleted.

play43:39

That's it.
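
As a tiny illustration of the difference (a sketch, not any particular library's API): encoder-style attention is the same computation with the masking line removed.

```python
import torch
import torch.nn.functional as F

def bidirectional_attention(q, k, v):
    # encoder-style: no causal mask, so all nodes attend to all nodes
    att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)
    # (the decoder version would masked_fill the future positions with -inf here)
    return F.softmax(att, dim=-1) @ v
```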

play43:40

So you're allowing whatever-- this encoder might store say,

play43:44

10 tokens, 10 nodes, and they are all

play43:46

allowed to communicate to each other going up the transformer.

play43:51

And then if you want to implement cross-attention,

play43:53

so you have a full encoder-decoder transformer,

play43:55

not just a decoder-only transformer or a GPT.

play43:59

Then we need to also add cross-attention in the middle.

play44:03

So here, there is a self-attention piece where all

play44:05

the--

play44:06

there's a self-attention piece, a cross-attention piece,

play44:08

and this MLP.

play44:09

And in the cross-attention, we need

play44:12

to take the features from the top of the encoder.

play44:14

We need to add one more line here,

play44:16

and this would be the cross-attention instead of a--

play44:20

I should have implemented it instead of just pointing,

play44:22

I think.

play44:23

But there will be a cross-attention line here.

play44:25

So we'll have three lines because we

play44:26

need to add another block.

play44:28

And the queries will come from x but the keys

play44:31

and the values will come from the top of the encoder.

play44:35

And there will be basic code information

play44:36

flowing from the encoder, strictly

play44:38

to all the nodes inside x.

play44:41

And then that's it.

play44:42

So these are very simple modifications

play44:44

on the decoder attention.
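
A sketch of that cross-attention step, assuming PyTorch and a single head for brevity (module and argument names are illustrative): queries come from the decoder's x, keys and values come from the top of the encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.q = nn.Linear(n_embd, n_embd)       # queries from the decoder side
        self.kv = nn.Linear(n_embd, 2 * n_embd)  # keys and values from the encoder side
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x, enc_out):
        # x: (B, T_dec, C) decoder nodes; enc_out: (B, T_enc, C) top of the encoder
        q = self.q(x)
        k, v = self.kv(enc_out).split(x.size(-1), dim=2)
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)  # (B, T_dec, T_enc)
        att = F.softmax(att, dim=-1)  # no causal mask: every decoder node may see every encoder node
        return self.proj(att @ v)     # encoder information gathered into the decoder's residual pathway
```

In the decoder block, this would sit as one more residual line between the self-attention and the MLP.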

play44:47

So you'll hear people talk about how you can have

play44:49

a decoder-only model like GPT.

play44:51

You can have an encoder-only model like BERT,

play44:53

or you can have an encoder-decoder model

play44:55

like say T5, doing things like machine translation.

play44:59

And in BERT, you can't train it using this language modeling

play45:04

setup that's autoregressive, where you're just

play45:06

trying to predict the next [INAUDIBLE] in the sequence.

play45:07

You're training it doing slightly different objectives.

play45:09

You're putting in the full sentence,

play45:12

and the full sentence is allowed to communicate fully.

play45:14

And then you're trying to classify sentiment or something

play45:16

like that.

play45:18

So you're not trying to model the next token in the sequence.

play45:21

So these are trained slightly differently,

play45:26

using masking and other denoising techniques.

play45:31

OK.

play45:32

So that's like the transformer.

play45:34

I'm going to continue.

play45:36

So yeah, maybe more questions.

play45:38

[INAUDIBLE]

play46:01

This is like we are enforcing these constraints on it

play46:06

by just masking [INAUDIBLE]

play46:12

So I'm not sure if I fully follow.

play46:14

So there's different ways to look at this analogy,

play46:16

but one analogy is you can interpret

play46:18

this graph as really fixed.

play46:20

It's just that every time we do the communicate,

play46:22

we are using different weights.

play46:23

You can look at it that way.

play46:24

So if we have block size of eight in my example,

play46:26

we would have eight nodes.

play46:27

Here we have 2, 4, 6.

play46:29

OK, so we'd have eight nodes.

play46:30

They would be connected in--

play46:33

you lay them out, and you only connect from left to right.

play46:35

[INAUDIBLE]

play46:42

Why would they connect-- usually,

play46:44

the connections don't change as a function of the data

play46:46

or something like that--

play46:47

[INAUDIBLE]

play47:00

I don't think I've seen a single example where

play47:02

the connectivity changes dynamically

play47:03

as a function of the data.

play47:04

Usually, the connectivity is fixed.

play47:05

If you have an encoder, and you're training a BERT,

play47:07

you have how many tokens you want,

play47:09

and they are fully connected.

play47:11

And if you have a decoder-only model,

play47:13

you have this triangular thing, and if you

play47:15

have encoder-decoder, then you have

play47:16

awkwardly two pools of nodes.

play47:21

Yeah.

play47:24

Go ahead.

play47:25

[INAUDIBLE] I wonder, you know much more about this

play47:45

than I know.

play47:46

But do you have a sense of like if you ran [INAUDIBLE]

play48:00

In my head, I'm thinking [INAUDIBLE] but then you also

play48:08

have different things for one or more of [INAUDIBLE]----

play48:13

Yeah, it's really hard to say, so that's

play48:15

why I think this paper is so interesting because like, yeah,

play48:17

usually, you'd see the path of how they got there,

play48:18

and maybe they had that path internally.

play48:19

They just didn't publish it.

play48:20

All you can see is things that didn't look like a transformer.

play48:23

I mean, you have ResNets, which have lots of this.

play48:26

But a ResNet would be like this, but there's

play48:29

no self-attention component.

play48:31

But the MLP is there kind of in a ResNet.

play48:35

So a ResNet looks very much like this

play48:37

except there's no-- you can use layer norms in ResNets,

play48:40

I believe, as well.

play48:41

Typically, sometimes, they can be batch norms.

play48:43

So it is kind of like a ResNet.

play48:45

It is like they took a ResNet, and they

play48:47

put in a self-attention block in addition

play48:50

to the preexisting MLP block, which

play48:52

is kind of like convolutions.

play48:53

And the MLP is, strictly speaking, like a

play48:55

one-by-one convolution, but I think

play48:59

the idea is similar in that the MLP is just a typical weights,

play49:04

nonlinearity, weights operation.

play49:11

But I will say, yeah, this is kind of interesting

play49:13

because a lot of work is not there,

play49:15

and then they give you this transformer.

play49:17

And then it turns out 5 years later,

play49:18

it's not changed, even though everyone's trying to change it.

play49:20

So it's interesting to me that it arrived

play49:23

like a complete package, which I think is really

play49:25

interesting historically.

play49:26

And I also talked to paper authors,

play49:30

and they were unaware of the impact

play49:32

that the transformer would have at the time.

play49:33

So when you read this paper, actually, it's unfortunate

play49:37

because this is the paper that changed everything,

play49:39

but when people read it, it's like question marks

play49:41

because it reads like a pretty random machine translation

play49:45

paper.

play49:46

It's like, oh, we're doing machine translation.

play49:47

Oh, here's a cool architecture.

play49:48

OK, great, good results.

play49:51

It doesn't know what's going to happen.

play49:53

[LAUGHS] And so when people read it today,

play49:56

I think they're confused potentially.

play50:00

I will have some tweets at the end,

play50:02

but I think I would have renamed it

play50:03

with the benefit of hindsight of like, well, I'll get to it.

play50:08

[INAUDIBLE]

play50:20

Yeah, I think that's a good question as well.

play50:22

Currently, I mean, I certainly don't

play50:24

love the autoregressive modeling approach.

play50:27

I think it's kind of weird to sample a token

play50:29

and then commit to it.

play50:31

So maybe there are some ways, some hybrids

play50:36

with diffusion, as an example, which

play50:38

I think would be really cool, or we'll

play50:41

find some other ways to edit the sequences later but still

play50:44

in an autoregressive framework.

play50:47

But I think diffusion is like an up-and-coming modeling

play50:49

approach that I personally find much more appealing.

play50:51

When I sample text, I don't go chunk, chunk, chunk,

play50:54

and commit.

play50:55

I do a draft one, and then I do a better draft two.

play50:58

And that feels like a diffusion process.

play51:00

So that would be my hope.

play51:05

OK, also a question.

play51:07

So yeah, you'd think the [INAUDIBLE]

play51:20

And then once we have the edge weights,

play51:21

we just have to multiply it by the values,

play51:23

and then you just [INAUDIBLE] it.

play51:25

Yes, yeah, it's right.

play51:27

And you think there's an analogy with graph neural networks

play51:30

and they'll potentially--

play51:32

I find the graph neural networks like a confusing term

play51:34

because, I mean, yeah, previously,

play51:38

there, was this notion of--

play51:40

I feel like maybe today everything is a graph neural

play51:42

network because a transformer is a graph neural network

play51:44

processor.

play51:45

The native representation that the transformer operates over

play51:48

is sets that are connected by edges in a directed way.

play51:51

And so that's the native representation, and then, yeah.

play51:55

OK, I should go on because I still have 30 slides.

play51:57

[INAUDIBLE]

play52:08

Oh yeah, yeah, the square root of d. I think it's basically

play52:11

like, if you're initializing with random weights

play52:14

drawn from a [INAUDIBLE], as your dimension size grows,

play52:17

so do your values; the variance grows.

play52:19

And then your softmax will just become a one-hot vector.

play52:23

So it's just a way to control the variance

play52:25

and bring it to always be in a good range for softmax

play52:28

and a nice diffuse distribution.

play52:31

OK, so it's almost like an initialization thing.
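
A quick numerical illustration of that point (a sketch, assuming PyTorch): with unit-variance random vectors, the raw dot products have variance that grows with the dimension, so the softmax saturates; dividing by the square root of the dimension keeps it diffuse.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 512
q = torch.randn(d)
k = torch.randn(8, d)              # eight keys

raw = k @ q                        # variance ~ d, large magnitudes
scaled = raw / d ** 0.5            # variance ~ 1

print(F.softmax(raw, dim=-1))      # essentially one-hot
print(F.softmax(scaled, dim=-1))   # nicely spread over the eight keys
```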

play52:37

OK, so transformers have been applied

play52:41

to all the other fields, and the way this was done

play52:44

is, in my opinion, in ridiculous ways,

play52:46

honestly because I was a computer vision person,

play52:49

and you have ConvNets, and they make sense.

play52:51

So what we're doing now with ViTs as an example is

play52:53

you take an image and you chop it up into little squares.

play52:56

And then those squares, literally,

play52:57

feed into a transformer, and that's

play52:59

it, which is kind of ridiculous.

play53:01

And so, I mean, yeah, and so the transformer

play53:06

doesn't even, in the simplest case, really know where

play53:08

these patches might come from.

play53:10

They are usually positionally encoded,

play53:12

but it has to rediscover a lot of the structure,

play53:16

I think, of them in some ways.

play53:19

And it's kind of weird to approach it that way.

play53:23

But it's just the simplest baseline

play53:25

of just chopping up big images into small squares

play53:27

and feeding them in as the individual nodes actually

play53:29

works fairly well.

play53:30

And then this is in a transformer encoder,

play53:32

so all the patches are talking to each other

play53:34

throughout the entire transformer.

play53:36

And the number of nodes here would be like nine.
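
A sketch of that patching step, assuming PyTorch (sizes are chosen just so a 48x48 image with 16x16 patches gives nine tokens, matching the example):

```python
import torch
import torch.nn as nn

B, C, H, W, P, n_embd = 1, 3, 48, 48, 16, 128
img = torch.randn(B, C, H, W)

unfold = nn.Unfold(kernel_size=P, stride=P)      # cut out non-overlapping P x P squares
patches = unfold(img).transpose(1, 2)            # (B, 9, C*P*P): nine flattened patches
tokens = nn.Linear(C * P * P, n_embd)(patches)   # (B, 9, n_embd): nine nodes for the encoder
```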

play53:42

Also, in speech recognition, you just take your mel spectrogram,

play53:44

and you chop it up into slices and you feed them

play53:46

into a transformer.

play53:47

So there was paper like this, but also Whisper.

play53:49

Whisper is a copy-paste transformer.

play53:51

If you saw Whisper from OpenAI, you just chop up the mel spectrogram

play53:55

and feed it into a transformer and then pretend

play53:57

you're dealing with text.

play53:58

And it works very well.

play54:00

Decision transformer in RL, you take your states, actions,

play54:03

and reward that you experience in environment,

play54:05

and you just pretend it's a language.

play54:07

Then you start to model the sequences of that,

play54:09

and then you can use that for planning later.

play54:11

That works really well.

play54:13

Even things like AlphaFold. So we were briefly

play54:15

talking about molecules and how you can plug them in.

play54:17

So at the heart of AlphaFold, computationally,

play54:19

is also a transformer.

play54:21

One thing I wanted to also say about transformers

play54:23

is I find that they're very flexible,

play54:26

and I really enjoy that.

play54:28

I'll give you an example from Tesla.

play54:31

You have a ConvNet that takes an image

play54:32

and makes predictions about the image.

play54:34

And then the big question is, how do you

play54:35

feed in extra information?

play54:37

And it's not always trivial like say, I

play54:38

had additional information that I

play54:40

want to inform that I want the outputs to be informed by.

play54:43

Maybe I have other sensors like Radar.

play54:45

Maybe I have some map information, or a vehicle type,

play54:47

or some audio.

play54:48

And the question is, how do you feed information into a ConvNet?

play54:50

Like where do you feed it in?

play54:52

Do you concatenate it?

play54:54

Do you add it?

play54:55

At what stage?

play54:56

And so with a transformer, it's much easier

play54:58

because you just take whatever you want, you chop it

play55:00

up into pieces, and you feed it in with a set

play55:02

of what you had before.

play55:03

And you let the self-attention figure out

play55:04

how everything should communicate.

play55:06

And that actually apparently works.

play55:07

So just chop up everything and throw it into the mix

play55:10

is like the way.

play55:11

And it frees neural nets from this burden

play55:15

of Euclidean space, where previously you

play55:19

had to arrange your computation to conform to the Euclidean

play55:21

space or three dimensions of how you're laying out the compute.

play55:25

Like the compute actually kind of

play55:26

happens in almost like 3D space if you think about it.

play55:29

But in attention, everything is just sets.

play55:32

So it's a very flexible framework,

play55:33

and you can just throw in stuff into your conditioning set.

play55:35

And everything just gets self-attended over.

play55:37

So it's quite beautiful from that perspective.

play55:39

OK, so now what exactly makes transformers so effective?

play55:43

I think a good example of this comes

play55:44

from the GPT-3 paper, which I encourage people to read.

play55:48

Language Models are Few-Shot Learners.

play55:50

I would have probably renamed this a little bit.

play55:52

I would have said something like transformers

play55:54

are capable of in-context learning or meta-learning.

play55:57

That's like what makes them really special.

play56:00

So basically the setting that they're working with

play56:02

is, OK, I have some context, and I'm

play56:03

trying-- like say, a passage.

play56:04

This is just one example of many.

play56:06

I have a passage, and I'm asking questions about it.

play56:08

And then as part of the context in the prompt,

play56:12

I'm giving the questions and the answers.

play56:14

So I'm giving one example of question-answer,

play56:16

another example of question-answer,

play56:17

another example of question-answer, and so on.

play56:19

And this becomes--

play56:21

Oh yeah, people are going to have to leave soon, huh?

play56:24

OK, is this really important?

play56:25

Let me think.

play56:29

OK, so what's really interesting is basically

play56:31

like with more examples given in a context,

play56:35

the accuracy improves.

play56:37

And so what that suggests is that the transformer

play56:39

is able to somehow learn in the activations

play56:42

without doing any gradient descent

play56:43

in a typical fine-tuning fashion.

play56:45

So if you fine-tune, you have to give an example and the answer,

play56:48

and you fine-tune it, using gradient descent.

play56:51

But it looks like the transformer internally

play56:53

in its activations is doing something

play56:54

that looks potentially like gradient descent, some kind

play56:56

of meta-learning in the activations of the transformer

play56:57

as it is reading the prompt.

play56:59

And so in this paper, they go into, OK,

play57:01

distinguishing this outer loop of stochastic gradient

play57:03

descent from this inner loop of in-context learning.

play57:06

So the inner loop runs as the transformer is reading

play57:08

the sequence, and the outer loop is the training

play57:12

by gradient descent.

play57:14

So basically, there's some training

play57:15

happening in the activations of the transformer

play57:17

as it is consuming a sequence that

play57:18

maybe very much looks like gradient descent.

play57:21

And so there are some recent papers that hint at this

play57:23

and study it.

play57:23

And so as an example, in this paper

play57:25

here, they propose something called the raw operator.

play57:28

And they argue that the raw operator is implemented

play57:32

by a transformer, and then they show

play57:33

that you can implement things like ridge regression

play57:35

on top of the raw operator.

play57:36

And so this is giving--

play57:39

There are papers hinting that maybe there

play57:40

is some thing that looks like gradient-based learning

play57:42

inside the activations of the transformer.

play57:45

And I think this is not impossible to think through

play57:47

because what is gradient-based learning?

play57:49

Forward pass, backward pass, and then update.

play57:52

Oh, that looks like a ResNet, right,

play57:54

because you're adding to the weights.

play57:57

So you start with an initial random set of weights,

play57:59

forward pass, backward pass, and update your weights,

play58:01

and then forward pass, backward pass, update the weights.

play58:04

Looks like a ResNet.

play58:04

A transformer is a ResNet. So, much more hand-wavy,

play58:10

but basically, some papers are trying

play58:11

to hint at why that would be potentially possible.

play58:14

And then I have a bunch of tweets I just copy-pasted here

play58:16

in the end.

play58:18

This was like meant for general consumption,

play58:20

so they're a bit more high-level and hypey a little bit.

play58:22

But I'm talking about why this architecture is so interesting

play58:26

and why potentially it became so popular.

play58:27

And I think it simultaneously optimizes

play58:29

three properties that, I think, are very desirable.

play58:31

Number one, the transformer is very

play58:33

expressive in the forward pass.

play58:35

It sort of like it's able to implement

play58:37

very interesting functions, potentially functions

play58:39

that can even do meta-learning.

play58:41

Number two, it is very optimizable thanks

play58:43

to things like residual connections, layer norms,

play58:45

and so on.

play58:45

And number three, it's extremely efficient.

play58:47

This is not always appreciated, but the transformer,

play58:49

if you look at the computational graph,

play58:51

is a shallow, wide network, which

play58:53

is perfect to take advantage of the parallelism of GPUs.

play58:56

So I think the transformer was designed very deliberately

play58:58

to run efficiently on GPUs.

play59:00

There's previous work like neural GPU

play59:02

that I really enjoy as well, which is really just

play59:05

like how do we design neural nets that are efficient on GPUs

play59:08

and thinking backwards from the constraints of the hardware,

play59:10

which I think is a very interesting way

play59:11

to think about it.

play59:17

Oh yeah, so here, I'm saying, I probably would have called--

play59:21

I probably would've called the transformer a general purpose

play59:24

efficient optimizable computer instead of attention

play59:27

is all you need.

play59:28

That's what I would have maybe in hindsight called that paper.

play59:31

It's proposing a model that is very general purpose, so

play59:37

its forward pass is expressive.

play59:38

It's very efficient in terms of GPU usage

play59:40

and is easily optimizable by gradient descent and trains

play59:44

very nicely.

play59:46

And then I have some other hype tweets here.

play59:51

Anyway, so you can read them later.

play59:53

But I think this one is maybe interesting.

play59:55

So if previous neural nets are special purpose computers

play59:58

designed for a specific task, GPT

play60:00

is a general purpose computer, reconfigurable at runtime

play60:03

to run natural language programs.

play60:06

So the programs are given as prompts,

play60:08

and then GPT runs the program by completing the document.

play60:12

So I really like these analogies personally to computer.

play60:16

It's just like a powerful computer,

play60:18

and it's optimizable by gradient descent.

play60:22

And I don't know--

play60:30

OK, yeah.

play60:31

That's it.

play60:31

[LAUGHTER]

play60:33

You can read the tweets later, but that's for now.

play60:35

I'll just thank you.

play60:36

I'll just leave this up.

play60:45

Sorry, I just found this tweet.

play60:46

So turns out that if you scale up the training set

play60:49

and use a powerful enough neural net like a transformer,

play60:51

the network becomes a kind of general purpose

play60:53

computer over text.

play60:54

So I think that's nice way to look at it.

play60:56

And instead of performing a single fixed text sequence,

play60:58

you can design the sequence in the prompt.

play61:00

And because the transformer is both powerful

play61:02

but also is trained on a large enough, very hard data set,

play61:05

it becomes this general purpose text computer.

play61:07

And so I think that's kind of interesting way to look at it.

play61:11

Yeah.

play61:13

[INAUDIBLE]

play62:01

And I guess my question is [INAUDIBLE] how

play62:04

much do you think [INAUDIBLE]?

play62:10

really because it's mostly more efficient or [INAUDIBLE]

play62:25

So I think there's a bit of that.

play62:27

Yeah, so I would say RNNs in principle,

play62:29

yes, they can implement arbitrary programs.

play62:31

I think, it's like a useless statement to some extent

play62:33

because they're probably--

play62:35

I'm not sure that they're probably expressive

play62:37

because in a sense of power and that they can implement

play62:40

these arbitrary functions.

play62:43

But they're not optimizable.

play62:44

And they're certainly not efficient because they

play62:46

are serial computing devices.

play62:50

So if you look at it as a compute graph,

play62:51

RNNs are very long, thin compute graph.

play62:58

What if you stretched out the neurons and you looked--

play63:00

like take all the individual neurons and their interconnectivity,

play63:02

and stretch them out, and try to visualize them.

play63:04

RNNs would be like a very long graph and that's bad.

play63:07

And it's bad also for optimizability

play63:08

because I don't exactly know why,

play63:10

but just the rough intuition is when you're backpropagating,

play63:13

you don't want to make too many steps.

play63:15

And so transformers are a shallow wide graph, and so

play63:19

from supervision to inputs is a very small number of hops.

play63:23

And there are long residual pathways,

play63:25

which make gradients flow very easily.

play63:26

And there's all these layer norms

play63:28

to control the scales of all of those activations.

play63:32

And so there's not too many hops,

play63:34

and you're going from supervision to input

play63:36

very quickly and just flows through the graph.

play63:40

And it can all be done in parallel,

play63:42

so you don't need to do this--

play63:43

encoder and decoder RNNs, you have to go from first word,

play63:46

then second word, then third word.

play63:47

But here in the transformer, every single word

play63:49

is processed completely in parallel, which is kind of a--

play63:54

So I think all of these are really important--

play63:57

all three of these properties.

play63:57

And I think number 3 is less talked about but extremely

play64:00

important because in deep learning scale matters.

play64:03

And so the size of the network that you can train

play64:06

is extremely important.

play64:08

And so if it's efficient on the current hardware,

play64:10

then you can make it bigger.

play64:14

You mentioned that if you do it with multiple modalities

play64:17

of data, [INAUDIBLE].

play64:21

How does that actually work?

play64:22

Do you leave the different data as different token,

play64:26

or is it [INAUDIBLE]?

play64:29

No, so yeah, so you take your image,

play64:31

and you just chop it up into patches.

play64:33

So there's the first thousand tokens or whatever.

play64:35

And now, I have a special--

play64:37

so radar could be also, but I don't actually

play64:40

want to make a representation of radar.

play64:43

But you just need to chop it up and enter it.

play64:46

And then you have to encode it somehow.

play64:47

Like the transformer needs to know

play64:48

that they're coming from radar.

play64:49

So you create a special--

play64:52

you have some kind of a special token of that to--

play64:55

these radar tokens are what's slightly

play64:57

different in the representation, and it's

play64:58

learnable by gradient descent.

play65:00

And like vehicle information would also

play65:03

come in with a special embedded token that can be learned.

play65:07

So--

play65:09

So how do you align those before really--

play65:11

Actually, but you don't.

play65:12

It's all just a set.

play65:13

And there's--

play65:14

Even the [INAUDIBLE]

play65:18

Yeah, it's all just a set, but you can positionally

play65:20

encode these sets if you want.

play65:23

So positional encoding means you can

play65:26

hardwire, for example, the coordinates

play65:28

like using [INAUDIBLE].

play65:29

You can hardwire that, but it's better

play65:31

if you don't hardwire the position.

play65:33

It's just a vector that is always

play65:34

hanging out at this location.

play65:35

Whatever content is there, it just adds on it.

play65:37

And this vector is trainable by backprop.

play65:39

That's how you do it.
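
A sketch of that idea, assuming PyTorch (everything here, including the token counts and the "type" embedding for mixing modalities, is illustrative rather than any specific system's implementation):

```python
import torch
import torch.nn as nn

n_embd, T_img, T_radar = 128, 9, 4

pos_emb = nn.Parameter(torch.zeros(T_img + T_radar, n_embd))  # a vector hanging out at each location, trainable by backprop
type_emb = nn.Embedding(2, n_embd)                            # 0 = image patch, 1 = radar slice

img_tokens = torch.randn(1, T_img, n_embd)
radar_tokens = torch.randn(1, T_radar, n_embd)

x = torch.cat([img_tokens, radar_tokens], dim=1)              # just throw everything into the set
types = torch.tensor([0] * T_img + [1] * T_radar)
x = x + pos_emb.unsqueeze(0) + type_emb(types).unsqueeze(0)   # content + position + modality
# x then feeds into the transformer, and self-attention figures out the communication
```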

play65:43

Good point.

play65:43

I don't really like the [INAUDIBLE]..

play65:48

They seem to work, but it seems like they're sometimes

play65:51

[INAUDIBLE]

play66:08

I'm not sure if I understand your question.

play66:10

[LAUGHTER]

play66:11

So I mean the positional encoders

play66:12

like they're actually like not--

play66:14

OK, so they have very little inductive bias or something

play66:16

like that.

play66:17

They're just vectors hanging out in location always,

play66:19

and you're trying to help the network in some way.

play66:23

And I think the intuition is good,

play66:28

but if you have enough data, usually,

play66:30

trying to mess with it is a bad thing.

play66:33

Trying to inject knowledge when you

play66:35

have enough knowledge in the data

play66:36

set itself is not usually productive.

play66:38

So it all really depends on what scale you want.

play66:40

If you have infinity data, then you actually

play66:41

want to encode less and less.

play66:43

That turns out to work better.

play66:44

And if you have very little data, then actually, you do

play66:46

want to encode some biases.

play66:47

And maybe if you have a much smaller data set, then

play66:49

maybe convolutions are a good idea

play66:50

because you actually have this bias coming from your filters.

play66:55

But I think-- so the transformer is extremely general,

play66:58

but there are ways to mess with the encodings

play67:01

to put in more structure.

play67:02

Like you could, for example, encode [INAUDIBLE] and fix it,

play67:05

or you could actually go to the attention mechanism

play67:07

and say, OK, if my image is chopped up into patches,

play67:10

this patch can only communicate to this neighborhood.

play67:13

And you just do that in the attention matrix,

play67:15

you just mask out whatever you don't want to communicate.

play67:18

And so people really play with this

play67:19

because the full attention is inefficient.

play67:22

So they will intersperse, for example, layers

play67:25

that only communicate in little patches

play67:26

and then layers that communicate globally.

play67:28

And they will do all kinds of tricks like that.

play67:30

So you can slowly bring in more inductive bias.

play67:33

You would do it, but the inductive biases

play67:35

are like they're factored out from the core transformer.

play67:38

They are factored out into the interconnectivity

play67:41

of the nodes.

play67:42

And they are factored out in the positional encodings--

play67:44

and you can mess with this for computation.

play67:49

[INAUDIBLE]

play68:02

So there's probably about 200 papers on this now if not more.

play68:06

They're kind of hard to keep track of.

play68:07

Honestly, like my Safari browser, which is-- oh,

play68:10

it's all up on my computer, like 200 open tabs.

play68:13

But yes, I'm not even sure if I want

play68:20

to pick my favorite honestly.

play68:23

Yeah, [INAUDIBLE]

play68:42

Maybe you can use a transformer like that [INAUDIBLE]

play68:45

The other one that I actually like even more

play68:46

is potentially, keep the context length fixed

play68:49

but allow the network to somehow use a scratch pad.

play68:53

And so the way this works is you will teach the transformer

play68:55

somehow via examples in [INAUDIBLE] hey,

play68:57

you actually have a scratch pad.

play69:00

Basically, you can't remember too much.

play69:01

Your context length is finite.

play69:02

But you can use a scratch pad.

play69:04

And you do that by emitting a start scratch pad,

play69:06

and then writing whatever you want to remember, and then

play69:08

end scratch pad.

play69:10

And then you continue with whatever you want.

play69:12

And then later when it's decoding,

play69:14

you actually have special logic

play69:15

that when you detect start scratch pad,

play69:18

you will like save whatever it puts

play69:19

in there in some external store and allow it to attend over it.

play69:22

So basically, you can teach the transformer just dynamically

play69:25

because it's such a good meta-learner.

play69:27

You can teach it dynamically to use other gizmos and gadgets

play69:30

and allow it to expand its memory that way

play69:31

if that makes sense.

play69:32

It's just like a human learning to use a notepad, right.

play69:35

You don't have to keep it in your brain.

play69:37

So keeping things in your brain is like the context length

play69:39

of the transformer.

play69:39

But maybe we can just give it a notebook.

play69:42

And then it can query the notebook, and read from it,

play69:45

and write to it.

play69:46

[INAUDIBLE] transformer to plug in another transformer.

play69:48

[LAUGHTER]

play69:53

[INAUDIBLE]

play70:09

I don't know if I detected that.

play70:10

I feel like-- did you feel like there was more than just

play70:12

a long prompt that's unfolding?

play70:14

Yeah, [INAUDIBLE]

play70:19

I didn't try extensively, but I did see a [INAUDIBLE] event.

play70:22

And I felt like the block size was just moved.

play70:28

Maybe I'm wrong.

play70:28

I don't actually know about the internals of ChatGPT.

play70:31

We have two online questions.

play70:33

So one question is, "what do you think about architecture

play70:35

[INAUDIBLE]?"

play70:38

S4?

play70:39

S4.

play70:40

I'm sorry.

play70:41

I don't know S4.

play70:42

Which one is this one?

play70:45

The second question, this one's a personal question.

play70:47

"What are you going to work on next?"

play70:49

[INAUDIBLE]

play70:51

I mean, so right now, I'm working on things like nanoGPT.

play70:53

Where is nanoGPT?

play70:58

I mean, I'm going basically slightly from computer vision

play71:01

and like computer vision-based products, do

play71:03

a little bit in language domain.

play71:05

Where's ChatGPT?

play71:06

OK, nanoGPT.

play71:07

So originally, I had minGPT, which I rewrote to nanoGPT.

play71:10

And I'm working on this.

play71:11

I'm trying to reproduce GPTs, and I mean,

play71:14

I think something like ChatGPT, I think,

play71:16

incrementally improved in a product fashion

play71:17

would be extremely interesting.

play71:19

And I think a lot of people feel it,

play71:23

and that's why it went so wide.

play71:24

So I think there's something like a Google plus

play71:28

plus plus to build that I think is more interesting.

play71:31

Shall we give our speaker a round of applause?
