Stanford CS25: V2 I Introduction to Transformers w/ Andrej Karpathy
Summary
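TLDR This lecture introduces Transformers, the deep learning models that have had a revolutionary impact on natural language processing, computer vision, reinforcement learning, and many other fields. It is presented by instructors at Stanford University, who cover the basics of Transformers, the self-attention mechanism, and how the models are applied across research areas. It also looks at where Transformers may go next, including video understanding and generation and applications in finance and business, and at how better controllability and lower computational complexity could improve the models.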
Takeaways
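- 📚 CS 25 Transformers United V2 is a deep learning course held at Stanford University in winter 2023, focused on transformers, which have revolutionized AI and other fields.
- 🤖 One of the instructors currently leads AI at a robotics startup, with research interests in reinforcement learning, computer vision, and modeling.
- 🎓 Another instructor is a computer science PhD student at Stanford who mainly works on natural language processing and computer vision.
- 🚀 Since being introduced by Vaswani et al. in 2017, Transformers have been widely applied in natural language processing, computer vision, biology, robotics, and other fields.
- 🌟 The core mechanism of Transformers is self-attention, which lets the model make better use of context when processing a sequence.
- 📈 From 2017 to 2023, the use of transformers in AI kept expanding, especially in generative models (such as GPT and DALL-E) and multimodal tasks.
- 🔍 The course covers how transformers work, how they are applied beyond NLP, and emerging research directions on these topics.
- 🧠 The lecturer suggests that the success of transformers may hint at how the brain works, since the brain is highly homogeneous and uniform across the cortex.
- 🔑 The lecture emphasizes the flexibility of transformers: they can easily combine information from different sources (such as images, audio, and text) for processing.
- 🌐 The instructors discuss future directions for transformers, including video understanding and generation, finance and business applications, and domain-specific models (such as DoctorGPT and LawyerGPT).
- 💡 The instructors raise several key challenges for transformers: better long-sequence modeling, lower computational complexity, greater controllability, and alignment with how the human brain works.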
Q & A
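Which university offered the CS 25 Transformers United V2 course?
-CS 25 Transformers United V2 was offered at Stanford University.
What does the course mainly teach?
-The course mainly covers Transformers, a family of deep learning models, and their applications in natural language processing, computer vision, reinforcement learning, biology, robotics, and other fields.
Which paper first introduced Transformers?
-Transformers were first introduced in the 2017 paper by Vaswani et al.
Where are Transformers applied outside natural language processing (NLP)?
-Outside NLP, Transformers are applied in computer vision, reinforcement learning, biology, robotics, and other fields.
What problems do RNNs and LSTMs have with long sequences, according to the lecture?
-RNNs and LSTMs could not effectively encode long sequences and were poor at encoding context.
What advantages do Transformers have in handling context?
-Transformers are better at understanding the context of a text, and they are more accurate at content-based and context-based prediction.
What are Codex, GPT, and DALL-E, as mentioned in the lecture?
-Codex, GPT, and DALL-E are examples of Transformer models with important applications in generative modeling, such as code generation, text generation, and image generation.
How is ChatGPT trained, according to the lecture?
-ChatGPT is trained using reinforcement learning and human feedback to improve its performance.
What are possible future directions for Transformers?
-Possible future directions include video understanding and generation, finance and business applications, long-sequence modeling, multitask and multi-input prediction, and domain-specific models.
Which properties of the Transformer make it so effective in AI, according to the lecture?
-The Transformer is effective because it is very expressive in the forward pass, easy to optimize, and, thanks to its shallow and wide structure, well suited to parallel processing on GPUs, which makes it very efficient.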
Outlines
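🎓 Course introduction and background
This section introduces the background and content of CS 25 Transformers United V2, which was held at Stanford University in winter 2023. The course is not about robots that transform; it is about Transformers, the deep learning models that have had a revolutionary impact on natural language processing, computer vision, reinforcement learning, and other fields. A series of videos shows how experts in different fields apply Transformers in their research. The section also covers the instructors' backgrounds and research interests, along with the course goals and an overview of its content.
🤖 Robotics and the instructors' backgrounds
This part covers the instructor introductions. One instructor leads AI at a robotics startup working on general-purpose robots, and describes a passion for robotics, research interests in reinforcement learning and computer vision, and publications in robotics and autonomous driving. Another instructor shares hobbies such as music, martial arts, and K-dramas.
🚀 Course content and future outlook
This section outlines the course content and the outlook for the technology. It covers the basics of Transformers, including the self-attention mechanism, and points ahead to models such as BERT and GPT. It also predicts future research directions, including video understanding and generation and finance and business applications, and names long-sequence modeling as a problem that must be solved. It further discusses improving generalization and controllability and bringing models closer to how the human brain works.
📚 Historical review and the origins of Transformers
This part reviews the history of machine learning and AI, especially the origins and development of Transformers: from early feature descriptors and support vector machines, to neural networks for image classification and machine translation, to the introduction and spread of Transformers. It discusses how Transformers simplified research across fields and may hint that we are getting closer to how the brain works.
🧠 From neural networks to Transformers
This section traces the evolution from early neural networks to Transformers: from the simple neural network language model of 2003, to the sequence-to-sequence model of 2014, to the 2017 paper Attention is All You Need. It explains how the attention mechanism was proposed and how the Transformer architecture took shape. It also recounts an email exchange with Dzmitry, the inventor of the attention mechanism, which reveals the inspiration and history behind it.
🌟 Attention, the core of Transformers
This part takes a closer look at how attention works in Transformers and why it matters. Using the concepts of multi-head attention and self-attention, it explains how nodes in a Transformer communicate and pass information. It also covers why positional encoding is needed and how Transformers handle different data types (such as text and images). Finally, an implementation of a simplified Transformer, nanoGPT, illustrates the flexibility and power of Transformers.
📈 A worked example of training and text generation
Using a concrete example, the Tiny Shakespeare dataset, this section shows how to generate text with a Transformer model. It covers data preprocessing, batching, model training, and generation. It discusses how positional encoding and self-attention let the model understand and generate text sequences, how special tokens and masks control generation, and how the context length can be extended through consecutive inputs and outputs.
🤔 Reflections on Transformers and future directions
This part reflects on the efficiency, expressiveness, and optimizability of Transformers. It discusses Transformers as a general-purpose, efficient, optimizable computer, and their applications in different fields. It also mentions future developments, including improving performance through longer context and external memory, and closes with some ideas about potential improvements.
🎉 Closing and thanks
This section wraps up the lecture, thanks the audience, and briefly looks ahead. It mentions the ongoing nanoGPT project and an interest in building more advanced language model products. It ends by thanking the audience and inviting applause for the speaker.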
Keywords
💡Transformers
💡Self-Attention
💡BERT
💡GPT
💡Encoder-Decoder Architecture
💡Positional Encoding
💡Multi-Head Attention
💡Residual Connection
💡Layer Normalization
💡Few-Shot Learning
💡Meta-Learning
Highlights
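CS 25 Transformers United V2 was held at Stanford University in winter 2023 and covers transformers, the deep learning models, and their revolutionary applications in AI and other fields.
Starting from natural language processing, transformers have been applied to computer vision, reinforcement learning, biology, robotics, and more.
The course introduces the basics of transformers, including the self-attention mechanism, and takes a deeper look at models such as BERT and GPT.
The 2017 paper "Attention is All You Need" marked the birth of transformers, proposing a new architecture based on the attention mechanism.
The arrival of transformers set off a revolution in AI, changing research and applications from NLP to computer vision.
Transformers handle long sequences well, addressing problems that earlier models such as RNNs and LSTMs could not handle effectively.
The attention mechanism makes transformers strong at context prediction, for example finding the noun that a particular word refers to.
2021 marked the start of the generative era, with models such as Codex, GPT, and DALL-E making major progress on generative tasks.
Future developments for transformers include video understanding and generation, finance and business applications, and tackling long-sequence modeling and multimodal tasks.
One important research direction is improving controllability, for example by making model outputs more stable.
Another research direction is aligning with how the human brain works, to improve the models' understanding and reasoning.
The transformer architecture has proven remarkably resilient over the past five years: many people have tried to improve it, yet the basic architecture is unchanged.
The flexibility of transformers makes it easy to combine data from different modalities, such as mixing image, text, and audio data.
Part of the effectiveness of transformers comes from how efficiently they run on GPUs, which makes it possible to process large datasets and train large models.
Attention in transformers can be seen as message passing on a graph, where the strength of the connections between nodes is set by attention scores.
Self-attention and multi-head attention let the model look for and combine information from different parts in parallel.
The encoder-decoder architecture lets transformers handle sequence-to-sequence tasks such as machine translation and text generation.
Cross-attention lets the decoder use the information captured by the encoder while generating a sequence.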
Transcripts
Hi, everyone.
Welcome to CS 25 Transformers United V2.
This was a course that was held at Stanford
in the winter of 2023.
This course is not about robots that
can transform into cars as this picture might suggest.
Rather, it's about deep learning models
that have taken the world by storm
and have revolutionized the field of AI and others.
Starting from natural language processing,
transformers have been applied all over,
computer vision, reinforcement learning, biology, robotics,
et cetera.
We have an exciting set of videos lined up for you
with some truly fascinating speakers, talks, presenting
how they're applying transformers
to the research in different fields and areas.
We hope you'll enjoy and learn from these videos.
So without any further ado, let's get started.
This is a purely introductory lecture.
And we'll go into the building blocks of transformers.
So first, let's start with introducing the instructors.
So for me, I'm currently on a temporary deferral from the PhD
program, and I'm leading AI at a robotics startup, Collaborative
Robotics, that are working on some general purpose robots,
somewhat like [INAUDIBLE].
And I'm very passionate about robotics and building FSG
learning algorithms.
My research interests are in reinforcement learning,
computer vision, and remodeling, and I
have a bunch of publications in robotics,
autonomous driving, and other areas.
My undergrad was at Cornell.
If someone is from Cornell, so nice to [INAUDIBLE].
So I'm Stephen, currently a first-year CS PhD here.
Previously did my master's at CMU and undergrad at Waterloo.
I'm mainly into NLP research, anything involving language
and text, but more recently, I've
been getting more into computer vision as well as [INAUDIBLE]
And just some stuff I do for fun, a lot of music
stuff, mainly piano.
Some self-promo of what I post a lot on my Insta, YouTube,
and TikTok, so if you guys want to check it out.
My friends and I are also starting a Stanford piano club,
so if anybody's interested, feel free to email
or DM me for details.
Other than that, martial arts, bodybuilding, and huge fan
of k-dramas, anime, and occasional gamer.
[LAUGHS]
OK, cool.
Yeah, so my name is Rylan.
Instead of talking about myself, I just
want to very briefly say that I'm super
excited to take this class.
I took it the last time-- sorry-- to teach this.
Excuse me.
I took it the last time it was offered.
I had a bunch of fun.
I thought we brought in a really great group of speakers
last time.
I'm super excited for this offering.
And yeah, I'm thankful that you're all here,
and I'm looking forward to a really fun quarter together.
Thank you.
Yeah, so fun fact, Rylan was the most outspoken student
last year.
And so if someone wants to become an instructor next year,
you know what to do.
[LAUGHTER]
OK, cool.
Let's see.
OK, I think we have a few minutes.
So what we hope you will learn in this class is, first of all,
how do transformers work, how they
are being applied, just beyond NLP,
and nowadays, like they are pretty [INAUDIBLE]
them everywhere in AI machine learning.
And what are some new and interesting directions
of research in these topics.
Cool, so this class is just an introductory.
So we're just talking about the basics of transformers,
introducing them, talking about the self-attention mechanism
on which they're founded.
And we'll do a deep dive more on models like BERT
to GPT, stuff like that.
So with that, happy to get started.
OK, so let me start with presenting the attention
timeline.
Attention all started with this one paper.
[INAUDIBLE] by Vaswani et al in 2017.
That was the beginning of transformers.
Before that, we had the prehistoric era,
where we had models like RNNs, LSTMs,
and simple attention mechanisms that didn't work
or [INAUDIBLE].
Starting 2017, we saw this explosion of transformers
into NLP, where people started using it for everything.
I even heard this quote from Google.
It's like our performance increased every time
we [INAUDIBLE]
[CHUCKLES]
For the [INAUDIBLE] after 2018 to 2020,
we saw this explosion of transformers
into other fields like vision, a bunch of other stuff,
and like biology as a whole.
And last year, 2021, was the start
of the generative era, where we got a lot of generative modeling,
with models like Codex, GPT, DALL-E,
Stable Diffusion, so a lot of things
happening in generative modeling.
And we started scaling up in AI.
And now, the present.
So this is 2022 and the start of '23.
And now we have models like ChatGPT, Whisper,
a bunch of others.
And we're scaling onwards without splitting up,
so that's great.
So that's the future.
So going more into this, so once there were RNNs.
So we had Seq2Seq models, LSTMs, GRU.
What worked there was that they were good at encoding history,
but what did not work was they didn't encode long sequences
and they were very bad at encoding context.
So consider this example.
Consider trying to predict the last word in the text,
"I grew up in France, dot, dot, dot.
I speak fluent ___."
Here, you need to understand the context for it
to predict French, and the attention mechanism
is very good at that, whereas if you're just using LSTMs,
it doesn't work that well.
Another thing transformers are good at is,
more based on content, is also context prediction
is like finding attention maps.
If I have a word like "it",
what noun does it correlate to?
And we can give a probability attention
on one of the possible activations.
And this works better than existing mechanisms.
OK, so where we were in 2021, we were on the verge of takeoff.
We were starting to realize the potential of transformers
in different fields.
We solved a lot of long sequence problems
like protein folding, AlphaFold, offline RL.
We started to see few-shot, zero-shot generalization.
We saw multimodal tasks and applications
like generating images from language.
So that was DALL-E. And it feels like [INAUDIBLE].
And this was also a talk on transformers
that you can watch on YouTube.
Yeah, cool.
And this is where we were going from 2021 to 2022,
which is we have gone from the version of [INAUDIBLE]
And now, we are seeing unique applications
in audio generation, art, music, storytelling.
We are starting to see these new capabilities
like commonsense, logical reasoning,
mathematical reasoning.
We are also able to now get human alignment
and interaction.
They're able to use reinforcement learning
and human feedback.
That's how ChatGPT is trained to perform really well.
We have a lot of mechanisms for controlling
toxicity, bias, and ethics now.
And there are also a lot
of developments in other areas like diffusion models.
Cool.
So the future is a spaceship, and we are all
excited about it.
And there are a lot more applications
that we can enable, and it'll be great
if you can see transformers also up there.
One big example is video understanding and generation.
That is something that everyone is interested in,
and I'm hoping we'll see a lot of models
in this area this year, also, finance, business.
I'll be very excited to see GPT author a novel,
but we need to solve very long sequence modeling.
And most transformer models are still
limited to 4,000 tokens or something like that.
So we need to make them generalize much
better on long sequences.
We also want to have generalized agents
that can do a lot of multitask, multi-input predictions
like Gato.
And so I think we will see more of that, too.
And finally, we also want domain specific models.
So you might want a GPT model, let's
put it like maybe your health.
So that could be like a DoctorGPT model.
You might have a LawyerGPT model that's
trained on only law data.
So currently, we have GPT models that are trained on everything.
But we might start to see more niche models that
are good at one task.
And we could have a mixture of experts,
so it's like, you can think this is a--
how you'd normally consult an expert,
you'll have expert AI models.
And you can go to a different AI model for your different needs.
There are still a lot of missing ingredients
to make this all successful.
The first is external memory.
We are already starting to see this with the models
like ChatGPT, where the interactions are short-lived.
There's no long-term memory, and they
don't have the ability to remember or store
conversations for the long term.
And this is something you want to fix.
Second is reducing the computational complexity.
So the attention mechanism is quadratic over the sequence
length, which is slow.
And we want to reduce it and make it faster.
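A rough sketch of where the quadratic cost comes from, assuming plain dot-product attention written with NumPy. The function name and sizes are illustrative and not from the lecture. Every token's query is compared with every token's key, so the score matrix has one entry per pair of tokens.

```python
import numpy as np

def attention_score_shape(n_tokens, d_model=16, seed=0):
    """Return the shape of the pairwise score matrix for n_tokens tokens."""
    rng = np.random.default_rng(seed)
    q = rng.normal(size=(n_tokens, d_model))  # one query vector per token
    k = rng.normal(size=(n_tokens, d_model))  # one key vector per token
    scores = q @ k.T                          # every query against every key
    return scores.shape

for n in (128, 256, 512):
    rows, cols = attention_score_shape(n)
    # Doubling the sequence length quadruples the number of scores.
    print(n, rows * cols)
```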
Another thing we want to do is we
want to enhance the controllability of these models
like a lot of these models can be stochastic.
And we want to be able to control what sort of outputs
we get from them.
And you might have experienced this with ChatGPT,
if you just refresh, you get different output each time.
But you might want to have a mechanism that controls
what sort of things you get.
And finally, we want to align our state-of-the-art language
models with how the human brain works.
And we are seeing the surge, but we still
need more research on seeing how they can make more informed.
Thank you.
Great, hi.
Yes, I'm excited to be here.
I live very nearby, so I got the invites to come to class.
And I was like, OK, I'll just walk over.
But then I spent like 10 hours on the slides,
so it wasn't as simple.
So yeah, I'm going to talk about transformers.
I'm going to skip the first two over there.
I'm not going to talk about those.
We'll talk about that one just to simplify the lecture
since we don't have time.
OK, so I wanted to provide a little bit of context
on why does this transformers class even exist.
So a little bit of historical context.
I feel like Bilbo over there.
I joined like telling you guys about this.
I don't know if you guys saw Lord of the Rings.
And basically, I joined AI in roughly 2012, the full course,
so maybe a decade ago.
And back then, you wouldn't even say
that you joined AI by the way.
That was like a dirty word.
Now, it's OK to talk about, but back then, it
was not even deep learning.
It was machine learning.
That was the term we would use if you were serious.
But now, now, AI is OK to use, I think.
So basically, do you even realize
how lucky you are potentially entering
this area in roughly 2023?
So back then, in 2011 or so when I was working specifically
on computer vision, your pipelines looked like this.
So you wanted to classify some images,
you would go to a paper, and I think this is representative.
You would have three pages in the paper describing
all kinds of a zoo, of kitchen sink,
of different kinds of features, descriptors.
And you would go to a poster session
at a computer vision conference,
and everyone would have their favorite feature descriptor
that they're proposing.
And it's totally ridiculous, and you
would take notes on which one you should incorporate
into your pipeline because you would extract all of them,
and then you would put an SVM on top.
So that's what you would do.
So there's two pages.
Make sure you get your sparse SIFT histograms,
your SSIMs, your color histograms, textons,
tiny images.
And don't forget the geometry specific histograms.
All of them have basically complicated code by themselves.
So you're collecting code from everywhere and running it,
and it was a total nightmare.
So on top of that, it also didn't work.
[LAUGHTER]
So this would be, I think, a representative prediction
from that time.
You would just get predictions like this once in a while,
and you'd be like, you just shrug your shoulders
like that just happens once in a while.
Today, you would be looking for a bug.
And worse than that, every single chunk of AI
had their own completely separate vocabulary
that they work with.
So if you go to NLP papers, those papers
would be completely different.
So you're reading the NLP paper, and you're like,
what is this part of speech tagging,
morphological analysis, syntactic parsing,
co-reference resolution?
What is MPBTKJ?
And you're confused.
So the vocabulary and everything was completely different.
And you couldn't read papers, I would
say, across different areas.
So now, that changed a little bit
starting 2012 when Alex Krizhevsky and colleagues basically
demonstrated that if you scale a large neural network
on a large data set, you can get very strong performance.
And so up till then, there was a lot of focus on algorithms.
But this showed that actually neural nets scale very well.
So you need to now worry about compute and data,
and you can scale it up.
It works pretty well.
And then that recipe actually did copy paste
across many areas of AI.
So we start to see neural networks pop up everywhere
since 2012.
So we saw them in computer vision, and NLP, and speech,
and translation, and RL, and so on.
So everyone started to use the same kind of modeling
toolkit, modeling framework.
And now when you go to NLP, and you start reading papers there,
in machine translation, for example,
this is a sequence to sequence paper
which we'll come back to in a bit.
You start to read those papers, and you're like, OK,
I can recognize these words.
Like there's a neural network.
There's some parameters.
There's an optimizer, and it starts to read like things
that you know of.
So that decreased tremendously the barrier to entry
across the different areas.
And then, I think, the big deal is
that when the transformer came out in 2017,
it's not even just that the toolkits and the neural networks
were similar-- it's that literally the architectures
converged to like one architecture that you
copy paste across everything seemingly.
So this was kind of an unassuming machine translation
paper at the time, proposing the transformer architecture.
But what we found since then is that you can just basically
copy paste this architecture and use it everywhere.
And what's changing is the details of the data,
and the chunking of the data, and how you feed it in.
And that's a caricature, but it's
kind of like a correct first order statement.
And so now, papers are even more similar looking
because everyone's just using transformer.
And so this convergence was remarkable to watch
and unfolded over the last decade.
And it's pretty crazy to me.
What I find interesting is I think
this is some kind of a hint that we're maybe converging
to something that maybe the brain is doing
because the brain is very homogeneous and uniform
across the entire sheet of your cortex.
And OK, maybe some of the details are changing,
but those feel like hyperparameters
like a transformer.
But your auditory cortex and your visual cortex
and everything else looks very similar.
And so maybe we're converging to some kind
of a uniform powerful learning algorithm here.
Something like that, I think, is interesting and exciting.
OK, so I want to talk about where the transformer came
from briefly, historically.
So I want to start in 2003.
I like this paper quite a bit.
It was the first popular application of neural networks
to the problem of language modeling,
so predicting in this case, the next word
in the sequence, which allows you to build
generative models over text.
And in this case, they were using a multi-layer perceptron,
so a very simple neural net.
The neural net took three words and predicted the probability
distribution for the fourth word in a sequence.
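A minimal sketch of that kind of model, assuming a one-hidden-layer perceptron over concatenated word embeddings. The vocabulary size, layer widths, and random (untrained) weights are illustrative assumptions, not details from the 2003 paper.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim, hidden_dim, context = 10, 8, 32, 3  # illustrative sizes

# Parameters are randomly initialised here; a real model would train them.
E = rng.normal(size=(vocab_size, emb_dim)) * 0.1             # word embeddings
W1 = rng.normal(size=(context * emb_dim, hidden_dim)) * 0.1  # input -> hidden
b1 = np.zeros(hidden_dim)
W2 = rng.normal(size=(hidden_dim, vocab_size)) * 0.1         # hidden -> vocab
b2 = np.zeros(vocab_size)

def next_word_probs(word_ids):
    """Map three word ids to a probability distribution over the fourth word."""
    x = E[word_ids].reshape(-1)     # look up and concatenate the 3 embeddings
    h = np.tanh(x @ W1 + b1)        # hidden layer
    logits = h @ W2 + b2
    logits = logits - logits.max()  # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()              # softmax over the vocabulary

probs = next_word_probs([1, 4, 7])
print(probs.shape, probs.sum())
```

Sampling a word from `probs`, appending it to the context, and repeating is what turns this into a generative model over text.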
So this was well and good at this point.
Now, over time, people started to apply this
to machine translation.
So that brings us to sequence to sequence paper
from 2014 that was pretty influential,
and the big problem here was OK, we
don't just want to take three words and predict the fourth.
We want to predict how to go from an English sentence
to a French sentence.
And the key problem was OK, you can
have arbitrary number of words in English and arbitrary number
of words in French, so how do you
get an architecture that can process
this variably sized input?
And so here they used an LSTM, and there's basically
two chunks of this, which are covered by the slack, by this.
But basically you have an encoder LSTM on the left,
and it just consumes one word at a time
and builds up a context of what it has read.
And then that acts as a conditioning vector
to the decoder RNN or LSTM.
That basically goes chonk, chonk,
chonk for the next word in a sequence,
translating the English to French or something like that.
Now, the big problem with this, that people identified,
I think, very quickly and tried to resolve
is that there's what's called this encoder bottleneck.
So this entire English sentence that we are trying to condition
on is packed into a single vector
that goes from the encoder to the decoder.
And so this is just too much information
to potentially maintain in a single vector,
and that didn't seem correct.
And so people were looking around for ways
to alleviate the encoder bottleneck, as it
was called at the time.
And so that brings us to this paper,
Neural Machine Translation by Jointly Learning
to Align and Translate.
And here, just quoting from the abstract, "in this paper,
we conjecture that the use of a fixed-length vector
is a bottleneck in improving the performance
of this basic encoder-decoder architecture
and propose to extend this by allowing
a model to automatically soft search
for parts of a source sentence that are relevant to predicting
a target word without having to form
these parts as a hard segment explicitly."
So this was a way to look back to the words that
are coming from the encoder.
And it was achieved using this soft search.
So as you are decoding in the words
here, while you are decoding them,
you are allowed to look back at the words
at the encoder via this soft attention mechanism proposed
in this paper.
And so this paper, I think, is the first time that I saw,
basically, attention.
So your context vector that comes from the encoder
is a weighted sum of the hidden states
of the words in the encoding.
And then the weights of this sum come
from a softmax that is based on these compatibilities
between the current state as you're decoding
and the hidden states generated by the encoder.
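The weighted sum described here can be sketched as follows. This assumes an additive compatibility score (a small tanh layer followed by a dot product with a vector `v`); the names `W`, `U`, `v` and all sizes are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def softmax(x):
    x = x - x.max()  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def soft_attention_context(decoder_state, encoder_states, W, U, v):
    """Context vector: weighted sum of encoder hidden states."""
    # Compatibility between the current decoder state and each encoder state.
    scores = np.tanh(decoder_state @ W + encoder_states @ U) @ v  # shape (T,)
    weights = softmax(scores)            # weights sum to 1 over source positions
    context = weights @ encoder_states   # shape (H,)
    return context, weights

rng = np.random.default_rng(0)
T, H, D, A = 6, 8, 8, 16  # source length and illustrative layer sizes
encoder_states = rng.normal(size=(T, H))
decoder_state = rng.normal(size=D)
W = rng.normal(size=(D, A)) * 0.1
U = rng.normal(size=(H, A)) * 0.1
v = rng.normal(size=A)

context, weights = soft_attention_context(decoder_state, encoder_states, W, U, v)
print(weights.round(3), context.shape)
```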
And so this is the first time that really you
start to look at it, and this is the current modern equations
of the attention.
And I think this was the first paper that I saw it in.
It's the first time that the word
attention is used, as far as I know, to call this mechanism.
So I actually tried to dig into the details of the history
of the attention.
So the first author here, Dzmitry, I
had an email correspondence with him,
and I basically sent him an email.
I'm like, Dzmitry, this is really interesting.
Just rumors have taken over.
Where did you come up with the soft attention
mechanism that ends up being the heart of the transformer?
And to my surprise, he wrote me back this massive email, which
was really fascinating.
So this is an excerpt from that email.
So basically, he talks about how he was looking for a way
to avoid this bottleneck between the encoder and decoder.
He had some ideas about cursors that
traverse the sequences that didn't quite work out.
And then here, "so one day, I had this thought
that it would be nice to enable the decoder
RNN to learn to search where to put the cursor in the source
sequence.
This was sort of inspired by translation exercises
that learning English in my middle school involved.
Your gaze shifts back and forth between source and target
sequence as you translate."
So literally, I thought that this was kind of interesting,
that he's not a native English speaker,
and here, that gave him an edge in this machine translation
that led to attention and then led to transformer.
So that's really fascinating.
"I expressed the soft search as a softmax
and then weighted averaging of the [INAUDIBLE] states.
And basically, to my great excitement,
this worked from the very first try."
So really, I think, interesting piece of history.
And as it later turned out that the name of RNN search
was kind of lame, so the better name attention came
from Yoshua on one of the final passes
as they went over the paper.
So maybe Attention is All You Need
would have been called RNN Search is All You Need,
but we have Yoshua Bengio to thank
for a little bit of better name, I would say.
So apparently, that's the history
of this, which I thought was interesting.
OK, so that brings us to 2017, which is Attention
is All You Need.
So this attention component, which
in Dzmitry's paper was just one small segment,
and there's all this bidirectional RNN, RNN
and decoder, and this Attention is All You Need paper is saying,
OK, you can actually delete everything.
What's making this work very well
is just attention by itself.
And so delete everything, keep attention.
And then what's remarkable about this paper actually is usually,
you see papers that are very incremental.
They add one thing, and they show that it's better.
But I feel like Attention is All You
Need was like a mix of multiple things at the same time.
They were combined in a very unique way,
and then also achieve a very good local minimum
in the architecture space.
And so to me, this is really a landmark paper
that is quite remarkable and, I think,
had quite a lot of work behind the scenes.
So delete all the RNN, just keep attention.
Because attention operates over sets--
and I'm going to go to this in a second--
you now need to positionally encode your inputs
because attention doesn't have the notion of space by itself.
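A small check of that point, as a sketch: with no positional information, shuffling the input tokens just shuffles the outputs of self-attention in the same way, so token order carries no signal. The identity projections are a simplifying assumption, and the sinusoidal encoding follows the formula from the 2017 paper.

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention with identity projections (illustration only)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ x

def sinusoidal_positions(n, d):
    """Sinusoidal positional encoding (d must be even)."""
    pos = np.arange(n)[:, None]
    two_i = np.arange(0, d, 2)[None, :]
    angles = pos / (10000 ** (two_i / d))
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
perm = np.array([4, 3, 2, 1, 0])  # reverse the token order

# Without positions: permuting the inputs only permutes the outputs.
same_set = np.allclose(self_attention(x)[perm], self_attention(x[perm]))

# With positions added: the same tokens in a new order give different outputs.
pe = sinusoidal_positions(5, 8)
order_matters = not np.allclose(
    self_attention(x + pe)[perm], self_attention(x[perm] + pe)
)
print(same_set, order_matters)
```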
I have to be very careful.
They adopted this residual network structure
from ResNets.
They interspersed attention with multi-layer perceptrons.
They used layer norms, which came from a different paper.
They introduced the concept of multiple heads of attention
that were applied in parallel.
And they gave us, I think, like a fairly good set
of hyperparameters that to this day are used.
So the expansion factor in the multi-layer perceptron goes up
by 4X--
and we'll go into a bit more detail--
and this 4X has stuck around.
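As a sketch of what the 4X refers to: the per-token MLP widens each vector to four times the model width, applies a nonlinearity, and projects back. The ReLU and the sizes below are assumptions chosen for illustration, with random untrained weights.

```python
import numpy as np

def feed_forward(x, W_in, b_in, W_out, b_out):
    """Position-wise MLP: expand to 4x the model width, ReLU, project back."""
    hidden = np.maximum(0.0, x @ W_in + b_in)  # (tokens, 4 * d_model)
    return hidden @ W_out + b_out              # (tokens, d_model)

d_model, expansion = 8, 4
rng = np.random.default_rng(0)
W_in = rng.normal(size=(d_model, expansion * d_model)) * 0.1
b_in = np.zeros(expansion * d_model)
W_out = rng.normal(size=(expansion * d_model, d_model)) * 0.1
b_out = np.zeros(d_model)

x = rng.normal(size=(5, d_model))  # 5 tokens
y = feed_forward(x, W_in, b_in, W_out, b_out)
print(W_in.shape, y.shape)
```

Each row of `x` is processed independently, so this step involves no communication between tokens.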
And I believe there's a number of papers
that try to play with all kinds of little details
of the transformer, and nothing sticks because this is actually
quite good.
The only thing to my knowledge that didn't stick
was this reshuffling of the layer norms
to go into the prenorm version where here you
see the layer norms are after the multiheaded attention feed
forward.
They just put them before instead.
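The two orderings can be sketched side by side. Here `sublayer` stands for either attention or the MLP, and the layer norm omits the learned scale and shift; both are simplifications made for this illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    """2017 ordering: add the residual first, then apply layer norm."""
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    """Pre-norm ordering: layer norm on the sublayer input only."""
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
toy_sublayer = lambda h: 0.5 * h  # hypothetical stand-in for attention or the MLP

print(post_norm_block(x, toy_sublayer).shape, pre_norm_block(x, toy_sublayer).shape)
```

In the pre-norm version the residual path from input to output is never normalised, which is the structural difference between the two.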
So just reshuffling of layer norms, but otherwise,
the GPTs and everything else that you're seeing today
is basically the 2017 architecture from 5 years ago.
And even though everyone is working on it,
it's been proven remarkably resilient,
which I think is real interesting.
There are innovations that, I think,
have been adopted also in positional encoding.
It's more common to use different rotary and relative
positional encoding and so on.
So I think there have been changes, but for the most part,
it's proven very resilient.
So really quite an interesting paper.
Now, I wanted to go into the attention mechanism.
And I think, the way I interpret it is not similar to the ways
that I've seen it presented before.
So let me try a different way of how I see it.
Basically, to me, attention is kind of like the communication
phase of the transformer, and the transformer
interweaves two phases: the communication phase, which
is the multi-headed attention, and the computation
phase, which is this multi-layer perceptron
or [INAUDIBLE].
So in the communication phase, it's
really just a data dependent message
passing on directed graphs.
And you can think of it as OK, forget everything
with machine translation, everything.
Let's just-- we have directed graphs.
At each node, you are storing a vector.
And then let me talk now about the communication
phase of how these vectors talk to each other
in this directed graph.
And then the compute phase later is just
a multi-layer perceptron, which then basically acts on every node
individually.
But how do these nodes talk to each other
in this directed graph?
So I wrote like some simple Python--
I wrote this in Python basically to create
one round of communication of using attention
as the message passing scheme.
So here, a node has this private data vector,
and you can think of it as private information
to this node.
And then it can also emit a key, a query, and a value.
And simply, that's done by linear transformation
from this node.
So the key is what are the things that I am--
sorry.
The query is what are the things that I'm looking for?
The key is what are the things that I have?
And the value is what are the things that I will communicate?
And so then when you have your graph that's
made up of nodes and some random edges, when you actually
have these nodes communicating, what's happening is
you loop over all the nodes individually
in some random order, and you're at some node,
and you get the query vector q, which
is, I'm a node in some graph, and this
is what I'm looking for.
And so that's just achieved via this linear transformation
here.
And then we look at all the inputs that point to this node,
and then they broadcast what are the things that I have,
which is their keys.
So they broadcast the keys.
I have the query, then those interact by dot product
to get scores.
So basically, simply by doing dot product,
you get some unnormalized weighting
of the interestingness of all of the information in the nodes
that point to me and to the things I'm looking for.
And then when you normalize that with softmax,
so it just sums to 1, you basically just
end up using those scores, which now sum to 1 and are a probability
distribution, and you do a weighted sum of the values
to get your update.
So I have a query.
They have keys, dot products to get interestingness or like
affinity, softmax to normalize it, and then
weighted sum of those values flow to me and update me.
And this is happening for each node individually.
And then we update at the end.
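One round of this communication can be sketched in NumPy as below. This is a reconstruction under stated assumptions, not the code shown on the slide: the class name, the graph sizes, and the choice to share one set of query, key, and value matrices across all nodes are illustrative.

```python
import numpy as np

class Graph:
    """Toy directed graph whose nodes exchange information via attention."""

    def __init__(self, n_nodes=6, dim=4, n_edges=12, seed=0):
        rng = np.random.default_rng(seed)
        self.data = rng.normal(size=(n_nodes, dim))  # private vector per node
        # Linear maps that produce query, key and value (shared by all nodes).
        self.w_query = rng.normal(size=(dim, dim)) * 0.5
        self.w_key = rng.normal(size=(dim, dim)) * 0.5
        self.w_value = rng.normal(size=(dim, dim)) * 0.5
        # Random directed edges stored as (source, target) pairs.
        self.edges = [(int(rng.integers(n_nodes)), int(rng.integers(n_nodes)))
                      for _ in range(n_edges)]

    def attention_weights(self, i):
        """Softmax-normalised affinity between node i and the nodes pointing at it."""
        sources = [src for src, dst in self.edges if dst == i]
        if not sources:
            return sources, np.array([])
        q = self.data[i] @ self.w_query         # what am I looking for?
        keys = self.data[sources] @ self.w_key  # what do the senders have?
        scores = keys @ q                       # dot-product affinity
        scores = np.exp(scores - scores.max())
        return sources, scores / scores.sum()

    def run(self):
        """One round: compute every node's update, then apply them all at the end."""
        updates = np.zeros_like(self.data)
        for i in range(len(self.data)):
            sources, weights = self.attention_weights(i)
            if sources:
                values = self.data[sources] @ self.w_value  # what senders communicate
                updates[i] = weights @ values               # weighted sum of values
        self.data = self.data + updates

g = Graph()
g.run()
print(g.data.shape)
```

Nodes with no incoming edges receive no message in this sketch and keep their vector unchanged.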
And so this kind of a message passing scheme
is at the heart of the transformer.
And it happens in the more vectorized batched way
that is more confusing and is also interspersed with layer
norms and things like that to make the training behave
better.
But that's roughly what's happening in the attention
mechanism, I think, on a high level.
So yeah, so in the communication phase of the transformer, then
this message passing scheme happens
in every head in parallel and then in every layer in series
and with different weights each time.
And that's it as far as the multi-headed attention goes.
And so if you look at these encoder-decoder models,
you can think of it then in terms of the connectivity
of these nodes in the graph.
You can think of it as like, OK, all these tokens that
are in the encoder that we want to condition on,
they are fully connected to each other.
So when they communicate, they communicate fully
when you calculate their features.
But in the decoder, because we are
trying to have a language model, we
don't want to have communication from future tokens
because they give away the answer at this step.
So the tokens in the decoder are fully connected
from all the encoder states, and then they
are also fully connected from everything that is decoding.
And so you end up with this triangular structure
in the data graph.
But that's the message passing scheme
that this basically implements.
And then you have to be also a little bit careful because
in the cross attention here with the decoder,
you consume the features from the top of the encoder.
So think of it as in the encoder,
all the nodes are looking at each other,
all the tokens are looking at each other many, many times.
And they really figure out what's in there,
and then the decoder is looking only at the top nodes.
So that's roughly the message passing scheme.
I was going to go into more of an implementation
of a transformer.
I don't know if there's any questions about this.
[INAUDIBLE] self-attention and multi-headed attention,
but what is the advantage of [INAUDIBLE]?
Yeah, so self-attention and multi-headed attention, so
the multi-headed attention is just this attention scheme,
but it's just applied multiple times in parallel.
Multiple heads just means independent applications
of the same attention.
So this message passing scheme basically just
happens in parallel multiple times
with different weights for the query, key, and value.
So you can almost look at it like in parallel, I'm
looking for, I'm seeking different kinds of information
from different nodes.
And I'm collecting it all in the same node.
It's all done in parallel.
So heads is really just copy-paste in parallel.
And layers are copy-paste but in series.
Maybe that makes sense.
And self-attention, when it's self-attention,
what it's referring to is that the node here
produces each node here.
So as I described it here, this is really self-attention
because every one of these nodes produces
a key, a query, and a value from this individual node.
When you have cross-attention, you have one cross-attention
here, coming from the encoder.
That just means that the queries are still
produced from this node, but the keys and the values
are produced as a function of nodes that
are coming from the encoder.
So I have my queries because I'm trying to decode some--
the fifth word in the sequence.
And I'm looking for certain things
because I'm the fifth word.
And then the keys and the values in terms
of the source of information that could answer my queries
can come from the previous nodes in the current decoding
sequence or from the top of the encoder.
So all the nodes that have already seen all
of the encoding tokens many, many times can now broadcast
what they contain in terms of information.
So I guess, to summarize, the self-attention is--
sorry, cross-attention and self-attention
only differ in where the keys and the values come from.
Either the keys and values are produced from this node,
or they are produced from some external source like an encoder
and the nodes over there.
But algorithmically, it is the same mathematical operations.
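A small sketch of that point, in plain Python: one `attention` function serves both cases, and the only thing that changes is which nodes the keys and values are computed from. The identity matrices stand in for learned query, key, and value weights, the node vectors are made up, and no causal mask is applied here; all of these are simplifying assumptions for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attention(targets, sources, Wq, Wk, Wv):
    """Queries come from `targets`; keys and values come from `sources`."""
    queries = [matvec(Wq, t) for t in targets]
    keys = [matvec(Wk, s) for s in sources]
    values = [matvec(Wv, s) for s in sources]
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) for k in keys])
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

identity = [[1.0, 0.0], [0.0, 1.0]]           # stand-in for learned weights
decoder_nodes = [[1.0, 0.0], [0.0, 1.0]]      # hypothetical decoder features
encoder_top = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]  # hypothetical encoder output

# Self-attention: keys and values come from the same nodes as the queries.
self_out = attention(decoder_nodes, decoder_nodes, identity, identity, identity)
# Cross-attention: same math, but keys and values come from the encoder.
cross_out = attention(decoder_nodes, encoder_top, identity, identity, identity)
```

In both calls there is one output vector per query node, regardless of how many source nodes there are.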
Question.
Yeah, OK.
So two questions for you.
First question is, in the message passing [INAUDIBLE]
So think of-- so each one of these nodes is a token.
I guess I don't have a very good picture of it
in the transformer.
But this node here could represent the third word
in the output in the decoder, and in the beginning,
it is just the embedding of the word.
And then, OK, I have to think through this analogy
a little bit more.
I came up with it this morning.
[LAUGHTER]
[INAUDIBLE]
What example of instantiation [INAUDIBLE] nodes
as in in blocks were embedding?
These nodes are basically the vectors.
I'll go to an implementation.
I'll go to the implementation, and then maybe I'll
make the connections to the graph.
So let me try to first go to-- let me now go to,
with this intuition in mind, at least,
to a nanoGPT, which is a concrete implementation
of a transformer that is very minimal.
So I worked on this over the last few days,
and here it is reproducing GPT-2 on OpenWebText.
So it's a pretty serious implementation that reproduces
GPT-2, I would say, provided enough compute--
This was one node of 8 GPUs for 38 hours or something
like that, if I remember correctly.
And it's very readable.
It's 300 lines, so everyone can take a look at it.
And yeah, let me basically briefly step through it.
So let's try to have a decoder-only transformer.
So what that means is that it's a language model.
It tries to model the next word in the sequence
or the next character in the sequence.
So the data that we train on this
is always some kind of text.
So here's some fake Shakespeare.
Sorry, this is real Shakespeare.
We're going to produce fake Shakespeare.
So this is called a Tiny Shakespeare
dataset, which is one of my favorite toy datasets.
You take all of Shakespeare, concatenate it,
and it's a 1-megabyte file, and then
you can train language models on it
and get infinite Shakespeare, if you like,
which I think is kind of cool.
So we have a text.
The first thing we need to do is we
need to convert it to a sequence of integers
because transformers natively process--
you can't plug text into a transformer.
You need to somehow encode it.
So the way that encoding is done is
we convert, for example, in the simplest case,
every character gets an integer, and then instead of "hi
there," we would have this sequence of integers.
So then you can encode every single character as an integer
and get a massive sequence of integers.
You just concatenate it all into one
large, long one-dimensional sequence.
And then you can train on it.
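A minimal sketch of that character-level encoding in plain Python. The vocabulary here is built only from the string "hi there", so the integers will differ from the ones on the lecture slide, which come from the full Tiny Shakespeare text; the names `stoi`, `itos`, `encode`, and `decode` are assumptions for this example.

```python
text = "hi there"

# Build the vocabulary: every distinct character gets an integer.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer
itos = {i: ch for ch, i in stoi.items()}       # integer -> character

def encode(s):
    """Turn a string into a list of integers, one per character."""
    return [stoi[c] for c in s]

def decode(ids):
    """Turn a list of integers back into a string."""
    return "".join(itos[i] for i in ids)

ids = encode(text)
```

With a real corpus you would build `chars` from the whole file once and then encode the entire text into one long list of integers.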
Now, here, we only have a single document.
In some cases, if you have multiple independent documents,
what people like to do is create special tokens,
and they intersperse those documents
with those special end of text tokens
that they splice in between to create boundaries.
But those boundaries actually don't have any modeling impact.
It's just that the transformer is supposed
to learn via backpropagation that the end of document
sequence means that you should wipe the memory.
OK, so then we produce batches.
So these batches of data just mean
that we go back to the one-dimensional sequence,
and we take out chunks of this sequence.
So say, if the block size is 8, then the block size indicates
the maximum length of context that your transformer will
process.
So if our block size is 8, that means
that we are going to have up to eight characters of context
to predict the ninth character in a sequence.
And the batch size indicates how many sequences in parallel
we're going to process.
And we want this to be as large as possible,
so we're fully taking advantage of the GPU
and the parallels [INAUDIBLE] So in this example,
we're doing 4 by 8 batches.
So every row here is an independent example
and then every row here is a small chunk of the sequence
that we're going to train on.
And then we have both the inputs and the targets
at every single point here.
So to fully spell out what's contained in a single 4
by 8 batch to the transformer--
I sort of compact it here--
so when the input is 47, by itself, the target is 58.
And when the input is the sequence 47, 58,
the target is 1.
And when it's 47, 58, 1, the target is 51 and so on.
So actually, the single batch of examples that's 4 by 8
actually has a ton of individual examples
that we are expecting a transformer
to learn on in parallel.
And so you'll see that the batches are learned
on completely independently, but the time dimension here along
horizontally is also trained on in parallel.
So your real batch size is more like B times T.
And it's just that the context grows linearly
for the predictions that you make along the T direction
in the model.
So this is all the examples that the model will learn from,
this single batch.
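Here is a rough sketch of how one 4 by 8 batch unpacks into B times T individual training examples. It uses plain Python lists instead of tensors, and the data is a stand-in sequence of consecutive integers, so the `get_batch` helper and all values are assumptions for illustration rather than the lecture's code.

```python
import random

def get_batch(data, batch_size, block_size, rng):
    """Sample random chunks; targets are the inputs shifted by one position."""
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[s:s + block_size] for s in starts]
    y = [data[s + 1:s + block_size + 1] for s in starts]
    return x, y

data = list(range(100, 150))   # stand-in for the long encoded text
rng = random.Random(0)
batch_size, block_size = 4, 8
xb, yb = get_batch(data, batch_size, block_size, rng)

# Spell out every (context, target) pair packed into the batch.
examples = []
for b in range(batch_size):          # rows are independent
    for t in range(block_size):      # context grows along the time dimension
        context = xb[b][:t + 1]
        target = yb[b][t]
        examples.append((context, target))
```

Each row contributes eight examples with contexts of length 1 through 8, so the batch holds 32 examples in total.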
So now, this is the GPT class.
And because this is a decoder-only model,
so we're not going to have an encoder because there's no
English we're translating from--
we're not trying to condition on some other external
information.
We're just trying to produce a sequence of words that
follow each other or likely to.
So this is all PyTorch, and I'm going slightly faster
because I'm assuming people have taken 231N or something
along those lines.
But here in the forward pass, we take these indices,
and then we both encode the identity of the indices,
just via an embedding lookup table.
So every single integer, we index into a lookup table of
vectors in this nn.Embedding and pull out
the word vector for that token.
And then because the transformer by itself
doesn't actually-- it processes sets natively.
So we need to also positionally encode these vectors
so that we basically have both the information
about the token identity and its place in the sequence from 1
to block size.
Now, the information about what and where
is combined additively, so the token embeddings
and the positional embeddings are just added exactly as here.
So then there's optional dropout,
this x here basically just contains
the set of words and their positions,
and that feeds into the blocks of transformer.
And we're going to look into what's block here.
But for here, for now, this is just a series
of blocks in a transformer.
And then in the end, there's a layer norm,
and then you're decoding the logits
for the next word or next integer in a sequence,
using the linear projection of the output of this transformer.
So LM head here, short for language model head.
It's just a linear function.
So basically, positionally encode all the words,
feed them into a sequence of blocks,
and then apply a linear layer to get the probability
distribution for the next character.
And then if we have the targets, which
we produced in the data loader--
and you'll notice that the targets are just
the inputs offset by one in time--
then those targets feed into a cross entropy loss.
So this is just a negative log likelihood
typical classification loss.
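As a rough sketch of that forward pass, the plain-Python code below adds token and position embeddings, applies a linear head, and computes the cross entropy against targets shifted by one. It deliberately leaves out the transformer blocks and the final layer norm, uses tiny made-up sizes, and the names `wte`, `wpe`, and `lm_head` are assumptions chosen for this example.

```python
import math
import random

rng = random.Random(0)
vocab_size, block_size, n_embd = 5, 8, 4

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

wte = rand_matrix(vocab_size, n_embd)      # token embedding table ("what")
wpe = rand_matrix(block_size, n_embd)      # position embedding table ("where")
lm_head = rand_matrix(vocab_size, n_embd)  # final linear projection to logits

def forward(idx, targets=None):
    # Token identity and position are combined additively.
    x = [[a + b for a, b in zip(wte[tok], wpe[pos])]
         for pos, tok in enumerate(idx)]
    # (The sequence of blocks and the final layer norm would go here.)
    logits = [[sum(w * xi for w, xi in zip(row, vec)) for row in lm_head]
              for vec in x]
    loss = None
    if targets is not None:
        # Cross entropy: mean negative log likelihood of the target tokens.
        total = 0.0
        for row, tgt in zip(logits, targets):
            m = max(row)
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[tgt]
        loss = total / len(targets)
    return logits, loss

tokens = [0, 3, 1, 4, 2]
# Targets are the inputs offset by one in time.
logits, loss = forward(tokens[:-1], tokens[1:])
```

With small random weights the predicted distribution is nearly uniform, so the loss starts close to the log of the vocabulary size.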
So now let's drill into what's here in the blocks.
So these blocks that are applied sequentially,
there's, again, as I mentioned, this communicate
phase and the compute phase.
So in the communicate phase, all the nodes
get to talk to each other, and so these nodes are basically,
if our block size is 8, then we are
going to have eight nodes in this graph.
There's eight nodes in this graph.
The first node is pointed to only by itself.
The second node is pointed to by the first node and itself.
The third node is pointed to by the first two nodes
and itself, et cetera.
So there's eight nodes here.
So you apply-- there's a residual pathway in x.
You take it out.
You apply a layer norm, and then the self-attention
so that these communicate, these eight nodes communicate.
But you have to keep in mind that the batch is 4.
So because batch is 4, this is also applied--
so we have eight nodes communicating,
but there's a batch of four of them, each individually communicating
among its own eight nodes.
There's no crisscross across the batch dimension, of course.
There's no batch norm anywhere, luckily.
And then once they've exchanged information,
they are processed using the multi-layer perceptron.
And that's the compute phase.
And then also here we are missing the cross-attention
because this is a decoder-only model.
So all we have is this step here,
the multi-headed attention, and that's
this line, the communicate phase.
And then we have the feed forward, which is the MLP,
and that's the compute phase.
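The two-phase structure of a block can be sketched in plain Python as below. The `communicate` and `compute` arguments are stand-ins: a causal running average replaces real attention and a bare ReLU replaces the two-layer MLP, and the layer norm omits its learned scale and bias. These are all simplifying assumptions, meant only to show the residual wiring.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one node's features (learned scale and bias omitted)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def block(x, communicate, compute):
    """x is a list of node vectors; returns the updated node vectors."""
    # Communicate phase: nodes exchange information, added to the residual.
    messages = communicate([layer_norm(v) for v in x])
    x = [[a + b for a, b in zip(v, m)] for v, m in zip(x, messages)]
    # Compute phase: each node is processed individually, added to the residual.
    x = [[a + b for a, b in zip(v, compute(layer_norm(v)))] for v in x]
    return x

def causal_average(nodes):
    """Stand-in for attention: each node averages itself and earlier nodes."""
    out = []
    for t in range(len(nodes)):
        visible = nodes[:t + 1]
        out.append([sum(col) / len(visible) for col in zip(*visible)])
    return out

def tiny_mlp(vec):
    """Stand-in for the MLP: just a ReLU on each feature."""
    return [max(0.0, v) for v in vec]

x = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]
y = block(x, causal_average, tiny_mlp)
```

Because the stand-in communication step only looks backward, changing a later node leaves the outputs of earlier nodes untouched.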
I'll take questions a bit later.
Then the MLP here is fairly straightforward.
The MLP is just individual processing on each node,
just transforming the feature representation at that node.
So applying a two-layer neural net
with a GELU nonlinearity, which you can just
think of as a ReLU or something like that.
It's just a nonlinearity.
And then MLP is straightforward.
I don't think there's anything too crazy there.
And then this is the causal self-attention part,
the communication phase.
So this is like the meat of things
and the most complicated part.
It's only complicated because of the batching
and the implementation detail of how you mask the connectivity
in the graph so that you can't obtain
any information from the future when
you're predicting your token.
Otherwise, it gives away the information.
So if I'm the fifth token and if I'm the fifth position,
then I'm getting the fourth token coming into the input,
and I'm attending to the third, second, and first,
and I'm trying to figure out what is the next token.
Well then, in this batch, in the next element
over in the time dimension, the answer is at the input.
So I can't get any information from there.
So that's why this is all tricky,
but basically, in the forward pass,
we are calculating the queries, keys, and values based on x.
So these are the keys, queries, and values.
Here, when I'm computing the attention,
I have the queries matrix multiplying the keys.
So this is the dot product in parallel for all the queries
and all the keys in all the heads.
So I failed to mention that there's also
the aspect of the heads, which is also done all in parallel
here.
So we have the batch dimension, the time dimension,
and the head dimension, and you end up
with five-dimensional tensors, and it's all really confusing.
So I invite you to step through it later and convince yourself
that this is actually doing the right thing.
But basically, you have the batch dimension, the head
dimension and the time dimension,
and then you have features at them.
And so this is evaluating for all the batch elements, for all
the head elements, and all the time elements,
the simple Python that I gave you earlier, which is query
dot product key.
Then here, we do a masked_fill, and what this is doing
is it's basically clamping the attention between the nodes
that are not supposed to communicate to be negative
infinity.
And we're doing negative infinity
because we're about to softmax, and so negative infinity will
basically make the attention at those elements be zero.
And so here we are going to basically end up
with the weights, the affinities between these nodes, optional
dropout.
And then here, attention matrix multiply v is basically
the gathering of the information according to the affinities
we calculated.
And this is just a weighted sum of the values
at all those nodes.
So this matrix multiply is doing that weighted sum.
And then transpose contiguous view
because it's all complicated and batched
in five-dimensional tensors, but it's really not
doing anything, optional dropout,
and then a linear projection back to the residual pathway.
So this is implementing the communication phase here.
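Stripped of batching, the masking step can be sketched in plain Python as follows. This is not the nanoGPT code: it uses nested lists instead of tensors, skips the learned query, key, value, and output projections (the input is used directly for all three), and omits dropout. The 1/sqrt(d) scaling follows the original paper, and the `causal` flag is an assumption added to show what removing the mask does.

```python
import math

def attention_weights(q, k, causal=True):
    """q, k: lists of T vectors. Returns a T x T matrix of attention weights."""
    T, d = len(q), len(q[0])
    weights = []
    for i in range(T):
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in range(T)]
        if causal:
            # Plays the role of masked_fill: future positions get -infinity.
            scores = [s if j <= i else float("-inf")
                      for j, s in enumerate(scores)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]   # exp(-inf) is exactly 0.0
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights

def attend(q, k, v, causal=True):
    """Weighted sum of the values according to the affinities."""
    w = attention_weights(q, k, causal)
    return [[sum(w[i][j] * v[j][c] for j in range(len(v)))
             for c in range(len(v[0]))] for i in range(len(q))]

def split_heads(x, n_head):
    """Give each head its own slice of every node's features."""
    hs = len(x[0]) // n_head
    return [[vec[h * hs:(h + 1) * hs] for vec in x] for h in range(n_head)]

def multi_head(x, n_head, causal=True):
    heads = [attend(h, h, h, causal) for h in split_heads(x, n_head)]
    # Concatenate the head outputs back together for each node.
    return [sum((head[t] for head in heads), []) for t in range(len(x))]

x = [[1.0, 0.0, 0.5, 0.5],
     [0.0, 1.0, 0.2, 0.8],
     [1.0, 1.0, 0.9, 0.1],
     [0.5, 0.5, 0.3, 0.3]]
w = attention_weights(x, x)
out = multi_head(x, n_head=2)
```

The weight matrix comes out lower triangular, so the first node can only attend to itself. Calling the same functions with `causal=False` lets every node attend to every other node.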
Then you can train this transformer.
And then you can generate infinite Shakespeare.
And you will simply do this by--
because our block size is 8, we start with some token.
Say, in this case, you
can use something like a newline as the start token.
And then you communicate only to yourself
because there's a single node, and you
get the probability distribution for the first word
in the sequence.
And then you decode it for the first character
in the sequence.
You decode the character.
And then you bring back the character,
and you re-encode it as an integer.
And now, you have the second thing.
And so you get--
OK, we're at the first position, and this
is whatever integer it is, add the positional encodings,
goes into the sequence, goes in the transformer,
and again, this token now communicates
with the first token and its identity.
And so you just keep plugging it back.
And once you run out of the block size, which is eight,
you start to crop, because you can never
have block size more than eight in the way you've
trained this transformer.
So we have more and more context until eight.
And then if you want to generate beyond eight,
you have to start cropping because the transformer only
works for eight elements in time dimension.
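The sampling loop with cropping can be sketched like this in plain Python. The `toy_model` is a hypothetical stand-in for a trained transformer: it always predicts "previous token plus one", which keeps the example deterministic. Only the loop structure is the point here.

```python
import random

def generate(next_token_probs, start_tokens, max_new_tokens, block_size, rng):
    """Sample tokens one at a time, feeding each one back in as input."""
    tokens = list(start_tokens)
    for _ in range(max_new_tokens):
        # Crop: the model never sees more than block_size tokens of context.
        context = tokens[-block_size:]
        probs = next_token_probs(context)
        next_tok = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_tok)
    return tokens

seen_lengths = []   # records how much context the model saw at each step

def toy_model(context, vocab_size=5):
    """Hypothetical model: puts all probability on (last token + 1) mod vocab."""
    seen_lengths.append(len(context))
    probs = [0.0] * vocab_size
    probs[(context[-1] + 1) % vocab_size] = 1.0
    return probs

out = generate(toy_model, start_tokens=[0], max_new_tokens=12,
               block_size=8, rng=random.Random(0))
```

The context grows from one token up to eight and then stays at eight, because older tokens are cropped off the front.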
And so all of these transformers in the [INAUDIBLE] setting
have a finite block size or context length,
and in typical models, this will be 1,024 tokens or 2,048
tokens, something like that.
But these tokens are usually like BPE tokens,
or SentencePiece tokens, or WordPiece tokens.
There's many different encodings.
So it's not like that long.
And so that's why, I think, [INAUDIBLE].
We really want to expand the context size,
and it gets gnarly because the attention
is quadratic in the [INAUDIBLE] case.
Now, if you want to implement an encoder instead of a decoder
attention,
then all you have to do is this [INAUDIBLE]
and you just delete that line.
So if you don't mask the attention,
then all the nodes communicate to each other,
and everything is allowed, and information
flows between all the nodes.
So if you want to have the encoder here, just delete.
All the encoder blocks will use attention
where this line is deleted.
That's it.
So you're allowing whatever-- this encoder might store say,
10 tokens, 10 nodes, and they are all
allowed to communicate to each other going up the transformer.
And then if you want to implement cross-attention,
so you have a full encoder-decoder transformer,
not just a decoder-only transformer or a GPT.
Then we need to also add cross-attention in the middle.
So here, there is a self-attention piece where all
the--
there's a self-attention piece, a cross-attention piece,
and this MLP.
And in the cross-attention, we need
to take the features from the top of the encoder.
We need to add one more line here,
and this would be the cross-attention instead of a--
I should have implemented it instead of just pointing,
I think.
But there will be a cross-attention line here.
So we'll have three lines because we
need to add another block.
And the queries will come from x but the keys
and the values will come from the top of the encoder.
And there will basically be information
flowing from the encoder, strictly
to all the nodes inside x.
And then that's it.
So it's a very simple modification
on the decoder attention.
So you'll hear people talk that you have
a decoder-only model like GPT.
You can have an encoder-only model like BERT,
or you can have an encoder-decoder model
like say T5, doing things like machine translation.
And in BERT, you can't train it using this language modeling
setup that's autoregressive, where you're just
trying to predict next [INAUDIBLE] in the sequence.
You're training it using slightly different objectives.
You're putting in the full sentence,
and the full sentence is allowed to communicate fully.
And then you're trying to classify sentiment or something
like that.
So you're not trying to model the next token in the sequence.
So these are trained slightly differently
using masking and other denoising techniques.
OK.
So that's like the transformer.
I'm going to continue.
So yeah, maybe more questions.
[INAUDIBLE]
This is like we are enforcing these constraints on it
by just masking [INAUDIBLE]
So I'm not sure if I fully follow.
So there's different ways to look at this analogy,
but one analogy is you can interpret
this graph as really fixed.
It's just that every time we do the communicate,
we are using different weights.
You can look at it that way.
So if we have block size of eight in my example,
we would have eight nodes.
Here we have 2, 4, 6.
OK, so we'd have eight nodes.
They would be connected in--
you lay them out, and you only connect from left to right.
[INAUDIBLE]
Why would they connect-- usually,
the connections don't change as a function of the data
or something like that--
[INAUDIBLE]
I don't think I've seen a single example where
the connectivity changes dynamically
as a function of the data.
Usually, the connectivity is fixed.
If you have an encoder, and you're training a BERT,
you have how many tokens you want,
and they are fully connected.
And if you have a decoder-only model,
you have this triangular thing, and if you
have encoder-decoder, then you have
awkwardly two pools of nodes.
Yeah.
Go ahead.
[INAUDIBLE] I wonder, you know much more about this
than I know.
But do you have a sense of like if you ran [INAUDIBLE]
In my head, I'm thinking [INAUDIBLE] but then you also
have different things for one or more of [INAUDIBLE]--
Yeah, it's really hard to say, so that's
why I think this paper is so interesting because like, yeah,
usually, you'd see like the path,
and maybe they had path internally.
They just didn't publish it.
All you can see is things that didn't look like a transformer.
I mean, you have ResNets, which have lots of this.
But a ResNet would be like this, but there's
no self-attention component.
But the MLP is there kind of in a ResNet.
So a ResNet looks very much like this
except there's no-- you can use layer norms in ResNets,
I believe, as well.
Typically, sometimes, they can be batch norms.
So it is kind of like a ResNet.
It is like they took a ResNet, and they
put in a self-attention block in addition
to the preexisting MLP block, which
is kind of like convolutions.
And the MLP is, strictly speaking, a convolution,
a one-by-one convolution, but I think
the idea is similar in that the MLP is just a typical weights,
nonlinearity, weights operation.
But I will say, yeah, this is kind of interesting
because a lot of work is not there,
and then they give you this transformer.
And then it turns out 5 years later,
it's not changed, even though everyone's trying to change it.
So it's interesting to me that it's like a package,
in like a package, which I think is really
interesting historically.
And I also talked to paper authors,
and they were unaware of the impact
that the transformer would have at the time.
So when you read this paper, actually, it's unfortunate
because this is the paper that changed everything,
but when people read it, it's like question marks
because it reads like a pretty random machine translation
paper.
It's like, oh, we're doing machine translation.
Oh, here's a cool architecture.
OK, great, good results.
It doesn't know what's going to happen.
[LAUGHS] And so when people read it today,
I think they're confused potentially.
I will have some tweets at the end,
but I think I would have renamed it
with the benefit of hindsight of like, well, I'll get to it.
[INAUDIBLE]
Yeah, I think that's a good question as well.
Currently, I mean, I certainly don't
love the autoregressive modeling approach.
I think it's kind of weird to sample a token
and then commit to it.
So maybe there are some ways, some hybrids
with diffusion as an example, which
I think would be really cool, or we'll
find some other ways to edit the sequences later but still
in the autoregressive framework.
But I think diffusion is like an up-and-coming modeling
approach that I personally find much more appealing.
When I sample text, I don't go chunk, chunk, chunk,
and commit.
I do a draft one, and then I do a better draft two.
And that feels like a diffusion process.
So that would be my hope.
OK, also a question.
So yeah, you'd think the [INAUDIBLE]
And then once we have the edge weights,
we just have to multiply it by the values,
and then you just [INAUDIBLE] it.
Yes, yeah, it's right.
And you think there's an analogy with graph neural networks
and they'll potentially--
I find the graph neural networks like a confusing term
because, I mean, yeah, previously,
there was this notion of--
I feel like maybe today everything is a graph neural
network because a transformer is a graph neural network
processor.
The native representation that the transformer operates over
is sets that are connected by edges in a directed way.
And so that's the native representation, and then, yeah.
OK, I should go on because I still have 30 slides.
[INAUDIBLE]
Oh yeah, yeah, the root d. I think it's basically
like, if you're initializing with random weights
sampled from a [INAUDIBLE], as your dimension size grows,
so do your values; the variance grows.
And then your softmax will just become a one-hot vector.
So it's just a way to control the variance
and bring it to always be in a good range for softmax
and a nice diffuse distribution.
OK, so it's almost like an initialization thing.
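A quick numerical check of that argument, in plain Python. It assumes the query and key entries are independent draws from a unit Gaussian, which is an assumption made for this sketch since the distribution is inaudible in the recording. Under that assumption the variance of a raw dot product grows roughly linearly with the dimension d, and dividing by the square root of d brings it back to about 1.

```python
import math
import random

rng = random.Random(0)

def dot_variance(d, scale, trials=2000):
    """Empirical variance of q . k for random q and k with d dimensions."""
    samples = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d)]
        s = sum(a * b for a, b in zip(q, k))
        samples.append(s / math.sqrt(d) if scale else s)
    mean = sum(samples) / trials
    return sum((s - mean) ** 2 for s in samples) / trials

raw_var = dot_variance(64, scale=False)     # on the order of d
scaled_var = dot_variance(64, scale=True)   # on the order of 1
```

Large-variance scores push the softmax toward putting almost all of its weight on a single entry, which is the saturation the scaling is meant to avoid.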
OK, so transformers have been applied
to all the other fields, and the way this was done
is, in my opinion, in kind of ridiculous ways,
honestly, because I was a computer vision person,
and you have ConvNets, and they make sense.
So what we're doing now with ViTs as an example is
you take an image and you chop it up into little squares.
And then those squares, literally,
feed into a transformer, and that's
it, which is kind of ridiculous.
And so, I mean, yeah, and so the transformer
doesn't even, in the simplest case, really know where
these patches might come from.
They are usually positionally encoded,
but it has to rediscover a lot of the structure,
I think, of them in some ways.
And it's kind of weird to approach it that way.
But it's just the simplest baseline
of just chopping up big images into small squares
and feeding them in as the individual nodes actually
works fairly well.
And then this is in a transformer encoder,
so all the patches are talking to each other
throughout the entire transformer.
And the number of nodes here would be like nine.
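Chopping an image into squares can be sketched in a few lines of plain Python. The 6 by 6 single-channel "image" and the patch size of 2 are made-up values chosen so that nine patches come out; a real ViT would also project each flattened patch with a learned linear layer and add positional encodings, which this sketch leaves out.

```python
def to_patches(image, patch):
    """Split an H x W image (list of rows) into flattened patch x patch squares."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(flat)
    return patches

# Toy image: pixel value encodes its position, so patches are easy to inspect.
image = [[r * 6 + c for c in range(6)] for r in range(6)]
patches = to_patches(image, patch=2)   # each patch becomes one node
```

Each flattened patch is then treated like a token, in the same way a word embedding would be.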
Also, in speech recognition, you just take your mel spectrogram,
and you chop it up into slices and you feed them
into a transformer.
So there was a paper like this, but also Whisper.
Whisper is a copy-paste transformer.
If you saw Whisper from OpenAI, you just chop up the mel spectrogram
and feed it into a transformer and then pretend
you're dealing with text.
And it works very well.
Decision Transformer in RL: you take the states, actions,
and rewards that you experience in an environment,
and you just pretend it's a language.
Then you start to model the sequences of that,
and then you can use that for planning later.
That works really well.
Even things like AlphaFold, so we were briefly
talking about molecules and how you can plug them in.
So at the heart of AlphaFold, computationally,
is also a transformer.
One thing I wanted to also say about transformers
is I find that they're very flexible,
and I really enjoy that.
I'll give you an example from Tesla.
You have a ConvNet that takes an image
and makes predictions about the image.
And then the big question is, how do you
feed in extra information?
And it's not always trivial. Like say, I
had additional information that I
want the outputs to be informed by.
Maybe I have other sensors like Radar.
Maybe I have some map information, or a vehicle type,
or some audio.
And the question is, how do you feed information into a ConvNet?
Like where do you feed it in?
Do you concatenate it?
Do you add it?
At what stage?
And so with a transformer, it's much easier
because you just take whatever you want, you chop it
up into pieces, and you feed it in with a set
of what you had before.
And you let the self-attention figure out
how everything should communicate.
And that actually apparently works.
So just chop up everything and throw it into the mix
is like the way.
And it frees neural nets from this burden
of Euclidean space, where previously you
had to arrange your computation to conform to the Euclidean
space or three dimensions of how you're laying out the compute.
Like the compute actually kind of
happens in almost like 3D space if you think about it.
But in attention, everything is just sets.
So it's a very flexible framework,
and you can just throw in stuff into your conditioning set.
And everything is just self-attended over.
So it's quite beautiful from that perspective.
OK, so now what exactly makes transformers so effective?
I think a good example of this comes
from the GPT-3 paper, which I encourage people to read.
Language Models are Few-Shot Learners.
I would have probably renamed this a little bit.
I would have said something like transformers
are capable of in-context learning or meta-learning.
That's like what makes them really special.
So basically the setting that they're working with
is, OK, I have some context, and I'm
trying-- like say, a passage.
This is just one example of many.
I have a passage, and I'm asking questions about it.
And then as part of the context in the prompt,
I'm giving the questions and the answers.
So I'm giving one example of question-answer,
another example of question-answer,
another example of question-answer, and so on.
And this becomes--
Oh yeah, people are going to have to leave soon, huh?
OK, is this really important?
Let me think.
OK, so what's really interesting is basically
like with more examples given in a context,
the accuracy improves.
And so what that hints at is that the transformer
is able to somehow learn in the activations
without doing any gradient descent
in a typical fine-tuning fashion.
So if you fine-tune, you have to give an example and the answer,
and you fine-tune it, using gradient descent.
But it looks like the transformer internally
in its weights is doing something
that looks like potentially gradient, some kind
of a metalearning in the weights of the transformer
as it is reading the prompt.
And so in this paper, they go into, OK,
distinguishing this outer loop with stochastic gradient
descent from this inner loop of in-context learning.
So the inner loop is as the transformer is reading
the sequence almost and the outer loop is the training
by gradient descent.
So basically, there's some training
happening in the activations of the transformer
as it is consuming a sequence that
maybe very much looks like gradient descent.
And so there are some recent papers that hint at this
and study it.
And so as an example, in this paper
here, they propose something called the raw operator.
And they argue that the raw operator is implemented
by a transformer, and then they show
that you can implement things like ridge regression
on top of the raw operator.
And so this is giving--
There are papers hinting that maybe there
is something that looks like gradient-based learning
inside the activations of the transformer.
And I think this is not impossible to think through
because what is gradient-based learning?
Forward pass, backward pass, and then update.
Oh, that looks like a ResNet, right,
because you're adding to the weights.
So you start with an initial random set of weights,
forward pass, backward pass, and update your weights,
and then forward pass, backward pass, update the weights.
Looks like a ResNet.
A transformer is a ResNet. So it's much more hand-wavy,
but basically, some papers are trying
to hint at why that would be potentially possible.
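The structural analogy can be made concrete with a toy example in plain Python: gradient descent written so that each step adds a correction to the current weights, the same shape as a residual connection adding a block's output to its input. This sketch fits a single weight on made-up data and illustrates only the "forward, backward, update" pattern; it makes no claim about what a trained transformer actually computes internally.

```python
def grad_step(w, xs, ys, lr):
    """One step of gradient descent on mean squared error for y = w * x."""
    preds = [w * x for x in xs]                         # forward pass
    grad = sum(2 * (p - y) * x
               for p, x, y in zip(preds, xs, ys)) / len(xs)  # backward pass
    delta = -lr * grad
    return w + delta                                    # residual-style update

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]    # generated by the true weight 2.0

w = 0.0                 # initial weight
for _ in range(50):     # fifty stacked "layers" of the same update
    w = grad_step(w, xs, ys, lr=0.05)
```

Each iteration has the form "new state equals old state plus a function of the old state", which is exactly the form of a residual block.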
And then I have a bunch of tweets I just copy-pasted here
in the end.
This was like meant for general consumption,
so they're a bit more high-level and hypey a little bit.
But I'm talking about why this architecture is so interesting
and why potentially it became so popular.
And I think it simultaneously optimizes
three properties that, I think, are very desirable.
Number one, the transformer is very
expressive in the forward pass.
It's sort of like it's able to implement
very interesting functions, potentially functions
that can even do meta-learning.
Number two, it is very optimizable thanks
to things like residual connections, layer norms,
and so on.
And number three, it's extremely efficient.
This is not always appreciated, but the transformer,
if you look at the computational graph,
is a shallow, wide network, which
is perfect to take advantage of the parallelism of GPUs.
So I think the transformer was designed very deliberately
to run efficiently on GPUs.
There's previous work like Neural GPU
that I really enjoy as well, which is really just
like how do we design neural nets that are efficient on GPUs
and thinking backwards from the constraints of the hardware,
which I think is a very interesting way
to think about it.
Oh yeah, so here, I'm saying, I probably would have called--
I probably would've called the transformer a general purpose
efficient optimizable computer instead of Attention
Is All You Need.
That's what I would have maybe in hindsight called that paper.
It's proposing a model that is very general purpose, so
the forward pass is expressive.
It's very efficient in terms of GPU usage
and is easily optimizable by gradient descent and trains
very nicely.
And then I have some other hype tweets here.
Anyway, so you can read them later.
But I think this one is maybe interesting.
So if previous neural nets are special purpose computers
designed for a specific task, GPT
is a general purpose computer, reconfigurable at runtime
to run natural language programs.
So the programs are given as prompts,
and then GPT runs the program by completing the document.
So I really like these analogies personally to a computer.
It's just like a powerful computer,
and it's optimizable by gradient descent.
And I don't know--
OK, yeah.
That's it.
[LAUGHTER]
You can read the tweets later, but that's it for now.
I'll just thank you.
I'll just leave this up.
Sorry, I just found this tweet.
So turns out that if you scale up the training set
and use a powerful enough neural net like a transformer,
the network becomes a kind of general purpose
computer over text.
So I think that's a nice way to look at it.
And instead of performing a single text sequence,
you can design the sequence in the prompt.
And because the transformer is both powerful
and also is trained on a large enough, very hard data set,
it becomes this general purpose text computer.
And so I think that's kind of an interesting way to look at it.
Yeah.
[INAUDIBLE]
And I guess my question is [INAUDIBLE] how
much do you think [INAUDIBLE]?
really because it's mostly more efficient or [INAUDIBLE]
So I think there's a bit of that.
Yeah, so I would say RNNs in principle,
yes, they can implement arbitrary programs.
I think it's like a useless statement to some extent.
I'm not sure. They're probably expressive
in the sense of power, in that they can implement
these arbitrary functions.
But they're not optimizable.
And they're certainly not efficient because they
are serial computing devices.
So if you look at it as a compute graph,
RNNs are a very long, thin compute graph.
What if you stretched out the neurons and you looked--
like take all the individual neurons' interconnectivity,
and stretch them out, and try to visualize them.
RNNs would be like a very long graph and that's bad.
And it's bad also for optimizability
because I don't exactly know why,
but just the rough intuition is when you're backpropagating,
you don't want to make too many steps.
And so transformers are a shallow wide graph, and so
from supervision to inputs is a very small number of hops.
And there are these long residual pathways,
which make gradients flow very easily.
And there's all these layer norms
to control the scales of all of those activations.
And so there's not too many hops,
and you're going from supervision to input
very quickly and just flows through the graph.
And it can all be done in parallel,
so you don't need to do this--
with encoder and decoder RNNs, you have to go from first word,
then second word, then third word.
But here in the transformer, every single word
is processed completely in parallel, which is kind of a--
So I think all of these are really important.
And I think number 3 is less talked about but extremely
important because in deep learning scale matters.
And so the size of the network that you can train
is extremely important.
And so if it's efficient on the current hardware,
then you can make it bigger.
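A minimal sketch of the "long, thin" versus "shallow, wide" contrast described above, using NumPy. The sizes, weight names, and single-head, unmasked attention are illustrative assumptions, not the lecture's code. The RNN loop has to run its steps one after another, while the attention path computes all positions with a few matrix multiplies and adds the result back through a residual connection.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4                      # sequence length and feature size (illustrative)
x = rng.normal(size=(T, d))      # one vector per word

# RNN: each step needs the previous hidden state, so the T steps are serial.
W_h = rng.normal(size=(d, d)) * 0.1
W_x = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
rnn_states = []
for t in range(T):               # T dependent steps, one after another
    h = np.tanh(h @ W_h + x[t] @ W_x)
    rnn_states.append(h)
rnn_states = np.stack(rnn_states)

# Self-attention: every position is computed at once with matrix multiplies.
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v
scores = q @ k.T / np.sqrt(d)                    # (T, T) pairwise affinities
scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
attn_out = x + weights @ v                       # residual connection around attention
```

In the RNN, a gradient from the last output to the first input passes through `T` applications of `tanh` and `W_h`; in the attention path it passes through one layer, plus the identity route given by the residual.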
You mentioned that if you do it with multiple modalities
of data, [INAUDIBLE].
How does that actually work?
Do you leave the different data as different tokens,
or is it [INAUDIBLE]?
No, so yeah, so you take your image,
and you chop it up into patches.
So there's the first thousand tokens or whatever.
And now, I have a special--
so radar could be also, but I don't actually
want to make a representation of radar.
But you just need to chop it up and enter it.
And then you have to encode it somehow.
Like the transformer needs to know
that they're coming from radar.
So you create a special--
you have some kind of a special token for that, so
these radar tokens are slightly
different in the representation, and it's
learnable by gradient descent.
And like vehicle information would also
come in with a special embedding token that can be learned.
So--
So how do you align those before really--
Actually, but you don't.
It's all just a set.
And there's--
Even the [INAUDIBLE]
Yeah, it's all just a set, but you can positionally
encode these sets if you want.
So positional encoding means you can
hardwire, for example, the coordinates
like using [INAUDIBLE].
You can hardwire that, but it's better
if you don't hardwire the position.
It's just a vector that is always
hanging out at this location.
Whatever content is there, it just adds on to it.
And this vector is trainable by backprop.
That's how you do it.
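A minimal sketch of the recipe just described, assuming NumPy and made-up shapes: chop an image into patches, project each modality to a common width, add a per-modality vector so the transformer can tell image tokens from radar tokens, and add a per-slot positional vector. The names `type_emb`, `pos_emb`, `W_img`, and `W_radar` are hypothetical; in a real model they would be parameters trained by backprop rather than random draws.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                         # embedding size (illustrative)

image = rng.normal(size=(32, 32, 3))          # stand-in for a camera image
P = 8                                         # patch side length
# Chop the image into non-overlapping P x P patches and flatten each one.
patches = image.reshape(32 // P, P, 32 // P, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * 3)      # (16, 192)

radar = rng.normal(size=(5, 4))               # stand-in: 5 radar returns, 4 features each

# Per-modality input projections to the shared width d.
W_img = rng.normal(size=(P * P * 3, d)) * 0.02
W_radar = rng.normal(size=(4, d)) * 0.02

# Vectors that would be learned by gradient descent in a real model.
type_emb = {"image": rng.normal(size=d) * 0.02,
            "radar": rng.normal(size=d) * 0.02}
pos_emb = rng.normal(size=(patches.shape[0], d)) * 0.02   # one vector per patch slot

img_tokens = patches @ W_img + type_emb["image"] + pos_emb
radar_tokens = radar @ W_radar + type_emb["radar"]

# Everything goes in as one set of tokens; order carries no meaning by itself.
tokens = np.concatenate([img_tokens, radar_tokens], axis=0)
```

Because attention treats `tokens` as a set, the only thing that distinguishes a radar token from an image token, or one patch slot from another, is the vector added to it.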
Good point.
I don't really like the [INAUDIBLE].
They seem to work, but it seems like they're sometimes
[INAUDIBLE]
I'm not sure if I understand your question.
[LAUGHTER]
So I mean the positional encodings,
like they're actually like not--
OK, so they have very little inductive bias or something
like that.
They're just vectors always hanging out at a location,
and you're trying to help the network in some way.
And I think the intuition is good,
but if you have enough data, usually,
trying to mess with it is a bad thing.
Trying to inject knowledge when you
have enough knowledge in the data
set itself is not usually productive.
So it all really depends on what scale you want.
If you have infinite data, then you actually
want to encode less and less.
That turns out to work better.
And if you have very little data, then actually, you do
want to encode some biases.
And maybe if you have a much smaller data set, then
maybe convolutions are a good idea
because you actually have this bias coming from your filters.
But I think-- so the transformer is extremely general,
but there are ways to mess with the encodings
to put in more structure.
Like you could, for example, encode [INAUDIBLE] and fix it,
or you could actually go to the attention mechanism
and say, OK, if my image is chopped up into patches,
this patch can only communicate to this neighborhood.
And you just do that in the attention matrix,
you just mask out whatever you don't want to communicate.
And so people really play with this
because the full attention is inefficient.
So they will intersperse, for example, layers
that only communicate in little patches
and then layers that communicate globally.
And they will do all kinds of tricks like that.
So you can slowly bring in more inductive bias.
You can do that, but the inductive biases
are factored out from the core transformer.
They are factored out into the interconnectivity
of the nodes,
and they are factored out into the positional encodings,
and you can mess with these for computation.
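A minimal sketch of the masking idea described above, assuming NumPy, a square grid of patches, and a single attention head. The helper names `local_mask` and `masked_attention` are hypothetical. Each patch may attend only to patches within `radius` grid cells; everything else is masked out before the softmax.

```python
import numpy as np

def local_mask(grid, radius):
    """Boolean (N, N) mask, N = grid * grid: patch i may attend to patch j
    only if they are within `radius` cells of each other on the grid."""
    rows, cols = np.divmod(np.arange(grid * grid), grid)
    dr = np.abs(rows[:, None] - rows[None, :])
    dc = np.abs(cols[:, None] - cols[None, :])
    return np.maximum(dr, dc) <= radius

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention where disallowed pairs get zero weight."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)          # mask out forbidden pairs
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v, w

rng = np.random.default_rng(0)
grid, d = 4, 8
x = rng.normal(size=(grid * grid, d))                 # one token per patch
mask = local_mask(grid, radius=1)
out, w = masked_attention(x, x, x, mask)
```

Passing a mask of all `True` recovers full global attention, so a model can alternate layers that use `local_mask` with layers that communicate globally.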
[INAUDIBLE]
So there's probably about 200 papers on this now if not more.
They're kind of hard to keep track of.
Honestly, like my Safari browser, which is-- oh,
it's all up on my computer, like 200 open tabs.
But yes, I'm not even sure if I want
to pick my favorite honestly.
Yeah, [INAUDIBLE]
Maybe you can use a transformer like that [INAUDIBLE]
The other one that I actually like even more
is potentially, keep the context length fixed
but allow the network to somehow use a scratch pad.
And so the way this works is you will teach the transformer
somehow via examples in [INAUDIBLE] hey,
you actually have a scratch pad.
Basically, you can't remember too much.
Your context length is finite.
But you can use a scratch pad.
And you do that by emitting a start scratch pad,
and then writing whatever you want to remember, and then
end scratch pad.
And then you continue with whatever you want.
And then later when it's decoding,
you actually have special logic
that when you detect start scratch pad,
you will save whatever it puts
in there in some external store and allow it to attend over it.
So basically, you can teach the transformer just dynamically
because it's so meta-learned.
You can teach it dynamically to use other gizmos and gadgets
and allow it to expand its memory that way
if that makes sense.
It's just like a human learning to use a notepad, right?
You don't have to keep it in your brain.
So keeping things in your brain is like the context length
of the transformer.
But maybe we can just give it a notebook.
And then it can query the notebook, and read from it,
and write to it.
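A minimal sketch of the decoding-side bookkeeping for the scratch pad idea, in plain Python. The marker strings `<scratch>` and `</scratch>` and the function `split_scratchpad` are hypothetical stand-ins for whatever start and end tokens a model would be taught to emit. The sketch only shows how the notes get detected and saved; letting the model attend over the saved notes is not shown.

```python
START, END = "<scratch>", "</scratch>"    # hypothetical marker tokens

def split_scratchpad(tokens):
    """Separate a decoded token stream into visible output and saved notes."""
    visible, notes, current = [], [], None
    for tok in tokens:
        if tok == START:
            current = []                   # begin capturing a note
        elif tok == END and current is not None:
            notes.append(" ".join(current))
            current = None                 # back to normal output
        elif current is not None:
            current.append(tok)            # inside the scratch pad
        else:
            visible.append(tok)            # ordinary output token
    return visible, notes

stream = ["The", "total", "is", START, "17", "+", "25", "=", "42", END, "42", "."]
visible, notes = split_scratchpad(stream)
```

Here `notes` plays the role of the notebook: it lives outside the fixed context window and could be handed back to the model later.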
[INAUDIBLE] transformer to plug in another transformer.
[LAUGHTER]
[INAUDIBLE]
I don't know if I detected that.
I feel like-- did you feel like there was more than just
a long prompt that's unfolding?
Yeah, [INAUDIBLE]
I didn't try extensively, but I did see a [INAUDIBLE] event.
And I felt like the block size was just moved.
Maybe I'm wrong.
I don't actually know about the internals of ChatGPT.
We have two online questions.
So one question is, "what do you think about architecture
[INAUDIBLE]?"
S4?
S4.
I'm sorry.
I don't know S4.
Which one is this one?
The second question, this one's a personal question.
"What are you going to work on next?"
[INAUDIBLE]
I mean, so right now, I'm working on things like nanoGPT.
Where is nanoGPT?
I mean, I'm moving basically slightly away from computer vision
and computer vision-based products to do
a little bit in the language domain.
Where's ChatGPT?
OK, nanoGPT.
So originally, I had minGPT, which I rewrote to nanoGPT.
And I'm working on this.
I'm trying to reproduce GPTs, and I mean,
I think something like ChatGPT, I think,
incrementally improved in a product fashion
would be extremely interesting.
And I think a lot of people feel it,
and that's why it went so wide.
So I think there's something like a Google plus
plus plus to build that I think is more interesting.
Shall we give our speaker a round of applause?