Stanford CS25: V1 I Transformers United: DL Models that have revolutionized NLP, CV, RL
Summary
TLDR: This video introduces the CS 25 course 'Transformers United,' created and taught at Stanford in fall 2021, covering the groundbreaking Transformer deep learning models. The instructors, Advay, Chetanya, and Div, explain the context and timeline leading to the development of Transformers, their self-attention mechanism, and their applications beyond natural language processing. They delve into advantages such as efficient handling of long sequences and parallelization, as well as drawbacks like the quadratic computational complexity of self-attention. The video also explores notable Transformer-based models like GPT and BERT, highlighting their novel approaches and widespread impact across various fields.
Takeaways
- 👉 Transformers are a type of deep learning model that revolutionized fields like natural language processing, computer vision, and reinforcement learning.
- 🌟 The key idea behind Transformers is the attention mechanism, introduced in the 2017 paper "Attention is All You Need" by Vaswani et al.
- 🧠 Transformers can effectively encode long sequences and context, overcoming limitations of previous models like RNNs and LSTMs.
- 🔄 Transformers use an encoder-decoder architecture with self-attention layers, feed-forward layers, and residual connections.
- 📐 Multi-head self-attention allows Transformers to learn multiple representations of the input data.
- 🚀 Major advantages of Transformers include constant path length between positions and parallelization for faster training.
- ⚠️ A key disadvantage is the quadratic time complexity of self-attention, which has led to efforts like Big Bird and Reformer to make it more efficient.
- 🌐 GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers) are two influential Transformer-based models.
- 🎯 GPT excels at language modeling and in-context learning, while BERT uses masked language modeling and next sentence prediction for pretraining.
- 🌉 Transformers have enabled various applications beyond NLP, such as protein folding, reinforcement learning, and image generation.
Q & A
What is the focus of the CS 25 class at Stanford?
-The CS 25 class at Stanford focuses on Transformers, a particular type of deep learning model that has revolutionized multiple fields like natural language processing, computer vision, and reinforcement learning.
Who are the instructors for the CS 25 class?
-The instructors for the CS 25 class are Advay, a software engineer at Applied Intuition, Chetanya, an ML engineer at Moveworks, and Div, a PhD student at Stanford.
What are the three main goals of the CS 25 class?
-The three main goals of the CS 25 class are: 1) to understand how Transformers work, 2) to learn how Transformers are being applied beyond natural language processing, and 3) to spark new ideas and research directions.
What was the key idea behind Transformers, and when was it introduced?
-The key idea behind Transformers was the simple attention mechanism, which was developed in 2017 with the paper "Attention is All you Need" by Vaswani et al.
What are the advantages of Transformers over older models like RNNs and LSTMs?
-The main advantages of Transformers are: 1) There is a constant path length between any two positions in a sequence, solving the problem of long sequences, and 2) They lend themselves well to parallelization, making training faster.
What is the main disadvantage of Transformers?
-The main disadvantage of Transformers is that self-attention takes quadratic time (O(n^2)), which does not scale well for very long sequences.
What is the purpose of the Multi-head Self-Attention mechanism in Transformers?
-The Multi-head Self-Attention mechanism in Transformers enables the model to learn multiple representations of the input data, with each head potentially learning different semantics or aspects of the input.
What is the purpose of the positional embedding layer in Transformers?
-The positional embedding layer in Transformers introduces the notion of ordering, which is essential for understanding language since most languages are read in a particular order (e.g., left to right).
What is the purpose of the masking component in the decoder part of Transformers?
-The masking component in the decoder part of Transformers prevents the decoder from looking into the future, which could result in data leakage.
What are some examples of applications or models that have been built using Transformers?
-Some examples of applications or models that have been built using Transformers include GPT-3 (for language generation and in-context learning), BERT (for bidirectional language representation and various NLP tasks), and AlphaFold (for protein folding).
Outlines
🤖 Introduction to CS 25: Transformers United
This paragraph introduces CS 25, a class about deep learning models, particularly Transformers, taught at Stanford. The instructors Advay, Div, and Chetanya introduce themselves and outline their backgrounds. The goal of the series is to explain how Transformers work, their applications beyond natural language processing, and to inspire new research directions. Div starts with a historical overview of attention mechanisms, highlighting the shift from RNNs and LSTMs to the revolutionary concept of Transformers, exemplified by the influential paper 'Attention is All You Need' by Vaswani et al.
📚 Deep Dive into Transformers and Attention Mechanisms
In this section, the presenters delve into the intricacies of attention mechanisms, the cornerstone of Transformer models. They discuss the evolution from simple attention mechanisms to the sophisticated self-attention mechanisms that underpin Transformers. They introduce concepts like Multi-head Self-attention and its importance in enabling Transformers to capture complex interactions across different parts of the input. They also touch on historical attention models, contrasting them with the more advanced self-attention, and discuss the unique advantages that Transformers offer due to their architecture.
🔍 Exploring Self-Attention and Transformer Components
This paragraph focuses on explaining self-attention in depth, illustrating how it enables Transformers to analyze sequences by comparing each element with every other element. Chetanya provides a practical example of how self-attention works. The discussion then moves to the critical components of Transformers, including positional embeddings and non-linearities, which address the challenges of sequence order and complex mappings. The concept of masking and its role in preventing future information leakage in the model is also explained, setting the stage for discussing the encoder-decoder architecture.
🏗️ Understanding Encoder-Decoder Architecture and Transformer Advantages
This segment details the encoder-decoder architecture fundamental to Transformer models, illustrating how it facilitates language translation tasks. It discusses the composition of encoder and decoder blocks, highlighting their specific functions and the introduction of Multi-head Attention in decoders. The advantages of Transformers, such as their constant path length and parallelization capabilities, are underscored, alongside their computational challenges. The presenters recommend resources for further learning and hint at methods to address the quadratic complexity issue in self-attention.
🚀 Transformer Applications and Future Directions
The final paragraph showcases applications of Transformer models, emphasizing the GPT and BERT architectures and their unique contributions to the field. The versatility of GPT in generative tasks and BERT's novel training approach with Masked Language Modeling are highlighted. The discussion also covers the impact of these models on various applications and hints at future developments in Transformer technology, suggesting an evolving landscape with new models and techniques. The lecture concludes by expressing gratitude and anticipation for the upcoming content in the series.
Keywords
💡Transformers
💡Self-Attention
💡Encoder-Decoder Architecture
💡Multi-Head Attention
💡Positional Encoding
💡GPT (Generative Pre-trained Transformer)
💡BERT (Bidirectional Encoder Representations from Transformers)
💡Parallelization
💡Quadratic Time Complexity
💡Applications
Highlights
CS 25 is a class created and taught at Stanford in the fall of 2021, focusing on deep learning models called Transformers that have revolutionized multiple fields, starting from natural language processing to computer vision and reinforcement learning.
The class aims to provide an understanding of how Transformers work, how they are being applied beyond natural language processing, and to spark new ideas and research directions.
Transformers introduced a simple attention mechanism in 2017 that allowed encoding long sequences and capturing context better than previous models like RNNs and LSTMs.
Transformers have been applied to solve long sequence problems in protein folding, few-shot and zero-shot generalization, text and image generation, and content generation.
Future directions for Transformers include incorporating external memory units, improving computational complexity of attention mechanisms, and aligning language models with human values.
Self-attention, the key component of Transformers, can be framed as a search-and-retrieval problem: for a given query, the most similar key vectors are found, and their corresponding values are combined, weighted by those similarities.
Multi-head self-attention enables the model to learn multiple representations and capture different semantics.
Positional representations, nonlinearities, and masking are important ingredients that make Transformers powerful.
Transformers have an encoder-decoder architecture, with the encoder reading the input and the decoder generating the output token by token.
Advantages of Transformers include constant path length between any two positions in a sequence and parallelization capabilities, while the main disadvantage is quadratic time complexity of self-attention.
GPT by OpenAI consists of only the decoder blocks from Transformers and is trained on a traditional language modeling task, with the ability to perform in-context learning in few-shot settings.
BERT (Bidirectional Encoder Representations from Transformers) consists of only the encoder blocks and uses a novel Masked Language Modeling task to overcome the data leakage problem.
BERT also includes a Next Sentence Prediction task, and the model can be fine-tuned for downstream tasks with an additional classification layer.
The landscape of Transformer models has evolved since the class was taught, with newer models like ELECTRA and DeBERTa and with models for other modalities.
The speakers in the upcoming lectures will provide insights into how they are applying Transformers in their research.
Transcripts
Hey, everyone.
Welcome to the first and introductory lecture for CS 25,
Transformers United.
So CS 25 was a class that the three of us
created and taught at Stanford in the fall of 2021,
and the subject of the class is not
as the picture might suggest.
It's not about robots that can transform into cars.
It's about deep learning models and specifically
a particular kind of deep learning models
that have revolutionized multiple fields,
starting from natural language processing
to things like computer vision and reinforcement
learning to name a few.
We have an exciting set of videos lined up for you.
We had some truly fantastic speakers
come and give talks about how they were applying Transformers
in their own research.
And we hope you will enjoy and learn from these talks.
This video is purely an introductory lecture
to talk a little bit about transformers.
And before we get started, I'd like
to introduce the instructors.
So my name is Advay.
I am a software engineer at a company
called Applied Intuition.
Before this, I was a master's student in CS at Stanford.
And I am one of the co-instructors for CS25.
Chetanya, Div, if the two of you could introduce yourselves.
So hi, everyone.
I am a PhD student at Stanford.
Before this, I was pursuing a master's here,
researching a lot in generative modeling, reinforcement
learning, and robotics.
So nice to meet you all.
Yeah, that was Div, since he didn't say his name.
Chetanya, if you want to introduce yourself.
Yeah.
Hi, everyone.
My name is Chetanya, and I'm currently
working as an ML engineer at a start-up called Moveworks.
Before that, I was a master's student
at Stanford specializing in NLP and was
a member of the prize-winning Stanford's team
for the Alexa Prize Challenge.
All right, awesome.
So moving on to the rest of this talk,
essentially, what we hope you will learn
watching these videos and what we
hope the people who took our class in the fall of 2021
learned is three things.
One is we hope you will have an understanding of how
Transformers work.
Secondly, we hope you will learn and, by the end of these talks,
understand how Transformers are being applied beyond just
natural language processing.
And thirdly, we hope that some of these talks
will spark some new ideas within you
and hopefully lead to new directions of research,
new kinds of innovation, and things of that sort.
And to begin, we're going to talk a little bit
about Transformers and introduce some
of the context behind transformers as well.
And for that, I'd like to hand it off to Div.
So hi, everyone.
So welcome to our Transformer seminar.
So let's start first with an overview of the attention
timeline and how it came to be.
The key idea about Transformers was the simple attention
mechanism that was developed in 2017.
And this all started with this one paper
called "Attention is All you Need," by Vaswani et al.
Before 2017, we used to have this prehistoric era where
we had older models like RNNs, LSTMs, and simpler attention
mechanisms.
And eventually, the growth in Transformers
has exploded into other fields and has become prominent
in all of machine learning.
And I'll go on to show how this has been used.
So in the prehistoric era, there used to be RNNs.
There were different models like the Sequence2Sequence, LSTMs,
GRUs.
They were good at encoding some sort of memory,
but they did not work for encoding long sequences.
And they were very bad at encoding context.
So here is an example.
If you have a sentence like "I grew up in France ...
so I speak fluent ___," then
you want to fill this in with "French" based on the context,
but an LSTM model might not know what it is and might just
make a very big mistake here.
Similarly, we can show some sort of correlation map
here, where if you have a pronoun like it,
we want it to correlate to one of the past nouns
that we have seen so far, like animal.
But again, older models were really not good
at this context encoding.
So where we are currently now is on the verge of take-off.
We began to realize the potential of Transformers
in different fields.
We have started to use them to solve long sequence
problems in protein folding, such as the AlphaFold
model from DeepMind, which gets 95% accuracy
on different challenges, and offline RL.
We can use it for few-shot and zero-shot generalization,
for text and image generation.
And we can also use for content generation.
So here's an example from OpenAI,
where you can give a different text prompt
and have an AI generate a fictional image for you.
And so there's a talk on this that you can also
watch on YouTube, which basically
says that LSTMs are dead, and long live Transformers.
So what's the future?
So we can enable a lot more applications for Transformers.
They can be applied to any form of sequence modeling.
So we could use them for video understanding.
We can use them for finance and a lot more.
So basically, imagine all sorts of generative modeling
problems.
Nevertheless, there are a lot of missing ingredients.
So like the human brain, we need some sort
of external memory unit, which is the hippocampus for us.
And there are some early works here.
So one nice work you might want to check out
is called Neural Turing Machines.
Similarly, the current attention mechanisms
are very computationally complex in terms
of time, which we will discuss later.
And we want to make them more linear.
And the third problem is that we want
to align our current sort of language models
with how the human brain works and human values.
And this is also a big issue.
OK.
So now I will deep dive--
I will dive deeper into the attention mechanisms
and show how they came out to be.
So initially, they used to be very simple mechanisms.
Attention was inspired by the process of importance
weighting, of putting attention on different parts of an image
similar to a human, where you might focus more
on the foreground if you have an image of a dog, compared
to the rest of the background.
So in the case of soft attention, what you do is
you learn the simple soft attention weighting
for each pixel, which can be a weight between 0 and 1.
The problem over here is that this is
a very expensive computation.
And you can show--
as is shown in the figure on the left,
you can see we are calculating this attention
map for the whole image.
What you can do instead is you can just get a hard, 0-or-1
attention map, where we directly put a 1 wherever the dog is
and a 0 wherever it's background.
This is like less computationally expensive,
but the problem is it's non-differentiable and makes
things harder to train.
Going forward, we also have different varieties
of basic attention mechanisms that were
proposed before self-attention.
So the first one right here is global attention models.
In a global attention model, for each hidden-layer output,
you learn an attention weight a_t(p),
and this is elementwise multiplied
with your current output to get your final output, y_t.
Similarly, you have the local attention models,
where instead of calculating the attention
over the whole sequence length,
you only calculate the attention over a small window.
And then you weight the window by its attention and combine it
with the current output to get the final output you need.
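To make the global-versus-local distinction concrete, here is a rough NumPy sketch of the idea; the hidden states, sizes, and window position below are made-up stand-ins for illustration, not values from the lecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
encoder_states = rng.standard_normal((5, 8))   # hidden-layer outputs h_1 ... h_5
decoder_state = rng.standard_normal(8)         # current output h_t

# Global attention: score every position against the current output,
# softmax the scores into weights a_t(p), and take a weighted sum.
scores = encoder_states @ decoder_state
weights = softmax(scores)                      # a_t(p), one weight per position
context = weights @ encoder_states             # combined with h_t to form y_t

# Local attention: same idea, but only over a small window of positions.
window = encoder_states[1:4]
local_weights = softmax(window @ decoder_state)
local_context = local_weights @ window
```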
So moving on, I will pass on to Chetanya
to discuss self-attention mechanisms and Transformers.
Thank you, Div, for covering a brief overview of how
the primitive versions of attention work.
Now just before we talk about self-attention, just a bit
of trivia, that term was first introduced by a paper
from Lin et al., which provided a framework
for a self-attentive mechanism for sentence embeddings.
And now moving on to the main crux of the Transformers
paper, which was the self-attention block.
So self-attention is the basis, the main building
block, of what makes Transformer models work so well
and what makes them so powerful.
So to think of it more easily, we
can break down the self-attention
as a search retrieval problem.
So the problem is that, given a query q,
we need to find the set of keys, k,
which are most similar to q and return
the corresponding values, v. Now
these three vectors can be drawn from the same source.
For example, we can have that q, k, and v
are all equal to a single vector x, where x can
be the output of a previous layer.
In Transformers, these vectors are
obtained by applying different linear transformations to x,
so as to enable the model to capture
more complex interactions between the different tokens
at different places in the sentence.
Now, attention is computed as a weighted summation
of the values, where the weights are the similarities
between the query and key vectors.
And in the Transformers paper, they
used the scaled dot-product as a similarity function
for the queries and keys.
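As a minimal sketch of the query/key/value computation just described, assuming a toy three-token input and random stand-ins for the learned projection matrices:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model, d_k = 8, 8
x = rng.standard_normal((3, d_model))          # three input tokens

# q, k, v are different linear transformations of the same input x.
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Scaled dot-product attention: compare every query with every key,
# softmax the similarities, and use them to weight the values.
scores = Q @ K.T / np.sqrt(d_k)                # (3, 3) similarity matrix
weights = softmax(scores)                      # each row sums to 1
output = weights @ V                           # one output vector per token
```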
And another important aspect of the Transformers
was the introduction of Multi-head self-attention.
So what Multi-head Self-attention means
is that, at every layer,
the self-attention is performed multiple times in parallel,
which enables the model to learn multiple representation
subspaces.
So in a way, you can think of it that each head has a power
to look at different things and to learn different semantics.
For example, one head can be learning
to try to predict what is the part of speech
for those tokens.
One head might be learning what the syntactic structure
of the sentence is, and so on, all the things
that are there to understand what the
sentence means.
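A rough sketch of the multi-head idea under the same toy assumptions: each head runs its own scaled dot-product attention with its own projections, and the per-head outputs are concatenated and projected back to the model dimension.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, W_q, W_k, W_v):
    # One head of scaled dot-product self-attention over x.
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(0)
n_heads, d_model = 4, 16
d_head = d_model // n_heads
x = rng.standard_normal((3, d_model))          # three tokens

# Each head gets its own (random, untrained) projections, so each head
# can attend to and represent different aspects of the input.
heads = [
    attention(x,
              rng.standard_normal((d_model, d_head)),
              rng.standard_normal((d_model, d_head)),
              rng.standard_normal((d_model, d_head)))
    for _ in range(n_heads)
]

# Concatenate the per-head outputs and mix them with one final projection.
W_o = rng.standard_normal((n_heads * d_head, d_model))
multi_head_output = np.concatenate(heads, axis=-1) @ W_o
```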
Now, to better understand how self-attention works
and what the different computations are,
there is a short video.
So in this-- so as you can see, there
are three incoming tokens: input 1, input 2, input 3.
We apply linear transformations to get the key and value vectors
for each input, and then, once a query q comes,
we calculate its similarity with the respective key vectors
and then multiply those scores with the value vectors.
And then add them all up to get the output.
The same computation is then performed on all the tokens.
And we get the output of the self-attention layer.
So as you can see here, the final output
of the self-attention layer is in dark green
at the top of the screen.
So now again, for the final token,
we perform everything the same way: queries multiplied by keys.
We get the similarity scores.
And then those similarity scores weigh the value vectors.
And then we finally perform the addition
to get the self-attention output of the Transformers block.
Apart from self-attention, there are
some other necessary ingredients that make
Transformers so powerful.
One important aspect is the presence
of positional representations, or the positional embedding layer.
So one reason RNNs worked very well was
that they process the information
in a sequential order.
So there was this notion of ordering, right,
which is also very important in understanding language
because we all know that we read any piece of text
from left to right in most of the languages
and also right to left in some languages.
So there is a notion of ordering which
is lost in self-attention
because every word is attending to every other word.
That's why this paper introduced a separate embedding
layer for introducing positional representations.
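The Vaswani et al. paper uses fixed sinusoidal encodings for this positional layer (learned position embeddings are a common alternative); a small sketch:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # Fixed sinusoidal positional encodings in the style of the original paper.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000, 2 * i / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)   # even dimensions
    enc[:, 1::2] = np.cos(angles)   # odd dimensions
    return enc

# These are added elementwise to the token embeddings, so the same word
# at different positions gets a different representation.
positions = sinusoidal_positions(seq_len=10, d_model=8)
```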
The second important aspect is having nonlinearities.
So if you think of all the computation that
is happening in the self-attention layer,
it's all linear because it's all matrix multiplication.
But as we all know, deep learning models
work well when they are able to learn more complex mappings between input
and output, which can be attained with a simple MLP.
And the third important component
of the Transformers is the masking.
So masking is what allows us to parallelize the operations.
Since every word can attend to every other word,
in the decoder part of the Transformer,
which Advay is going to be talking about later,
the problem becomes that you don't want the decoder
to look into the future, because that
can result in data leakage.
So that's why masking helps the decoder
avoid that future information and learn only from
what the model has processed so far.
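A minimal sketch of the causal mask being described, assuming scores stands in for the raw query-key score matrix; during training this lets all positions be processed in parallel while each position still only sees the past.

```python
import numpy as np

# Causal (look-ahead) mask for a 4-token sequence: position i may only
# attend to positions <= i. Masked entries are set to -inf before the
# softmax, so their attention weights come out as zero.
seq_len = 4
rng = np.random.default_rng(0)
scores = rng.standard_normal((seq_len, seq_len))            # raw query-key scores
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
masked_scores = np.where(future, -np.inf, scores)
```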
So now onto the encoder-decoder architecture
of the Transformers.
Advay.
Yeah.
Thanks, Chetanya, for talking about self-attention.
So self-attention is sort of the key ingredient or one
of the key ingredients that allows Transformers
to work so well, but at a very high level,
the model that was proposed in the Vaswani et al.
paper of 2017 was like previous language models in the sense
that it had an encoder-decoder architecture.
What that means is--
let's say you're working on a translation problem.
You want to translate English to French.
The way that would work is you would
read in the entire input of your English sentence.
You would encode that input.
So that's the encoder part of the network.
And then you would generate token
by token the corresponding French translation.
And the decoder is the part of the network
that is responsible for generating those tokens.
So you can think of these encoder blocks and decoder
blocks as essentially something like LEGO.
They have these subcomponents that make them up.
And in particular, the encoder block
has three main subcomponents.
The first is the self-attention layer
that Chetanya talked about earlier.
And, as talked about earlier as well,
you need a feed-forward layer after that
because the self-attention layer only
performs linear operations.
And so you need something that can capture the nonlinearities.
You also have a layer norm after this.
And lastly, there are residual connections
around each of these sublayers.
The decoder is very similar to the encoder,
but there's one difference, which
is that it has this extra layer, because the decoder doesn't
just do Multi-head Attention on the output
of the previous layers.
So for context, the encoder does Multi-head Attention
for each self-attention layer in the encoder block.
And each of the encoder blocks does Multi-head Attention
looking at the previous layers of the encoder blocks.
The decoder, however, does that in the sense
that it also looks at the previous layers of the decoder,
but it also looks at the output of the encoder.
And so it needs a Multi-head Attention layer
over the encoder's output.
And lastly, there's masking as well.
So if you are-- because every token can
look at every other token, you want
to sort of make sure in the decoder
that you're not looking into the future.
So if you're in position 3, for instance,
you shouldn't be able to look at position 4 and position 5.
So those are sort of all the components
that led to the creation of the model in the Vaswani et al.
paper.
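Putting those pieces together, here is a minimal single-head, forward-pass-only sketch of one encoder block with untrained random weights (dropout and multi-head concatenation omitted); a decoder block would additionally apply a causal mask to its self-attention and add a cross-attention sublayer over the encoder output.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_block(x, W_q, W_k, W_v, W_1, W_2):
    # Sublayer 1: (single-head) self-attention, wrapped in a residual
    # connection and layer norm.
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    x = layer_norm(x + attn)
    # Sublayer 2: position-wise feed-forward network (the nonlinearity),
    # again wrapped in a residual connection and layer norm.
    ff = np.maximum(x @ W_1, 0) @ W_2
    return layer_norm(x + ff)

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((3, d))                 # three tokens
weights = [rng.standard_normal((d, d)) for _ in range(5)]
out = encoder_block(x, *weights)                # same shape as the input
```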
And let's talk a little bit about the advantages
and drawbacks of this model.
So the two main advantages which are huge advantages and which
are why Transformers have done such a good job
of revolutionizing many, many fields within deep learning
are as follows.
So the first is there is this constant path
length between any two positions in a sequence
because every token in the sequence
is looking at every other token.
And this basically solves the problem
that they've talked about earlier with long sequences.
You don't have this problem with long sequences,
where, if you're trying to predict a token
that depends on a word that was far, far back in the sentence,
you lose that context.
Now the distance between them is only one
in terms of the path length.
Also, because of the nature of the computation that's
happening, Transformer models lend themselves really
well to parallelization and because
of the advances that we've had with GPUs.
Basically, if you take a Transformer model
with n parameters and you take a model that isn't a Transformer,
say, like an LSTM also with n parameters,
training the Transformer model is
going to be much faster because of the parallelization
that it leverages.
So those are the advantages.
The disadvantages are basically self-attention takes
quadratic time because every token looks
at every other token.
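A back-of-the-envelope illustration of that quadratic growth:

```python
# Every token attends to every other token, so an n-token sequence needs
# an n x n score matrix per head, per layer.
for n in (512, 2048, 8192):
    print(f"{n} tokens -> {n * n:,} attention scores per head per layer")
```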
Order n squared as you might know does not scale,
and there's actually been a lot of work
in trying to tackle this.
So we've linked to some here.
Big Bird, Linformer, and Reformer
are all approaches to try and make this linear or quasilinear
essentially.
And yeah, we highly recommend
going through Jay Alammar's blog,
"The Illustrated Transformer," which
provides great visualizations and explains everything
that we just talked about in great detail.
Yeah.
And I'd like to pass it on to Chetanya for applications
of Transformers.
Yeah.
So now moving on to some of the recent work--
some of the work that very shortly followed
the Transformers paper.
So one of the models that came out
was GPT, the GPT architecture, which was released by OpenAI.
The latest model that OpenAI has in the GPT
series is GPT-3.
So it consists of only the decoder
blocks from Transformers.
And it's trained on a traditional language modeling
task, which is predicting the next token given the last t tokens
that the model has seen.
And for any downstream tasks, now the model
can just-- you can just train a classification layer
on the last hidden state, which can have any number of labels.
And since the model is generative in nature,
you can also use the pretrained network
for generative kind of tasks, such as summarization
and natural language generation for that instance.
Another important aspect for which GPT gained popularity
was its ability to perform
what the authors called in-context learning.
So this is the ability wherein the model can
figure out, in a few-shot setting, what
the task is and complete the task without performing
any gradient updates.
For example, let's say the model
is shown a bunch of addition examples.
And then if you pass in a new input
and just leave it at the equals sign,
the model tries to predict the next token, which
very well comes out to be the sum of the numbers that
were shown.
Another example can be also the spell correction
task or the translation task.
So this was the ability that made GPT-3 so much talked about
in the NLP world.
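To make the in-context learning setup concrete, here is a hypothetical few-shot addition prompt in the style described above; the numbers are invented for illustration and no model is actually called.

```python
# The worked examples define the task; the model is asked to continue the
# pattern with no gradient updates.
prompt = (
    "12 + 9 = 21\n"
    "7 + 5 = 12\n"
    "23 + 41 ="
)
# A capable language model would be expected to continue with " 64".
```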
And right now also, many applications
have been made using GPT-3, one of them being
the VS Code Copilot, which tries to generate a piece of code
given a docstring or similar natural language text.
Another major model that came out
that was based on the Transformers' architecture
was BERT.
So BERT lends its name from-- it's
an acronym for Bidirectional Encoder Representations
from Transformers.
It consists of only the encoder blocks of the Transformers,
which is unlike GPT-3, which had only the decoder blocks.
Because of this change, there comes a problem
because BERT has only the encoder block.
So it sees the entire piece of text.
It cannot be pretrained on a naive language modeling task
because of the problem of data leakage from the future.
So what the authors came up with was a clever idea.
And they came up with a novel task called Masked Language
Modeling, which involves replacing certain words
with a placeholder.
And then the model tries to predict those words given
the entire context.
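An illustrative, made-up masked language modeling example in the style described here:

```python
# A word is replaced by a [MASK] placeholder, and the model must predict it
# from the full bidirectional context.
original = "the cat sat on the mat"
masked = "the cat [MASK] on the mat"   # training target for the masked slot: "sat"
```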
Now, apart from this token-level task,
the authors also added a second objective
called the Next Sentence Prediction, which
was a sentence-level task, wherein
given two chunks of text, the model tried to predict
whether the second sentence followed the first sentence or not.
And now, after pretraining this model, for any downstream task
the model can be further fine-tuned
with an additional classification layer
just like it was in GPT-3.
So these are the two models that have been very popular
and have made their way
into a lot of applications.
But the landscape has changed quite a lot
since we taught this class.
There are models with different training techniques,
like ELECTRA and DeBERTa.
And there are also models that do
well in other modalities, which
we are going to be talking about in other lectures in this series
as well.
So yeah, that's all from this lecture.
And thank you for tuning in.
Yeah.
Just want to end by saying, thank you
all for watching this.
And we have a really exciting set
of videos with truly amazing speakers.
And we hope you are able to derive value from that.
Sure.
Thanks a lot.
Thank you.
Thank you, everyone.