Stanford CS25: V1 | Transformers United: DL Models that have revolutionized NLP, CV, RL

Stanford Online
8 Jul 2022 · 22:43

Summary

TL;DR: This video introduces the CS 25 course 'Transformers United,' taught at Stanford in the fall of 2021, covering the groundbreaking Transformer deep learning models. The instructors, Advay, Chetanya, and Div, explain the context and timeline leading to the development of Transformers, their self-attention mechanism, and their applications beyond natural language processing. They delve into the advantages, such as efficient handling of long sequences and parallelization, as well as drawbacks like computational complexity. The video also explores notable Transformer-based models like GPT and BERT, highlighting their novel approaches and widespread impact across various fields.

Takeaways

  • πŸ‘‰ Transformers are a type of deep learning model that revolutionized fields like natural language processing, computer vision, and reinforcement learning.
  • 🌟 The key idea behind Transformers is the attention mechanism, introduced in the 2017 paper "Attention is All You Need" by Vaswani et al.
  • 🧠 Transformers can effectively encode long sequences and context, overcoming limitations of previous models like RNNs and LSTMs.
  • πŸ”„ Transformers use an encoder-decoder architecture with self-attention layers, feed-forward layers, and residual connections.
  • πŸ“ Multi-head self-attention allows Transformers to learn multiple representations of the input data.
  • πŸš€ Major advantages of Transformers include constant path length between positions and parallelization for faster training.
  • ⚠️ A key disadvantage is the quadratic time complexity of self-attention, which has led to efforts like Big Bird and Reformer to make it more efficient.
  • 🌐 GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers) are two influential Transformer-based models.
  • 🎯 GPT excels at language modeling and in-context learning, while BERT uses masked language modeling and next sentence prediction for pretraining.
  • πŸŒ‰ Transformers have enabled various applications beyond NLP, such as protein folding, reinforcement learning, and image generation.

Q & A

  • What is the focus of the CS 25 class at Stanford?

    -The CS 25 class at Stanford focuses on Transformers, a particular type of deep learning model that has revolutionized multiple fields like natural language processing, computer vision, and reinforcement learning.

  • Who are the instructors for the CS 25 class?

    -The instructors for the CS 25 class are Advay, a software engineer at Applied Intuition, Chetanya, an ML engineer at Moveworks, and Div, a PhD student at Stanford.

  • What are the three main goals of the CS 25 class?

    -The three main goals of the CS 25 class are: 1) to understand how Transformers work, 2) to learn how Transformers are being applied beyond natural language processing, and 3) to spark new ideas and research directions.

  • What was the key idea behind Transformers, and when was it introduced?

    -The key idea behind Transformers was the simple attention mechanism, which was developed in 2017 with the paper "Attention is All You Need" by Vaswani et al.

  • What are the advantages of Transformers over older models like RNNs and LSTMs?

    -The main advantages of Transformers are: 1) There is a constant path length between any two positions in a sequence, solving the problem of long sequences, and 2) They lend themselves well to parallelization, making training faster.

  • What is the main disadvantage of Transformers?

    -The main disadvantage of Transformers is that self-attention takes quadratic time (O(n^2)), which does not scale well for very long sequences.

  • What is the purpose of the Multi-head Self-Attention mechanism in Transformers?

    -The Multi-head Self-Attention mechanism in Transformers enables the model to learn multiple representations of the input data, with each head potentially learning different semantics or aspects of the input.

  • What is the purpose of the positional embedding layer in Transformers?

    -The positional embedding layer in Transformers introduces the notion of ordering, which is essential for understanding language since most languages are read in a particular order (e.g., left to right).

  • What is the purpose of the masking component in the decoder part of Transformers?

    -The masking component in the decoder part of Transformers prevents the decoder from looking into the future, which could result in data leakage.

  • What are some examples of applications or models that have been built using Transformers?

    -Some examples of applications or models that have been built using Transformers include GPT-3 (for language generation and in-context learning), BERT (for bidirectional language representation and various NLP tasks), and AlphaFold (for protein folding).

Outlines

00:00

πŸ€– Introduction to CS 25: Transformers United

This paragraph introduces CS 25, a class about deep learning models, particularly Transformers, taught at Stanford. The instructors Advay, Div, and Chetanya introduce themselves and outline their backgrounds. The goal of the series is to explain how Transformers work, their applications beyond natural language processing, and to inspire new research directions. Div starts with a historical overview of attention mechanisms, highlighting the shift from RNNs and LSTMs to the revolutionary concept of Transformers, exemplified by the influential paper 'Attention is All You Need' by Vaswani et al.

05:03

πŸ“š Deep Dive into Transformers and Attention Mechanisms

In this section, the presenters delve into the intricacies of attention mechanisms, the cornerstone of Transformer models. They discuss the evolution from simple attention mechanisms to the sophisticated self-attention mechanisms that underpin Transformers. They introduce concepts like Multi-head Self-attention and its importance in enabling Transformers to capture complex interactions across different parts of the input. They also touch on historical attention models, contrasting them with the more advanced self-attention, and discuss the unique advantages that Transformers offer due to their architecture.

10:05

πŸ” Exploring Self-Attention and Transformer Components

This paragraph focuses on explaining self-attention in depth, illustrating how it enables Transformers to analyze sequences by comparing each element with every other element. Chetanya provides a practical example of how self-attention works. The discussion then moves to the critical components of Transformers, including positional embeddings and non-linearities, which address the challenges of sequence order and complex mappings. The concept of masking and its role in preventing future information leakage in the model is also explained, setting the stage for discussing the encoder-decoder architecture.

15:07

πŸ—οΈ Understanding Encoder-Decoder Architecture and Transformer Advantages

This segment details the encoder-decoder architecture fundamental to Transformer models, illustrating how it facilitates language translation tasks. It discusses the composition of encoder and decoder blocks, highlighting their specific functions and the introduction of Multi-head Attention in decoders. The advantages of Transformers, such as their constant path length and parallelization capabilities, are underscored, alongside their computational challenges. The presenters recommend resources for further learning and hint at methods to address the quadratic complexity issue in self-attention.

20:07

πŸš€ Transformer Applications and Future Directions

The final paragraph showcases applications of Transformer models, emphasizing the GPT and BERT architectures and their unique contributions to the field. The versatility of GPT in generative tasks and BERT's novel training approach with Masked Language Modeling are highlighted. The discussion also covers the impact of these models on various applications and hints at future developments in Transformer technology, suggesting an evolving landscape with new models and techniques. The lecture concludes by expressing gratitude and anticipation for the upcoming content in the series.

Keywords

πŸ’‘Transformers

Transformers are a type of deep learning model that have revolutionized fields like natural language processing, computer vision, and reinforcement learning. They utilize a self-attention mechanism that allows every token in a sequence to attend to every other token, solving the problem of long-range dependencies. The video introduces Transformers as the central topic and highlights their impact across various domains.

πŸ’‘Self-Attention

Self-attention is the core component of Transformer models. It is a mechanism that allows each token in a sequence to attend to every other token, capturing long-range dependencies and contextual information. The video explains self-attention as a search retrieval problem: each query is compared against the keys, and the resulting similarity scores weight the corresponding values. This enables the model to learn complex interactions between tokens.
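
To make the search-retrieval picture concrete, here is a minimal NumPy sketch of scaled dot-product self-attention; the sequence length, dimensions, and random projection matrices are illustrative assumptions, not values from the video.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X of shape (n, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # queries, keys, values from the same source
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # similarity of every query with every key
    weights = softmax(scores, axis=-1)         # each row is a distribution over the sequence
    return weights @ V                         # weighted sum of value vectors

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8                     # 5 tokens; sizes chosen for illustration
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (5, 8): one output vector per token
```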

πŸ’‘Encoder-Decoder Architecture

The Transformer model proposed in the Vaswani et al. paper follows an encoder-decoder architecture, similar to previous language models. The encoder encodes the input sequence, while the decoder generates the output sequence token by token. The video breaks down the components of the encoder and decoder blocks, including self-attention layers, feed-forward layers, layer normalization, and residual connections.

πŸ’‘Multi-Head Attention

Multi-head attention is a technique introduced in the Transformer paper, where the self-attention mechanism is performed multiple times in parallel, each with a different representation subspace. This allows the model to capture different aspects or semantics of the input, such as part-of-speech tagging, syntactic structure, and more. The video explains how Multi-head Attention enables the model to learn multiple representations simultaneously.
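
A rough sketch of the multi-head variant, assuming (for illustration) four heads and a small model dimension; each head applies its own projections and the head outputs are concatenated (the final output projection from the paper is noted but omitted to keep the sketch short).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

def multi_head_attention(X, heads):
    """Each head has its own (W_q, W_k, W_v); head outputs are concatenated.
    The original paper then applies one more linear projection W_o (omitted here)."""
    outputs = [attention(X @ W_q, X @ W_k, X @ W_v) for W_q, W_k, W_v in heads]
    return np.concatenate(outputs, axis=-1)    # shape (n, num_heads * d_head)

rng = np.random.default_rng(0)
n, d_model, d_head, num_heads = 5, 16, 4, 4
X = rng.normal(size=(n, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(num_heads)]
print(multi_head_attention(X, heads).shape)    # (5, 16)
```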

πŸ’‘Positional Encoding

Since the self-attention mechanism treats all tokens equally, positional encoding is introduced to incorporate the notion of order or position within the sequence. This is important for tasks like language modeling, where the order of words is crucial for understanding meaning. The video highlights positional encoding as a necessary component in Transformer models to preserve sequential information.
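
As a concrete instance, here is a minimal NumPy sketch of the sinusoidal positional encoding from the original paper, added to illustrative token embeddings; the sizes are arbitrary.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...), as in Vaswani et al.
    Assumes d_model is even."""
    positions = np.arange(n_positions)[:, None]            # (n, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # (1, d/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

token_embeddings = np.random.randn(10, 64)                  # 10 tokens, illustrative size
inputs = token_embeddings + sinusoidal_positional_encoding(10, 64)  # order info injected
print(inputs.shape)                                          # (10, 64)
```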

πŸ’‘GPT (Generative Pre-trained Transformer)

GPT is a Transformer-based language model introduced by OpenAI. It consists only of the decoder blocks from the original Transformer architecture and is trained on a traditional language modeling task. The video discusses GPT's ability to perform in-context learning, where it can adapt to new tasks with few examples without any gradient updates. Examples include arithmetic, spell correction, and code generation tasks.

πŸ’‘BERT (Bidirectional Encoder Representations from Transformers)

BERT is another influential Transformer-based model, introduced by Google. Unlike GPT, BERT consists only of the encoder blocks and is pre-trained using a novel Masked Language Modeling task, where certain words are replaced with placeholders, and the model predicts the masked words based on the context. The video explains BERT's pretraining objectives and its application to various downstream tasks through fine-tuning.
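
A toy sketch of the masked language modeling setup described above; the whitespace tokenization and the helper function are illustrative, and the 15% masking rate follows the BERT paper.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=1):
    """Replace a random subset of tokens with [MASK]; the originals become prediction targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = tok   # the model predicts this from the surrounding (bidirectional) context
        else:
            masked.append(tok)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens)
print(masked)    # the sequence with some tokens replaced by [MASK]
print(targets)   # which positions were masked and what the model must predict there
```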

πŸ’‘Parallelization

One of the key advantages of Transformer models, as mentioned in the video, is their ability to leverage parallelization due to the nature of their computations. This allows Transformer models with the same number of parameters as other models (e.g., LSTMs) to train much faster, thanks to the advancements in GPU technology. The video highlights parallelization as a significant benefit of Transformers.

πŸ’‘Quadratic Time Complexity

While Transformers offer several advantages, the video acknowledges their quadratic time complexity as a major drawback. Since every token attends to every other token, the self-attention mechanism scales quadratically with the sequence length. The video mentions ongoing research efforts, such as Big Bird, Linformer, and Reformer, aimed at addressing this issue by making the self-attention computation linear or quasi-linear.

πŸ’‘Applications

The video emphasizes the diverse applications of Transformers beyond natural language processing, including protein folding (e.g., AlphaFold), few-shot and zero-shot generalization, text and image generation, video understanding, and finance. It highlights the potential of Transformers for various sequence modeling problems and encourages exploration of new research directions and innovations based on the presented talks.

Highlights

CS 25 is a class created and taught at Stanford in the fall of 2021, focusing on deep learning models called Transformers that have revolutionized multiple fields, starting from natural language processing to computer vision and reinforcement learning.

The class aims to provide an understanding of how Transformers work, how they are being applied beyond natural language processing, and to spark new ideas and research directions.

Transformers introduced a simple attention mechanism in 2017 that allowed encoding long sequences and capturing context better than previous models like RNNs and LSTMs.

Transformers have been applied to solve long sequence problems in protein folding, few-shot and zero-shot generalization, text and image generation, and content generation.

Future directions for Transformers include incorporating external memory units, improving computational complexity of attention mechanisms, and aligning language models with human values.

Self-attention, the key component of Transformers, can be framed as a search retrieval problem: the key vectors most similar to a given query are found, and the resulting similarity scores weight the corresponding value vectors.

Multi-head self-attention enables the model to learn multiple representations and capture different semantics.

Positional representations, nonlinearities, and masking are important ingredients that make Transformers powerful.

Transformers have an encoder-decoder architecture, with the encoder reading the input and the decoder generating the output token by token.

Advantages of Transformers include constant path length between any two positions in a sequence and parallelization capabilities, while the main disadvantage is quadratic time complexity of self-attention.

GPT by OpenAI consists of only the decoder blocks from Transformers and is trained on traditional language modeling tasks, with the ability to perform in-context learning and few-shot settings.

BERT (Bidirectional Encoder Representations from Transformers) consists of only the encoder blocks and uses a novel Masked Language Modeling task to overcome the data leakage problem.

BERT also includes a Next Sentence Prediction task, and the model can be fine-tuned for downstream tasks with an additional classification layer.

The landscape of Transformer models has evolved since the class was taught, with different computing techniques and models for other modalities.

The speakers in the upcoming lectures will provide insights into how they are applying Transformers in their research.

Transcripts

00:05

Hey, everyone. Welcome to the first and introductory lecture for CS 25, Transformers United. So CS 25 was a class that the three of us created and taught at Stanford in the fall of 2021, and the subject of the class is not, as the picture might suggest, robots that can transform into cars. It's about deep learning models, and specifically a particular kind of deep learning model that has revolutionized multiple fields, starting from natural language processing to things like computer vision and reinforcement learning, to name a few. We have an exciting set of videos lined up for you. We had some truly fantastic speakers come and give talks about how they were applying Transformers in their own research, and we hope you will enjoy and learn from these talks. This video is purely an introductory lecture to talk a little bit about Transformers.

01:01

And before we get started, I'd like to introduce the instructors. So my name is Advay. I am a software engineer at a company called Applied Intuition. Before this, I was a master's student in CS at Stanford, and I am one of the co-instructors for CS25. Chetanya, Div, if the two of you could introduce yourselves.

01:22

So hi, everyone. I am a PhD student at Stanford. Before this, I was pursuing a master's here, researching a lot in generative modeling, reinforcement learning, and robotics. So nice to meet you all. (Yeah, that was Div, since he didn't say his name.) Chetanya, if you want to introduce yourself.

01:40

Yeah, hi, everyone. My name is Chetanya, and I'm currently working as an ML engineer at a start-up called Moveworks. Before that, I was a master's student at Stanford specializing in NLP and was a member of Stanford's prize-winning team for the Alexa Prize Challenge.

01:58

All right, awesome. So moving on to the rest of this talk: essentially, what we hope you will learn watching these videos, and what we hope the people who took our class in the fall of 2021 learned, is three things. One, we hope you will have an understanding of how Transformers work. Secondly, we hope you will learn, by the end of these talks, how Transformers are being applied beyond just natural language processing. And thirdly, we hope that some of these talks will spark some new ideas within you and hopefully lead to new directions of research, new kinds of innovation, and things of that sort.

02:44

And to begin, we're going to talk a little bit about Transformers and introduce some of the context behind Transformers as well. And for that, I'd like to hand it off to Div.

03:00

So hi, everyone. Welcome to our Transformer seminar. Let's start first with an overview of the attention timeline and how it came to be. The key idea behind Transformers was the simple attention mechanism that was developed in 2017, and this all started with one paper called "Attention is All You Need," by Vaswani et al. Before 2017, we used to have this prehistoric era where we had older models like RNNs, LSTMs, and simpler attention mechanisms. And eventually, the growth in Transformers has exploded into other fields and has become prominent in all of machine learning. And I'll go on to show how this has been used.

03:39

So in the prehistoric era, there used to be RNNs. There were different models like Sequence2Sequence, LSTMs, GRUs. They were good at encoding some sort of memory, but they did not work for encoding long sequences, and they were very bad at encoding context. So here is an example. If you have a sentence like "I grew up in France... so I speak fluent ___," then you want to fill this in with "French" based on the context, but an LSTM model might not know what it is and might just make a very big mistake here. Similarly, we can show some sort of correlation map here, where if you have a pronoun like "it," we want it to correlate to one of the past nouns that we have seen so far, like "animal." But again, older models were really not good at this context encoding.

04:25

So where we are currently now is on the verge of take-off. We began to realize the potential of Transformers in different fields. We have started to use them to solve long sequence problems in protein folding, such as the AlphaFold model from DeepMind, which gets 95% accuracy on different challenges, and in offline RL. We can use them for few-shot and zero-shot generalization, for text and image generation, and we can also use them for content generation. So here's an example from OpenAI, where you can give a different text prompt and have an AI generate a fictional image for you. And there's a video on this that you can also watch on YouTube, which basically says that LSTMs are dead, and long live Transformers.

05:11

So what's the future? We can enable a lot more applications for Transformers. They can be applied to any form of sequence modeling: we could use them for video understanding, we can use them for finance, and a lot more. So basically, imagine all sorts of generative modeling problems. Nevertheless, there are a lot of missing ingredients. Like the human brain, we need some sort of external memory unit, which is the hippocampus for us, and there are some early works here; one nice work you might want to check out is called Neural Turing Machines. Similarly, the current attention mechanisms are very computationally complex in terms of time, which we will discuss later, and we want to make them more linear. And the third problem is that we want to align our current language models with how the human brain works and with human values, and this is also a big issue.

06:03

OK. So now I will dive deeper into the attention mechanisms and show how they came to be. Initially, they used to be very simple mechanisms. Attention was inspired by the process of importance weighting: putting attention on different parts of an image, similar to a human, where you might focus more on the foreground if you have an image of a dog, compared to the rest of the background. So in the case of soft attention, what you do is learn a simple soft attention weighting for each pixel, which can be a weight between 0 and 1. The problem here is that this is a very expensive computation; as is shown in the figure on the left, we are calculating this attention map for the whole image. What you can do instead is just get a 0-or-1 attention map, where we directly put a 1 wherever the dog is and a 0 wherever it's background. This is less computationally expensive, but the problem is it's non-differentiable and makes things harder to train.

07:07

Going forward, we also have different varieties of basic attention mechanisms that were proposed before self-attention. The first one here is global attention models. In a global attention model, for each hidden layer output you learn an attention weight, a_t, and this is elementwise multiplied with your current output to get your final output, y_t. Similarly, you have local attention models, where instead of calculating the global attention over the whole sequence length, you only calculate the attention over a small window, and then you weight the current output by the attention of the window to get the final output you need.
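
As a minimal sketch of the global attention idea just described (assuming a simple dot-product score and made-up dimensions), a single decoder state attends over all encoder hidden states:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def global_attention(decoder_state, encoder_states):
    """Score the decoder state against every encoder state, then take a weighted sum."""
    scores = encoder_states @ decoder_state   # one dot-product score per source position
    weights = softmax(scores)                 # attention distribution over the whole source
    context = weights @ encoder_states        # weighted combination of encoder states
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(7, 32))     # 7 source positions, hidden size 32 (illustrative)
decoder_state = rng.normal(size=32)
context, weights = global_attention(decoder_state, encoder_states)
print(weights.round(2), context.shape)        # weights sum to 1; context is (32,)
```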

07:52

So moving on, I will pass it on to Chetanya to discuss self-attention mechanisms and Transformers.

07:59

Thank you, Div, for covering a brief overview of how the primitive versions of attention work. Now, just before we talk about self-attention, a bit of trivia: the term was first introduced by a paper from Lin et al., which provided a framework for a self-attentive mechanism for sentence embeddings. And now moving on to the main crux of the Transformers paper, which was the self-attention block. Self-attention is the main building block of what makes Transformer models work so well and makes them so powerful. To think of it more easily, we can break down self-attention as a search retrieval problem: given a query q, we need to find a set of keys k which are most similar to q, and return the corresponding values v. Now, these three vectors can be drawn from the same source; for example, we can have q, k, and v all equal to a single vector x, where x can be the output of a previous layer. In Transformers, these vectors are obtained by applying different linear transformations to x, so as to enable the model to capture more complex interactions between the tokens at different places in the sentence. Attention is then computed as a weighted summation: the similarities between the query and key vectors weight the respective values for those keys. And in the Transformers paper, they used the scaled dot-product as the similarity function for the queries and keys.

09:40

Another important aspect of the Transformers was the introduction of Multi-head Self-attention. What Multi-head Self-attention means is that at every layer the self-attention is performed multiple times, which enables the model to learn multiple representation subspaces. So in a way, you can think of it as each head having the power to look at different things and learn different semantics. For example, one head can be learning to predict the part of speech for those tokens; another head might be learning the syntactic structure of the sentence, and all those things that are needed to understand what the sentence means.

10:28

Now, to better understand how self-attention works and what the different computations are, there is a short video. As you can see, there are three incoming tokens: input 1, input 2, input 3. We apply linear transformations to get the key and value vectors for each input, and then, once a query q comes, we calculate its similarity with the respective key vectors, multiply those scores with the value vectors, and then add them all up to get the output. The same computation is then performed on all the tokens, and we get the output of the self-attention layer; the final output of the self-attention layer is in dark green at the top of the screen. For the final token, we perform everything the same: queries multiplied by keys, we get the similarity scores, those similarity scores weight the value vectors, and then we finally perform the addition to get the self-attention output of the Transformer block.
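
Since the short video referenced above cannot be shown here, the following toy NumPy sketch mirrors the same steps for three tokens; the input vectors and projection matrices are made up for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Three incoming tokens, each a 4-dimensional vector (made-up numbers).
X = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 2.0, 0.0, 2.0],
              [1.0, 1.0, 1.0, 1.0]])

rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(4, 3)) for _ in range(3))   # linear transformations

Q, K, V = X @ W_q, X @ W_k, X @ W_v       # step 1: project inputs to queries, keys, values
scores = Q @ K.T / np.sqrt(3)             # step 2: similarity of each query with each key
weights = softmax(scores)                 # step 3: normalize the scores per query
output = weights @ V                      # step 4: weight the value vectors and add them up
print(weights)                            # 3x3 attention map: row i = how token i attends
print(output)                             # one output vector per token
```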

11:39

Apart from self-attention, there are some other necessary ingredients that make the Transformers so powerful. One important aspect is the presence of positional representations, or the positional embedding layer. One way RNNs worked very well was that they process the information in a sequential ordering, so there was this notion of ordering, which is also very important in understanding language: we all know that we read any piece of text from left to right in most languages, and right to left in some languages. So there is a notion of ordering which is lost in self-attention, because every word is attending to every other word. That's why this paper introduced a separate embedding layer for introducing positional representations.

12:30

The second important aspect is having nonlinearities. If you think of all the computation that is happening in the self-attention layer, it's all linear, because it's all matrix multiplication. But as we all know, deep learning models work well when they are able to learn more complex mappings between input and output, which can be attained by a simple MLP.

12:54

And the third important component of the Transformers is masking. Masking is what allows us to parallelize the operations. Since every word can attend to every other word, in the decoder part of the Transformer, which Advay is going to be talking about later, the problem becomes that you don't want the decoder to look into the future, because that can result in data leakage. So that's why masking helps the decoder avoid that future information and learn only from what the model has processed so far.
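
A minimal sketch of that masking idea: before the softmax, the scores for "future" positions are pushed to a large negative value so they get effectively zero attention weight (sizes are illustrative).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n = 4
scores = np.random.default_rng(0).normal(size=(n, n))     # raw query-key scores for 4 positions
causal_mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal = future positions
scores = np.where(causal_mask, -1e9, scores)               # block attention to the future
weights = softmax(scores)
print(weights.round(2))   # row i has (near-)zero weight on positions j > i
```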

13:29

So now, on to the encoder-decoder architecture of the Transformers. Advay.

13:36

Yeah. Thanks, Chetanya, for talking about self-attention. So self-attention is one of the key ingredients that allows Transformers to work so well, but at a very high level, the model that was proposed in the Vaswani et al. paper of 2017 was like previous language models in the sense that it had an encoder-decoder architecture. What that means is: let's say you're working on a translation problem, and you want to translate English to French. The way that would work is you would read in the entire input of your English sentence and encode that input; that's the encoder part of the network. And then you would generate, token by token, the corresponding French translation; the decoder is the part of the network that is responsible for generating those tokens.

14:24

So you can think of these encoder blocks and decoder blocks as essentially something like LEGO: they have subcomponents that make them up. In particular, the encoder block has three main subcomponents. The first is the self-attention layer that Chetanya talked about earlier. And, as talked about earlier as well, you need a feed-forward layer after that, because the self-attention layer only performs linear operations, so you need something that can capture the nonlinearities. You also have a layer norm after this. And lastly, there are residual connections between the different encoder blocks.

15:02

The decoder is very similar to the encoder, but there's one difference, which is that it has an extra layer, because the decoder doesn't just do Multi-head Attention on the output of the previous layers. For context, the encoder does Multi-head Attention for each self-attention layer in the encoder block, and each of the encoder blocks does Multi-head Attention looking at the previous layers of the encoder blocks. The decoder does that too, in the sense that it also looks at the previous layers of the decoder, but it additionally looks at the output of the encoder, and so it needs a Multi-head Attention layer over the encoder blocks. And lastly, there's masking as well: because every token can look at every other token, you want to make sure in the decoder that you're not looking into the future. So if you're in position 3, for instance, you shouldn't be able to look at position 4 and position 5. So those are all the components that led to the creation of the model in the Vaswani et al. paper.
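
Putting those LEGO pieces together, here is a rough, single-head NumPy sketch of one encoder block's forward pass with random weights; it follows the post-norm ordering of the original paper and is illustrative rather than the lecture's code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)          # learned gain/bias omitted for brevity

def encoder_block(x, p):
    # 1) self-attention sub-layer, with residual connection and layer norm
    Q, K, V = x @ p["W_q"], x @ p["W_k"], x @ p["W_v"]
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    x = layer_norm(x + attn @ p["W_o"])
    # 2) position-wise feed-forward sub-layer (the nonlinearity), again residual + norm
    ff = np.maximum(0, x @ p["W_1"]) @ p["W_2"]   # simple ReLU MLP
    return layer_norm(x + ff)

rng = np.random.default_rng(0)
n, d, d_ff = 6, 16, 32                            # illustrative sizes
p = {k: rng.normal(scale=0.1, size=s) for k, s in {
    "W_q": (d, d), "W_k": (d, d), "W_v": (d, d), "W_o": (d, d),
    "W_1": (d, d_ff), "W_2": (d_ff, d)}.items()}
x = rng.normal(size=(n, d))
print(encoder_block(x, p).shape)   # (6, 16): same shape in and out, so blocks can be stacked
```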

16:10

And let's talk a little bit about the advantages and drawbacks of this model. The two main advantages, which are huge advantages and which are why Transformers have done such a good job of revolutionizing many, many fields within deep learning, are as follows. The first is that there is a constant path length between any two positions in a sequence, because every token in the sequence is looking at every other token. And this basically solves the problem we talked about earlier with long sequences: you don't have the problem where, if you're trying to predict a token that depends on a word that was far, far behind in the sentence, you lose that context. Now the distance between them is only one in terms of path length. Also, because of the nature of the computation that's happening, and because of the advances that we've had with GPUs, Transformer models lend themselves really well to parallelization. Basically, if you take a Transformer model with n parameters and you take a model that isn't a Transformer, say an LSTM, also with n parameters, training the Transformer model is going to be much faster because of the parallelization that it leverages.

17:26

So those are the advantages. The disadvantage is basically that self-attention takes quadratic time, because every token looks at every other token. Order n squared, as you might know, does not scale, and there's actually been a lot of work in trying to tackle this. We've linked to some here: Big Bird, Linformer, and Reformer are all approaches to try and make this linear or quasilinear, essentially.
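
To make the quadratic scaling concrete, a quick back-of-the-envelope count of attention scores per layer and per head as the sequence length grows:

```python
# Each of the n tokens attends to all n tokens, so the score matrix has n * n entries.
for n in (512, 2048, 8192):
    print(f"n = {n:5d}  ->  {n * n:>12,} attention scores")
# n =   512  ->       262,144 attention scores
# n =  2048  ->     4,194,304 attention scores
# n =  8192  ->    67,108,864 attention scores
```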

17:51

And yeah, we highly recommend going through Jay Alammar's blog post "The Illustrated Transformer," which provides great visualizations and explains everything that we just talked about in great detail. And I'd like to pass it on to Chetanya for applications of Transformers.

18:10

Yeah. So now moving on to some of the recent work, some of the work that very shortly followed the Transformers paper. One of the models that came out was GPT, the GPT architecture, which was released by OpenAI; the latest model that OpenAI has in the GPT series is GPT-3. It consists of only the decoder blocks from Transformers, and it's trained on a traditional language modeling task, which is predicting the next token given the last t tokens that the model has seen. For any downstream task, you can then just train a classification layer on the last hidden state, which can have any number of labels. And since the model is generative in nature, you can also use the pretrained network for generative kinds of tasks, such as summarization and natural language generation.

19:13

Another important reason GPT gained popularity was its ability to perform what the authors called in-context learning. This is the ability wherein the model can learn, under few-shot settings, what the task is and complete the task without performing any gradient updates. For example, let's say the model has been shown a bunch of addition examples; then if you pass in a new input and just leave it at the equals sign, the model tries to predict the next token, which very often comes out to be the sum of the numbers shown. Another example can be the spell correction task or the translation task. So this was the ability that made GPT-3 so talked about in the NLP world.
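
To illustrate what such a few-shot prompt looks like, here is a tiny example of the text a model would simply be asked to continue; no particular API is implied, and the numbers are made up.

```python
# A few-shot "addition" prompt: worked examples followed by an unfinished line.
# An autoregressive model like GPT-3 is asked to predict the next tokens,
# with no gradient updates, and ideally continues with "19".
prompt = (
    "12 + 5 = 17\n"
    "3 + 9 = 12\n"
    "8 + 11 = "
)
print(prompt)
```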

20:08

And right now, many applications have been made using GPT-3, one of them being the Copilot extension for VS Code, which tries to generate a piece of code given a docstring or natural language text.

20:26

Another major model that came out that was based on the Transformer architecture was BERT. BERT is an acronym for Bidirectional Encoder Representations from Transformers. It consists of only the encoder blocks of the Transformer, unlike GPT-3, which had only the decoder blocks. Because of this change there comes a problem: because BERT has only the encoder blocks, it sees the entire piece of text, so it cannot be pretrained on a naive language modeling task because of the problem of data leakage from the future. So the authors came up with a clever idea: a novel task called Masked Language Modeling, which involves replacing certain words with a placeholder; the model then tries to predict those words given the entire context.

21:20

Now, apart from this token-level task, the authors also added a second objective called Next Sentence Prediction, which was a sentence-level task wherein, given two chunks of text, the model tried to predict whether the second sentence followed the first sentence or not. And after pretraining, for any downstream task the model can be further fine-tuned with an additional classification layer, just like it was with GPT-3.

21:49

So these are the two models that have been very popular and have made their way into a lot of applications. But the landscape has changed quite a lot since we taught this class. There are models with different computing techniques, like ELECTRA and DeBERTa, and there are also models that do well in other modalities, which we are going to be talking about in other lectures in this series as well. So yeah, that's all from this lecture.

22:18

And thank you for tuning in. Yeah, just want to end by saying thank you all for watching this. We have a really exciting set of videos with truly amazing speakers, and we hope you are able to derive value from that. Sure. Thanks a lot. Thank you. Thank you, everyone.
