Visualizing Attention, a Transformer's Heart | Chapter 6, Deep Learning

3Blue1Brown
7 Apr 2024 · 26:09

Summary

TL;DR: The video delves into the intricacies of transformers, a pivotal technology in modern AI, highlighting the attention mechanism's role in refining word embeddings to capture contextual meaning. It explains how, through a series of computations involving query, key, and value matrices, the model adjusts embeddings to reflect context, enabling it to predict the next word in a sequence. The concept of multi-headed attention is introduced, emphasizing the model's ability to learn various contextual relationships in parallel, contributing to its nuanced understanding of language. The video also touches on the computational efficiency and scalability of attention mechanisms, crucial for the performance of large language models like GPT-3.

Takeaways

  • 🧠 The transformer model, introduced in the 2017 paper 'Attention is All You Need', is a fundamental technology in modern AI, including large language models.
  • 📈 Transformers process text by breaking it into tokens and associating each with a high-dimensional vector, or embedding, which captures semantic meaning based on its direction in the embedding space.
  • 🔄 The attention mechanism within transformers adjusts embeddings to encode not just individual words but also rich contextual meaning derived from surrounding words.
  • 💡 Understanding the attention mechanism may be challenging, but it enables the model to refine word meanings based on context, such as distinguishing between 'mole' as an animal and 'mole' as a unit of measurement.
  • 🔍 The attention block calculates an attention pattern by taking dot products between query vectors (each token's request for contextual information) and key vectors (each token's potential answer to those requests).
  • 📊 The attention pattern is a grid of relevance scores that are normalized using softmax, effectively turning them into a probability distribution that the model uses to weigh the importance of context words.
  • 🎭 The model employs a process called masking to prevent later words in a sequence from influencing earlier ones, which would otherwise leak the answer during training, when the model predicts the next token after every subsequence simultaneously.
  • 🔢 Each attention head involves key, query, and value matrices, which are parameterized to capture different attention patterns and update embeddings accordingly.
  • 🌐 Multi-headed attention allows the model to learn various ways context can change word meanings by running many attention heads in parallel, each capturing a unique aspect of the context.
  • 📈 GPT-3, a large language model, uses 96 attention heads per block and includes 96 layers, resulting in nearly 58 billion parameters devoted to attention heads, though the total network has around 175 billion parameters.
  • 🚀 The success of attention mechanisms is partly due to their parallelizability, which allows for efficient computation using GPUs and contributes to the qualitative improvements in model performance with scale.

Q & A

  • What is the primary function of a transformer in the context of AI and large language models?

    -The primary function of a transformer is to process text data by taking in a piece of text and predicting the next word in the sequence. It achieves this by breaking the input text into tokens and associating each token with a high-dimensional vector, known as its embedding. The transformer then adjusts these embeddings to encode not just individual words but also richer contextual meaning.

  • What is the significance of the 2017 paper 'Attention is All You Need' in the development of transformers?

    -The 2017 paper 'Attention is All You Need' introduced the transformer, an architecture built entirely around the attention mechanism. This approach to processing sequences has since become fundamental in the design of large language models and other AI tools.

  • How does the attention mechanism in a transformer work to adjust token embeddings?

    -The attention mechanism works by progressively adjusting the embeddings of tokens so that they go from encoding just the individual words to incorporating richer contextual meanings. This is achieved by having the model attend to different parts of the input sequence and updating the embeddings based on the context provided by other tokens in the sequence.

  • What is the role of the embedding space in the transformer model?

    -The embedding space is a high-dimensional space where each token from the input text is represented as a vector, or an embedding. Directions in this space can correspond to semantic meanings of words. The transformer model adjusts these embeddings in this space to reflect the context and relationships between words in the text.

  • How does the attention mechanism handle multiple meanings of the same word?

    -The attention mechanism handles multiple meanings of the same word by adjusting the embedding of that word based on its context. It uses the surrounding embeddings to pass information and refine the meaning of the word, allowing the model to distinguish between different contexts in which the word may appear.

  • What is the purpose of the query, key, and value matrices in the attention mechanism?

    -The query, key, and value matrices are essential components of the attention mechanism. The query matrix is used to generate a query vector for each token, which represents the token's request for information from other tokens. The key matrix generates key vectors that can respond to these queries. The value matrix produces value vectors that represent the information to be passed on. The attention pattern, computed using dot products and softmax, determines how much of the value vectors each token should receive based on their relevance.

  • How does the attention mechanism prevent later words from influencing earlier ones during training?

    -During training, the attention mechanism uses a process called masking to prevent later words from influencing earlier ones. This is done by setting the entries in the attention pattern that represent later tokens influencing earlier ones to negative infinity before applying softmax. After softmax, these entries become zero, but the columns remain normalized, ensuring that the attention pattern correctly represents the relevance of words without violating the sequence order.

  • What is the significance of the attention pattern in the transformer model?

    -The attention pattern is a crucial part of the transformer model as it represents the weights assigned to each token based on its relevance to updating the meaning of other tokens. It is used to perform a weighted sum of the value vectors, which in turn updates the embeddings of the tokens. This allows the model to focus on the most contextually relevant information when refining the meanings of words in the sequence.

  • How does the transformer model handle the issue of scaling with context size?

    -The attention pattern's size is equal to the square of the context size, so context length is a significant bottleneck for large language models, and scaling it up is non-trivial. Motivated by the desire for bigger context windows, recent years have seen variations on the attention mechanism aimed at making context more scalable, though the video stays focused on the basic mechanism.

  • What is the role of the multi-layer perceptrons in a transformer model?

    -Multi-layer perceptrons, or feedforward networks, are the other main type of operation in a transformer model, processing the data in between the attention blocks. These networks consist of layers of fully connected neurons with non-linear activations, which allow the model to perform non-linear transformations on the data and capture complex patterns and relationships (see the sketch after this list).

  • How does the parameter count in a transformer model contribute to its performance?

    -The parameter count in a transformer model contributes significantly to its performance. A larger number of parameters provides the model with more capacity to learn complex patterns and representations from the data. In the case of GPT-3, the large number of parameters, particularly in the attention heads, allows the model to capture a wide range of contextual nuances and generate more accurate predictions for the next word in a sequence.
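
To make the feedforward answer above concrete, here is a minimal sketch of a position-wise MLP (toy sizes and random NumPy stand-ins for learned weights; GPT models actually use GELU rather than the ReLU shown here):

```python
import numpy as np

def feedforward(x, W1, b1, W2, b2):
    """Position-wise MLP: expand, apply a nonlinearity, project back down."""
    hidden = np.maximum(0.0, W1 @ x + b1)   # ReLU for simplicity; GPT models use GELU
    return W2 @ hidden + b2

rng = np.random.default_rng(2)
d_embed, d_hidden = 64, 256                 # toy sizes; real models expand roughly 4x
W1, b1 = rng.standard_normal((d_hidden, d_embed)), np.zeros(d_hidden)
W2, b2 = rng.standard_normal((d_embed, d_hidden)), np.zeros(d_embed)

x = rng.standard_normal(d_embed)            # one token's embedding
out = x + feedforward(x, W1, b1, W2, b2)    # residual add, as with attention blocks
```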

Outlines

00:00

🤖 Introduction to Transformers and Attention Mechanism

This paragraph introduces the concept of transformers, a key technology in large language models and other AI tools. It mentions the 2017 paper 'Attention is All You Need' and sets the stage for a deeper exploration of the attention mechanism. The goal of the model discussed is to predict the next word in a text, with input text broken into tokens. The paragraph emphasizes the importance of high-dimensional vectors, or embeddings, in capturing semantic meaning and the role of the attention mechanism in refining these embeddings to include richer contextual meaning. It also acknowledges the complexity of the topic and provides examples of how context influences word meaning, such as the different interpretations of the word 'mole' in various phrases.

05:04

🔍 Understanding Word Position and Embeddings

This paragraph delves into how words' positions are encoded in their embeddings, with each word being represented by a high-dimensional vector that not only signifies the word itself but also its position within the context. The discussion continues with the goal of transforming these initial embeddings into refined versions that incorporate meanings from surrounding words, particularly focusing on how adjectives can update the meanings of their corresponding nouns. The paragraph introduces the concept of a 'query' vector, which is used to ask questions about the context, and sets the stage for the explanation of the 'key' and 'value' vectors in the next paragraph.

10:07

📊 Attention Patterns and Dot Product Matching

The paragraph explains the creation of attention patterns through the use of query and key vectors. It describes how the model computes dot products between each possible key-query pair to determine the relevance of each word in updating the meaning of another. The concept of 'attention' is introduced, where certain words 'attend' to others based on the alignment of their vectors. The paragraph also discusses the use of softmax to normalize these scores into a probability distribution, ensuring that the weights sum up to one for each column, representing the relevance of each word. Additionally, it touches on the concept of 'masking' to prevent later words from influencing earlier ones during the training process and mentions the challenges of scaling up context size in attention mechanisms.

15:07

🔧 Updating Embeddings with Attention Mechanism

This paragraph details the process of updating word embeddings using the attention mechanism. It introduces the 'value' vector, which is used to adjust the embeddings based on the relevance scores determined in the previous paragraph. The paragraph explains how these value vectors are added to the original embeddings to produce a refined sequence of embeddings that carry richer contextual meaning. The concept of a 'single head of attention' is introduced, and the paragraph outlines the three matrices (key, query, and value) that parameterize this process, each with its own set of tunable weights. The discussion also includes the parameter count for these matrices and the efficiency gained by factoring the value matrix into two smaller matrices.

20:09

🌐 Multi-Headed Attention and Model Parameters

The paragraph expands on the concept of a single head of attention by explaining multi-headed attention, where many attention heads operate in parallel, each learning different ways that context can influence word meaning. It describes how the outputs of these heads are combined to produce a final, refined embedding. The paragraph also discusses the parameter count for a full attention block in a transformer model, highlighting the efficiency of having the same number of parameters for the value map as for the key and query maps. The concept of 'cross-attention' is briefly mentioned, which is used in models processing two distinct types of data. The paragraph concludes with a note on the parameter count for GPT-3's 96 attention heads and the overall parallelizable nature of the attention mechanism, which contributes to the model's performance.

25:09

🚀 Scaling and Future of Attention Mechanisms

The final paragraph discusses the scalability of attention mechanisms and their impact on model performance. It emphasizes the importance of parallelization in deep learning and how it leads to qualitative improvements in model performance. The paragraph also provides resources for further learning, including links to works by notable experts in the field. It concludes with a brief mention of the history of large language models and an encouragement for viewers to explore the concept of attention for potential innovations.

Keywords

💡Transformer

A Transformer is a type of neural network architecture that is central to many large language models and AI tools today. It was introduced in the seminal 2017 paper 'Attention is All You Need', and it fundamentally processes data through attention mechanisms. In the context of the video, the Transformer is the model being studied, aiming to predict the next word in a given text by encoding tokens (words or pieces of words) with high-dimensional vectors, or embeddings, that capture semantic meaning influenced by their context within the text.

💡Attention Mechanism

The attention mechanism is a crucial component of the Transformer architecture that allows the model to weigh the importance of different parts of the input data relative to each other. It is designed to enable the model to focus on relevant information and is visualized in the video as a way for the model to understand the context surrounding a word by considering its relationship with other words in the input text. The attention mechanism is often found confusing due to its complexity and the numerous computations it involves.

💡Embeddings

Embeddings are high-dimensional vectors associated with each token in the input text that represent the semantic meaning of that token. In the Transformer model, these embeddings are initially context-free but are refined through the attention mechanism to encode richer contextual meaning. The direction of these vectors in the high-dimensional space can correspond to various aspects of a word's meaning, such as gender, number, or other semantic features.
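
As a toy illustration of directions carrying meaning (everything here is constructed by hand; real embeddings are learned, and the relation below only holds approximately in practice):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 64
gender = rng.standard_normal(d)     # a hypothetical "masculine -> feminine" direction

e_man, e_king = rng.standard_normal((2, d))
e_woman = e_man + gender            # imposed by construction in this toy space
e_queen = e_king + gender

# the classic analogy: king - man + woman should land near queen
approx_queen = e_king - e_man + e_woman
print(np.allclose(approx_queen, e_queen))   # True, by construction here
```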

💡Context

In the context of language models and the Transformer architecture, context refers to the surrounding words or tokens that provide additional meaning to a particular word. The model uses the context to disambiguate words with multiple meanings and to understand how different words relate to one another. The attention mechanism is essential for capturing and utilizing this context to make predictions about the next word in a sequence.

💡Query, Key, Value

In the attention mechanism of a Transformer, Query, Key, and Value are components used to compute the attention scores and weigh the relevance of different parts of the input data. The Query is a vector that represents the information needed by a particular token, the Key is a vector that represents the information provided by other tokens, and the Value is a vector that stores the information to be passed on if the Query and Key align closely. These components are used to calculate an attention pattern that determines how much each token in the input influences the representation of other tokens.
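
A minimal NumPy sketch of this pipeline, using made-up toy dimensions (GPT-3's actual sizes are 12,288 for embeddings and 128 for the key-query space):

```python
import numpy as np

rng = np.random.default_rng(0)
d_embed, d_key, n_tokens = 64, 16, 5           # toy sizes, far smaller than GPT-3's

E  = rng.standard_normal((d_embed, n_tokens))  # one embedding column per token
Wq = rng.standard_normal((d_key, d_embed))     # query matrix (learned in practice)
Wk = rng.standard_normal((d_key, d_embed))     # key matrix (learned in practice)

Q = Wq @ E                          # one query vector per token
K = Wk @ E                          # one key vector per token

scores = K.T @ Q / np.sqrt(d_key)   # dot every key with every query, then rescale

# softmax down each column turns the scores into weights that sum to 1
weights = np.exp(scores - scores.max(axis=0))
attention_pattern = weights / weights.sum(axis=0)
```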

💡Self-Attention

Self-attention is a type of attention mechanism used in the Transformer model where the Query, Key, and Value come from the same set of data, meaning the model is attending to different parts of the same input sequence. This allows the model to weigh the importance of each token relative to all other tokens within the same input, enabling it to capture complex relationships and dependencies within the text.

💡Multi-Headed Attention

Multi-headed attention is a technique used in Transformer models where multiple self-attention mechanisms are applied in parallel, each with its own set of parameters. This allows the model to learn different types of contextual relationships and representations at the same time. Each 'head' generates an attention pattern and the resulting embeddings are a combination of the outputs from all the heads, providing a richer and more nuanced understanding of the input data.
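
A hedged sketch of how the heads' proposed changes combine (toy sizes; each head's computation is the single-head pipeline sketched under Query, Key, Value, using the factored value map described later in the video):

```python
import numpy as np

def one_head(E, rng, d_key):
    """One attention head: attention pattern times value vectors."""
    d_embed = E.shape[0]
    Wq, Wk = rng.standard_normal((2, d_key, d_embed))
    Wv_down = rng.standard_normal((d_key, d_embed))   # value map, factored low-rank
    Wv_up   = rng.standard_normal((d_embed, d_key))
    scores = (Wk @ E).T @ (Wq @ E) / np.sqrt(d_key)
    A = np.exp(scores - scores.max(axis=0))
    A /= A.sum(axis=0)                                # column-wise softmax
    return Wv_up @ (Wv_down @ E) @ A                  # one proposed change per token

rng = np.random.default_rng(4)
d_embed, d_key, n_tokens, n_heads = 64, 16, 5, 4      # toy sizes (GPT-3 uses 96 heads)
E = rng.standard_normal((d_embed, n_tokens))

# every head proposes a change; sum them all and add to the original embeddings
E_refined = E + sum(one_head(E, rng, d_key) for _ in range(n_heads))
```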

💡Masking

Masking is a technique used in the training of language models to prevent certain tokens from influencing others in ways that could reveal information about the future context. This is particularly important during the training phase where the model is learning to predict the next word in a sequence. By masking future tokens, the model is forced to focus on the current context without 'cheating' by using information it shouldn't have access to yet.
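
Continuing the toy NumPy convention from above, with columns as queries and rows as keys (most libraries transpose this layout and mask the upper triangle instead), masking is a one-line adjustment before the softmax:

```python
import numpy as np

n = 5
scores = np.random.default_rng(1).standard_normal((n, n))  # toy key-query dot products

# row index i is the key's token, column index j is the query's token;
# i > j would let a LATER token influence an EARLIER one, so forbid it
allowed = np.triu(np.ones((n, n), dtype=bool))
scores = np.where(allowed, scores, -np.inf)   # forbidden entries forced to -infinity

weights = np.exp(scores - scores.max(axis=0))
attention_pattern = weights / weights.sum(axis=0)  # columns still sum to 1
```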

💡Softmax

Softmax is a function used in the attention mechanism to normalize the attention scores into a probability distribution, ensuring that the weights sum up to 1 across each column of the attention pattern. This function is applied to the attention scores to determine the relative importance of each token in the input sequence to the context of a particular token. The softmax function is crucial for the attention mechanism to work correctly, as it allows the model to focus on the most relevant information.
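
A standard numerically stable implementation, applied column by column as described (a textbook formulation, not code from the video):

```python
import numpy as np

def softmax_columns(scores):
    """Normalize each column of a score grid into a probability distribution.

    Subtracting each column's max first avoids overflow in exp() and changes
    nothing, since softmax is invariant to shifting all inputs by a constant.
    """
    shifted = scores - scores.max(axis=0, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=0, keepdims=True)
```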

💡Parameter Count

The parameter count refers to the total number of tunable weights within a neural network model. In the context of the Transformer and attention mechanisms, each matrix involved in the attention process (query, key, value, and output matrices) contributes to the overall parameter count. A higher parameter count often allows the model to capture more complex patterns and relationships in the data, but it also requires more data and computational resources for training.

💡Training Process

The training process of a language model involves adjusting the model's parameters based on the training data to minimize the error in predicting the next word in a sequence. This process typically involves a cost function that measures the difference between the model's predictions and the actual next words, and an optimization algorithm that adjusts the parameters to reduce this cost. During training, the model learns to recognize patterns and dependencies in the language to improve its predictive accuracy.

Highlights

The discussion begins with an introduction to transformers, a key technology in modern AI and large language models, first introduced in the seminal 2017 paper 'Attention is All You Need'.

Transformers aim to take in a piece of text and predict the next word by breaking the input into tokens and associating each with a high-dimensional vector called its embedding.

The high-dimensional space of embeddings can correspond to semantic meanings, with different directions representing various aspects of a word's meaning.

The attention mechanism in transformers is designed to progressively adjust embeddings to encode richer contextual meaning beyond individual words.

The initial token embedding is akin to a lookup table without context, but the transformer's subsequent steps allow surrounding embeddings to pass information and refine the initial vector.

The attention block calculates adjustments to the generic embedding of a word based on its context, moving the vector towards specific directions in the embedding space.

The attention mechanism is crucial for understanding the different meanings of the same word in different contexts, such as 'mole' in various phrases.

The transformer model updates word embeddings by associating adjectives with their corresponding nouns, using the attention mechanism to adjust the meaning of the nouns.

Each word in the input is associated with a query, key, and value vector; the query and key vectors are used to compute an attention pattern that determines how much each word attends to others.

The attention pattern is obtained by computing dot products between key-query pairs and applying softmax to create a probability distribution that represents relevance.

Masking is used during training to prevent later words from influencing earlier ones, ensuring that the model does not inadvertently reveal information about the next word.

The attention mechanism's complexity scales with the square of the context size, making it a significant bottleneck for large language models.
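
To put rough numbers on this, assuming GPT-3's documented context window of 2,048 tokens:

```python
context_size = 2048                     # GPT-3's context window
print(context_size ** 2)                # 4,194,304 scores per head, per block

# doubling the context quadruples the attention pattern
print((2 * context_size) ** 2 // context_size ** 2)   # 4
```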

The process of updating embeddings involves using a value matrix to adjust the meaning of words based on their relevance, as determined by the attention pattern.

In multi-headed attention, multiple attention heads run in parallel, each learning distinct ways that context changes meaning, and their outputs are combined to produce a refined embedding.

GPT-3 uses 96 attention heads in each block, with each head consisting of its own key, query, and value matrices, resulting in a total of around 600 million parameters for a single attention block.

The total number of parameters devoted to attention heads in GPT-3 is nearly 58 billion, which is about a third of the total 175 billion parameters in the network.
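
The tally can be reproduced with the dimensions quoted in the video:

```python
d_embed, d_key, n_heads, n_layers = 12288, 128, 96, 96

per_matrix = d_embed * d_key      # 1,572,864   -> the "1.5 million or so" per matrix
per_head   = 4 * per_matrix       # 6,291,456   -> key, query, value-down, value-up
per_block  = n_heads * per_head   # 603,979,776 -> ~600M per multi-headed block
total      = n_layers * per_block # 57,982,058,496 -> just under 58 billion
print(f"{total:,}")
```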

The attention mechanism's parallelizability is a key factor in the performance improvements of large language models, as it allows for efficient computation on GPUs.

The video concludes with an encouragement for further exploration of attention mechanisms and provides resources for additional learning.

Transcripts

play00:00

In the last chapter, you and I started to step

play00:02

through the internal workings of a transformer.

play00:04

This is one of the key pieces of technology inside large language models,

play00:07

and a lot of other tools in the modern wave of AI.

play00:10

It first hit the scene in a now-famous 2017 paper called Attention is All You Need,

play00:15

and in this chapter you and I will dig into what this attention mechanism is,

play00:19

visualizing how it processes data.

play00:26

As a quick recap, here's the important context I want you to have in mind.

play00:30

The goal of the model that you and I are studying is to

play00:33

take in a piece of text and predict what word comes next.

play00:36

The input text is broken up into little pieces that we call tokens,

play00:40

and these are very often words or pieces of words,

play00:43

but just to make the examples in this video easier for you and me to think about,

play00:47

let's simplify by pretending that tokens are always just words.

play00:51

The first step in a transformer is to associate each token

play00:54

with a high-dimensional vector, what we call its embedding.

play00:57

The most important idea I want you to have in mind is how directions in this

play01:02

high-dimensional space of all possible embeddings can correspond with semantic meaning.

play01:07

In the last chapter we saw an example for how direction can correspond to gender,

play01:11

in the sense that adding a certain step in this space can take you from the

play01:15

embedding of a masculine noun to the embedding of the corresponding feminine noun.

play01:20

That's just one example, and you could imagine how many other directions in this

play01:23

high-dimensional space could correspond to numerous other aspects of a word's meaning.

play01:28

The aim of a transformer is to progressively adjust these

play01:31

embeddings so that they don't merely encode an individual word,

play01:35

but instead they bake in some much, much richer contextual meaning.

play01:40

I should say up front that a lot of people find the attention mechanism,

play01:43

this key piece in a transformer, very confusing,

play01:46

so don't worry if it takes some time for things to sink in.

play01:49

I think that before we dive into the computational details and

play01:52

all the matrix multiplications, it's worth thinking about a couple

play01:55

examples for the kind of behavior that we want attention to enable.

play02:00

Consider the phrases American shrew mole, one mole of carbon dioxide,

play02:04

and take a biopsy of the mole.

play02:06

You and I know that the word mole has different meanings in each one of these,

play02:10

based on the context.

play02:11

But after the first step of a transformer, the one that breaks up the text

play02:15

and associates each token with a vector, the vector that's associated with

play02:18

mole would be the same in all of these cases, because this initial token

play02:22

embedding is effectively a lookup table with no reference to the context.

play02:26

It's only in the next step of the transformer that the surrounding

play02:30

embeddings have the chance to pass information into this one.

play02:33

The picture you might have in mind is that there are multiple distinct directions in

play02:38

this embedding space encoding the multiple distinct meanings of the word mole,

play02:42

and that a well-trained attention block calculates what you need to add to the generic

play02:47

embedding to move it to one of these specific directions, as a function of the context.

play02:53

To take another example, consider the embedding of the word tower.

play02:57

This is presumably some very generic, non-specific direction in the space,

play03:01

associated with lots of other large, tall nouns.

play03:04

If this word was immediately preceded by Eiffel,

play03:06

you could imagine wanting the mechanism to update this vector so that

play03:10

it points in a direction that more specifically encodes the Eiffel tower,

play03:14

maybe correlated with vectors associated with Paris and France and things made of steel.

play03:19

If it was also preceded by the word miniature,

play03:22

then the vector should be updated even further,

play03:24

so that it no longer correlates with large, tall things.

play03:29

More generally than just refining the meaning of a word,

play03:32

the attention block allows the model to move information encoded in

play03:35

one embedding to that of another, potentially ones that are quite far away,

play03:39

and potentially with information that's much richer than just a single word.

play03:43

What we saw in the last chapter was how after all of the vectors flow through the

play03:47

network, including many different attention blocks,

play03:50

the computation you perform to produce a prediction of the next token is entirely a

play03:55

function of the last vector in the sequence.

play03:59

Imagine, for example, that the text you input is most of an entire mystery novel,

play04:03

all the way up to a point near the end, which reads, therefore the murderer was.

play04:08

If the model is going to accurately predict the next word,

play04:11

that final vector in the sequence, which began its life simply embedding the word was,

play04:16

will have to have been updated by all of the attention blocks to represent much,

play04:20

much more than any individual word, somehow encoding all of the information

play04:24

from the full context window that's relevant to predicting the next word.

play04:29

To step through the computations, though, let's take a much simpler example.

play04:32

Imagine that the input includes the phrase, a

play04:35

fluffy blue creature roamed the verdant forest.

play04:38

And for the moment, suppose that the only type of update that we care about

play04:42

is having the adjectives adjust the meanings of their corresponding nouns.

play04:47

What I'm about to describe is what we would call a single head of attention,

play04:50

and later we will see how the attention block consists of many different heads run in

play04:54

parallel.

play04:56

Again, the initial embedding for each word is some high dimensional vector

play04:59

that only encodes the meaning of that particular word with no context.

play05:04

Actually, that's not quite true.

play05:05

They also encode the position of the word.

play05:07

There's a lot more to say about the way that positions are encoded, but right now,

play05:11

all you need to know is that the entries of this vector are enough to

play05:15

tell you both what the word is and where it exists in the context.

play05:19

Let's go ahead and denote these embeddings with the letter e.

play05:22

The goal is to have a series of computations produce a new refined

play05:26

set of embeddings where, for example, those corresponding to the

play05:29

nouns have ingested the meaning from their corresponding adjectives.

play05:33

And playing the deep learning game, we want most of the computations

play05:37

involved to look like matrix-vector products, where the matrices are

play05:40

full of tunable weights, things that the model will learn based on data.

play05:44

To be clear, I'm making up this example of adjectives updating nouns just to

play05:48

illustrate the type of behavior that you could imagine an attention head doing.

play05:52

As with so much deep learning, the true behavior is much harder to parse because it's

play05:57

based on tweaking and tuning a huge number of parameters to minimize some cost function.

play06:01

It's just that as we step through all of the different matrices filled with parameters

play06:05

that are involved in this process, I think it's really helpful to have an imagined

play06:09

example of something that it could be doing to help keep it all more concrete.

play06:14

For the first step of this process, you might imagine each noun, like creature,

play06:18

asking the question, hey, are there any adjectives sitting in front of me?

play06:22

And for the words fluffy and blue, to each be able to answer,

play06:25

yeah, I'm an adjective and I'm in that position.

play06:28

That question is somehow encoded as yet another vector,

play06:32

another list of numbers, which we call the query for this word.

play06:36

This query vector though has a much smaller dimension than the embedding vector, say 128.

play06:42

Computing this query looks like taking a certain matrix,

play06:46

which I'll label wq, and multiplying it by the embedding.

play06:50

Compressing things a bit, let's write that query vector as q,

play06:54

and then anytime you see me put a matrix next to an arrow like this one,

play06:58

it's meant to represent that multiplying this matrix by the vector at the arrow's start

play07:02

gives you the vector at the arrow's end.

play07:05

In this case, you multiply this matrix by all of the embeddings in the context,

play07:10

producing one query vector for each token.

play07:13

The entries of this matrix are parameters of the model,

play07:16

which means the true behavior is learned from data, and in practice,

play07:19

what this matrix does in a particular attention head is challenging to parse.

play07:23

But for our sake, imagining an example that we might hope that it would learn,

play07:27

we'll suppose that this query matrix maps the embeddings of nouns to

play07:31

certain directions in this smaller query space that somehow encodes

play07:34

the notion of looking for adjectives in preceding positions.

play07:38

As to what it does to other embeddings, who knows?

play07:41

Maybe it simultaneously tries to accomplish some other goal with those.

play07:44

Right now, we're laser focused on the nouns.

play07:47

At the same time, associated with this is a second matrix called the key matrix,

play07:51

which you also multiply by every one of the embeddings.

play07:55

This produces a second sequence of vectors that we call the keys.

play07:59

Conceptually, you want to think of the keys as potentially answering the queries.

play08:03

This key matrix is also full of tunable parameters, and just like the query matrix,

play08:07

it maps the embedding vectors to that same smaller dimensional space.

play08:12

You think of the keys as matching the queries whenever they closely align with each other.

play08:17

In our example, you would imagine that the key matrix maps the adjectives like fluffy

play08:21

and blue to vectors that are closely aligned with the query produced by the word creature.

play08:27

To measure how well each key matches each query,

play08:30

you compute a dot product between each possible key-query pair.

play08:34

I like to visualize a grid full of a bunch of dots,

play08:37

where the bigger dots correspond to the larger dot products,

play08:40

the places where the keys and queries align.

play08:43

For our adjective noun example, that would look a little more like this,

play08:47

where if the keys produced by fluffy and blue really do align closely with the query

play08:52

produced by creature, then the dot products in these two spots would be some large

play08:57

positive numbers.

play08:59

In the lingo, machine learning people would say that this means the

play09:02

embeddings of fluffy and blue attend to the embedding of creature.

play09:06

By contrast, the dot product between the key for some other

play09:09

word like the and the query for creature would be some small

play09:12

or negative value that reflects that they are unrelated to each other.

play09:17

So we have this grid of values that can be any real number from

play09:21

negative infinity to infinity, giving us a score for how relevant

play09:25

each word is to updating the meaning of every other word.

play09:29

The way we're about to use these scores is to take a certain

play09:32

weighted sum along each column, weighted by the relevance.

play09:36

So instead of having values range from negative infinity to infinity,

play09:40

what we want is for the numbers in these columns to be between 0 and 1,

play09:44

and for each column to add up to 1, as if they were a probability distribution.

play09:49

If you're coming in from the last chapter, you know what we need to do then.

play09:52

We compute a softmax along each one of these columns to normalize the values.

play10:00

In our picture, after you apply softmax to all of the columns,

play10:03

we'll fill in the grid with these normalized values.

play10:06

At this point you're safe to think about each column as giving weights according

play10:10

to how relevant the word on the left is to the corresponding value at the top.

play10:15

We call this grid an attention pattern.

play10:18

Now if you look at the original transformer paper,

play10:20

there's a really compact way that they write this all down.

play10:23

Here the variables q and k represent the full arrays of query

play10:27

and key vectors respectively, those little vectors you get by

play10:31

multiplying the embeddings by the query and the key matrices.

play10:35

This expression up in the numerator is a really compact way to represent

play10:39

the grid of all possible dot products between pairs of keys and queries.

play10:44

A small technical detail that I didn't mention is that for numerical stability,

play10:48

it happens to be helpful to divide all of these values by the

play10:51

square root of the dimension in that key query space.

play10:54

Then this softmax that's wrapped around the full expression

play10:57

is meant to be understood to apply column by column.
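
For reference, the compact expression being described is usually typeset as in the original paper (conventions differ between sources; the video lays queries out as columns, which is why its softmax runs column by column, whereas the paper's layout puts one query per row):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```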

play11:01

As to that v term, we'll talk about it in just a second.

play11:05

Before that, there's one other technical detail that so far I've skipped.

play11:09

During the training process, when you run this model on a given text example,

play11:13

and all of the weights are slightly adjusted and tuned to either reward or punish it

play11:17

based on how high a probability it assigns to the true next word in the passage,

play11:21

it turns out to make the whole training process a lot more efficient if you

play11:25

simultaneously have it predict every possible next token following each initial

play11:29

subsequence of tokens in this passage.

play11:31

For example, with the phrase that we've been focusing on,

play11:34

it might also be predicting what words follow 'creature' and what words follow 'the'.

play11:39

This is really nice, because it means what would otherwise

play11:42

be a single training example effectively acts as many.

play11:46

For the purposes of our attention pattern, it means that you never

play11:49

want to allow later words to influence earlier words,

play11:52

since otherwise they could kind of give away the answer for what comes next.

play11:56

What this means is that we want all of these spots here,

play11:59

the ones representing later tokens influencing earlier ones,

play12:02

to somehow be forced to be zero.

play12:05

The simplest thing you might think to do is to set them equal to zero,

play12:08

but if you did that the columns wouldn't add up to one anymore,

play12:11

they wouldn't be normalized.

play12:13

So instead, a common way to do this is that before applying softmax,

play12:16

you set all of those entries to be negative infinity.

play12:19

If you do that, then after applying softmax, all of those get turned into zero,

play12:23

but the columns stay normalized.

play12:26

This process is called masking.

play12:27

There are versions of attention where you don't apply it, but in our GPT example,

play12:31

even though this is more relevant during the training phase than it would be,

play12:34

say, running it as a chatbot or something like that,

play12:37

you do always apply this masking to prevent later tokens from influencing earlier ones.

play12:42

Another fact that's worth reflecting on about this attention

play12:45

pattern is how its size is equal to the square of the context size.

play12:49

So this is why context size can be a really huge bottleneck for large language models,

play12:54

and scaling it up is non-trivial.

play12:56

As you imagine, motivated by a desire for bigger and bigger context windows,

play13:00

recent years have seen some variations to the attention mechanism aimed at making

play13:04

context more scalable, but right here, you and I are staying focused on the basics.

play13:10

Okay, great, computing this pattern lets the model

play13:12

deduce which words are relevant to which other words.

play13:16

Now you need to actually update the embeddings,

play13:18

allowing words to pass information to whichever other words they're relevant to.

play13:22

For example, you want the embedding of Fluffy to somehow cause a change

play13:26

to Creature that moves it to a different part of this 12,000-dimensional

play13:30

embedding space that more specifically encodes a Fluffy creature.

play13:35

What I'm going to do here is first show you the most straightforward

play13:38

way that you could do this, though there's a slight way that

play13:40

this gets modified in the context of multi-headed attention.

play13:44

This most straightforward way would be to use a third matrix,

play13:47

what we call the value matrix, which you multiply by the embedding of that first word,

play13:51

for example Fluffy.

play13:53

The result of this is what you would call a value vector,

play13:55

and this is something that you add to the embedding of the second word,

play13:59

in this case something you add to the embedding of Creature.

play14:02

So this value vector lives in the same very high-dimensional space as the embeddings.

play14:07

When you multiply this value matrix by the embedding of a word,

play14:10

you might think of it as saying, if this word is relevant to adjusting the meaning of

play14:15

something else, what exactly should be added to the embedding of that something else

play14:19

in order to reflect this?

play14:22

Looking back in our diagram, let's set aside all of the keys and the queries,

play14:26

since after you compute the attention pattern you're done with those,

play14:29

then you're going to take this value matrix and multiply it by every

play14:32

one of those embeddings to produce a sequence of value vectors.

play14:37

You might think of these value vectors as being

play14:39

kind of associated with the corresponding keys.

play14:42

For each column in this diagram, you multiply each of the

play14:45

value vectors by the corresponding weight in that column.

play14:50

For example here, under the embedding of Creature,

play14:52

you would be adding large proportions of the value vectors for Fluffy and Blue,

play14:57

while all of the other value vectors get zeroed out, or at least nearly zeroed out.

play15:02

And then finally, the way to actually update the embedding associated with this column,

play15:06

previously encoding some context-free meaning of Creature,

play15:09

you add together all of these rescaled values in the column,

play15:13

producing a change that you want to add, that I'll label delta-e,

play15:16

and then you add that to the original embedding.

play15:19

Hopefully what results is a more refined vector encoding the more

play15:23

contextually rich meaning, like that of a fluffy blue creature.

play15:27

And of course you don't just do this to one embedding,

play15:30

you apply the same weighted sum across all of the columns in this picture,

play15:34

producing a sequence of changes, adding all of those changes to the corresponding

play15:38

embeddings, produces a full sequence of more refined embeddings popping out

play15:42

of the attention block.
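
A minimal sketch of just this update step (toy shapes; the attention pattern here is a uniform stand-in for the normalized grid computed earlier):

```python
import numpy as np

rng = np.random.default_rng(5)
d_embed, n_tokens = 64, 5
E  = rng.standard_normal((d_embed, n_tokens))       # current embeddings, one per column
Wv = rng.standard_normal((d_embed, d_embed))        # value matrix (unfactored version)
A  = np.full((n_tokens, n_tokens), 1.0 / n_tokens)  # stand-in for the attention pattern

values    = Wv @ E         # one value vector per token
delta_E   = values @ A     # weighted sum of value vectors, one delta-e per column
E_refined = E + delta_E    # add each change to the corresponding original embedding
```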

play15:44

Zooming out, this whole process is what you would describe as a single head of attention.

play15:49

As I've described things so far, this process is parameterized by three distinct

play15:54

matrices, all filled with tunable parameters, the key, the query, and the value.

play15:59

I want to take a moment to continue what we started in the last chapter,

play16:02

with the scorekeeping where we count up the total number of model parameters using the

play16:07

numbers from GPT-3.

play16:09

These key and query matrices each have 12,288 columns, matching the embedding dimension,

play16:15

and 128 rows, matching the dimension of that smaller key query space.

play16:20

This gives us an additional 1.5 million or so parameters for each one.

play16:24

If you look at that value matrix by contrast, the way I've described things so

play16:30

far would suggest that it's a square matrix that has 12,288 columns and 12,288 rows,

play16:35

since both its inputs and outputs live in this very large embedding space.

play16:41

If true, that would mean about 150 million added parameters.

play16:45

And to be clear, you could do that.

play16:47

You could devote orders of magnitude more parameters

play16:49

to the value map than to the key and query.

play16:52

But in practice, it is much more efficient if instead you make

play16:55

it so that the number of parameters devoted to this value map

play16:57

is the same as the number devoted to the key and the query.

play17:01

This is especially relevant in the setting of

play17:03

running multiple attention heads in parallel.

play17:06

The way this looks is that the value map is factored as a product of two smaller matrices.

play17:11

Conceptually, I would still encourage you to think about the overall linear map,

play17:15

one with inputs and outputs, both in this larger embedding space,

play17:18

for example taking the embedding of blue to this blueness direction that you would

play17:23

add to nouns.

play17:27

It's just that the first of the two matrices has a smaller number of rows,

play17:29

typically the same size as the key query space.

play17:33

What this means is you can think of it as mapping the

play17:35

large embedding vectors down to a much smaller space.

play17:39

This is not the conventional naming, but I'm going to call this the value down matrix.

play17:43

The second matrix maps from this smaller space back up to the embedding space,

play17:47

producing the vectors that you use to make the actual updates.

play17:51

I'm going to call this one the value up matrix, which again is not conventional.

play17:55

The way that you would see this written in most papers looks a little different.

play17:58

I'll talk about it in a minute.

play17:59

In my opinion, it tends to make things a little more conceptually confusing.

play18:03

To throw in linear algebra jargon here, what we're basically doing

play18:06

is constraining the overall value map to be a low rank transformation.

play18:11

Turning back to the parameter count, all four of these matrices have the same size,

play18:16

and adding them all up we get about 6.3 million parameters for one attention head.
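
A sketch of that factoring, using the video's non-standard value-down / value-up names and GPT-3's stated dimensions:

```python
import numpy as np

d_embed, d_key = 12288, 128
rng = np.random.default_rng(6)

value_down = rng.standard_normal((d_key, d_embed))  # big embedding space -> small space
value_up   = rng.standard_normal((d_embed, d_key))  # small space -> back up to embeddings

e = rng.standard_normal(d_embed)
change = value_up @ (value_down @ e)   # rank-128 map; the square matrix is never built

print(d_embed * d_embed)    # ~151M params if the value map were one square matrix
print(2 * d_embed * d_key)  # ~3.1M params for the factored version, matching key + query
```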

play18:22

As a quick side note, to be a little more accurate,

play18:24

everything described so far is what people would call a self-attention head,

play18:27

to distinguish it from a variation that comes up in other models that's

play18:30

called cross-attention.

play18:32

This isn't relevant to our GPT example, but if you're curious,

play18:35

cross-attention involves models that process two distinct types of data,

play18:39

like text in one language and text in another language that's part of an

play18:43

ongoing generation of a translation, or maybe audio input of speech and an

play18:48

ongoing transcription.

play18:50

A cross-attention head looks almost identical.

play18:52

The only difference is that the key and query maps act on different data sets.

play18:57

In a model doing translation, for example, the keys might come from one language,

play19:02

while the queries come from another, and the attention pattern could describe

play19:06

which words from one language correspond to which words in another.

play19:10

And in this setting there would typically be no masking,

play19:12

since there's not really any notion of later tokens affecting earlier ones.

play19:17

Staying focused on self-attention though, if you understood everything so far,

play19:20

and if you were to stop here, you would come away with the essence of what attention

play19:24

really is.

play19:25

All that's really left to us is to lay out the

play19:28

sense in which you do this many many different times.

play19:32

In our central example we focused on adjectives updating nouns,

play19:35

but of course there are lots of different ways that context can influence the

play19:38

meaning of a word.

play19:40

If the words 'they crashed the' preceded the word car,

play19:43

it has implications for the shape and structure of that car.

play19:47

And a lot of associations might be less grammatical.

play19:49

If the word wizard is anywhere in the same passage as Harry,

play19:52

it suggests that this might be referring to Harry Potter,

play19:55

whereas if instead the words Queen, Sussex, and William were in that passage,

play20:00

then perhaps the embedding of Harry should instead be updated to refer to the prince.

play20:05

For every different type of contextual updating that you might imagine,

play20:08

the parameters of these key and query matrices would be different to

play20:11

capture the different attention patterns, and the parameters of our

play20:15

value map would be different based on what should be added to the embeddings.

play20:19

And again, in practice the true behavior of these maps is much more

play20:23

difficult to interpret, where the weights are set to do whatever the

play20:26

model needs them to do to best accomplish its goal of predicting the next token.

play20:31

As I said before, everything we described is a single head of attention,

play20:35

and a full attention block inside a transformer consists of what's

play20:38

called multi-headed attention, where you run a lot of these operations in parallel,

play20:43

each with its own distinct key query and value maps.

play20:47

GPT-3 for example uses 96 attention heads inside each block.

play20:52

Considering that each one is already a bit confusing,

play20:54

it's certainly a lot to hold in your head.

play20:56

Just to spell it all out very explicitly, this means you have 96

play21:00

distinct key and query matrices producing 96 distinct attention patterns.

play21:05

Then each head has its own distinct value matrices

play21:08

used to produce 96 sequences of value vectors.

play21:12

These are all added together using the corresponding attention patterns as weights.

play21:17

What this means is that for each position in the context, each token,

play21:21

every one of these heads produces a proposed change to be added to the embedding in

play21:26

that position.

play21:27

So what you do is you sum together all of those proposed changes,

play21:31

one for each head, and you add the result to the original embedding of that position.

play21:36

This entire sum here would be one slice of what's outputted from this multi-headed

play21:41

attention block, a single one of those refined embeddings that pops out the other end

play21:47

of it.

play21:48

Again, this is a lot to think about, so don't

play21:50

worry at all if it takes some time to sink in.

play21:52

The overall idea is that by running many distinct heads in parallel,

play21:56

you're giving the model the capacity to learn many distinct ways that context

play22:00

changes meaning.

play22:03

Pulling up our running tally for parameter count with 96 heads,

play22:07

each including its own variation of these four matrices,

play22:10

each block of multi-headed attention ends up with around 600 million parameters.

play22:16

There's one added slightly annoying thing that I should really

play22:19

mention for any of you who go on to read more about transformers.

play22:22

You remember how I said that the value map is factored out into these two

play22:25

distinct matrices, which I labeled as the value down and the value up matrices.

play22:29

The way that I framed things would suggest that you see this pair of matrices

play22:34

inside each attention head, and you could absolutely implement it this way.

play22:38

That would be a valid design.

play22:40

But the way that you see this written in papers and the way

play22:42

that it's implemented in practice looks a little different.

play22:45

All of these value up matrices for each head appear stapled together in one giant matrix

play22:50

that we call the output matrix, associated with the entire multi-headed attention block.

play22:56

And when you see people refer to the value matrix for a given attention head,

play23:00

they're typically only referring to this first step,

play23:03

the one that I was labeling as the value down projection into the smaller space.

play23:08

For the curious among you, I've left an on-screen note about it.

play23:11

It's one of those details that runs the risk of distracting

play23:13

from the main conceptual points, but I do want to call it out

play23:16

just so that you know if you read about this in other sources.

play23:19

Setting aside all the technical nuances, in the preview from the last chapter we saw

play23:23

how data flowing through a transformer doesn't just flow through a single attention block.

play23:28

For one thing, it also goes through these other operations called multi-layer perceptrons.

play23:33

We'll talk more about those in the next chapter.

play23:35

And then it repeatedly goes through many many copies of both of these operations.

play23:39

What this means is that after a given word imbibes some of its context,

play23:43

there are many more chances for this more nuanced embedding

play23:47

to be influenced by its more nuanced surroundings.

play23:50

The further down the network you go, with each embedding taking in more and more

play23:54

meaning from all the other embeddings, which themselves are getting more and more

play23:59

nuanced, the hope is that there's the capacity to encode higher level and more

play24:03

abstract ideas about a given input beyond just descriptors and grammatical structure.

play24:07

Things like sentiment and tone and whether it's a poem and what underlying

play24:11

scientific truths are relevant to the piece and things like that.

play24:16

Turning back one more time to our scorekeeping, GPT-3 includes 96 distinct layers,

play24:22

so the total number of key query and value parameters is multiplied by another 96,

play24:27

which brings the total sum to just under 58 billion distinct parameters

play24:32

devoted to all of the attention heads.

play24:34

That is a lot to be sure, but it's only about a third

play24:38

of the 175 billion that are in the network in total.

play24:41

So even though attention gets all of the attention,

play24:44

the majority of parameters come from the blocks sitting in between these steps.

play24:48

In the next chapter, you and I will talk more about those

play24:51

other blocks and also a lot more about the training process.

play24:54

A big part of the story for the success of the attention mechanism is not so much any

play24:58

specific kind of behavior that it enables, but the fact that it's extremely

play25:03

parallelizable, meaning that you can run a huge number of computations in a short time

play25:07

using GPUs.

play25:09

Given that one of the big lessons about deep learning in the last decade or two has

play25:13

been that scale alone seems to give huge qualitative improvements in model performance,

play25:17

there's a huge advantage to parallelizable architectures that let you do this.

play25:22

If you want to learn more about this stuff, I've left lots of links in the description.

play25:25

In particular, anything produced by Andrej Karpathy or Chris Olah tends to be pure gold.

play25:30

In this video, I wanted to just jump into attention in its current form,

play25:33

but if you're curious about more of the history for how we got here

play25:36

and how you might reinvent this idea for yourself,

play25:38

my friend Vivek just put up a couple videos giving a lot more of that motivation.

play25:43

Also, Britt Cruz from the channel The Art of the Problem has

play25:45

a really nice video about the history of large language models.

play26:04

Thank you.
