Visualizing Attention, a Transformer's Heart | Chapter 6, Deep Learning

3Blue1Brown
7 Apr 2024 26:09

Summary

TLDR: This video script examines the attention mechanism, the key component of transformers, the technology behind modern AI language models. The author explains how transformers turn text into vectors that encode the semantic meaning of words, and how the attention mechanism lets those vectors progressively take on richer contextual information. The video also covers technical details such as the matrix multiplications involved and the use of masking during training. The author closes by noting how important parallelism in the transformer architecture is for performance and scaling.

Takeaways

  • 📚 Transformers are a key technology in modern AI language models, first introduced in the 2017 paper 'Attention is All You Need'.
  • 🔍 The model's goal is to predict the next word in a text; the input is split into tokens, which can be words or pieces of words.
  • 📈 Transformers associate each token with a high-dimensional vector called its embedding, in which different directions can correspond to semantic meaning.
  • 🧠 The attention mechanism lets the model encode not just individual words but much richer contextual information.
  • 🤔 Attention can be hard to understand, but it is what lets the model process context and change a word's meaning depending on its surroundings.
  • 🔄 Updating an embedding involves query, key, and value matrices, which pass information between words.
  • 🔢 After computing the attention pattern, the model applies the value matrix to update the word embeddings, passing information between them.
  • 🚀 Attention in transformers can be multi-headed, which lets the model learn in parallel many distinct ways that context can change a word's meaning.
  • 🔗 Each attention head has its own key, query, and value matrices, which makes the model more flexible and accurate.
  • 📈 Multiple attention heads let the model take different contextual relationships into account and update word meanings accordingly.
  • 🌐 The scalability and parallelism of attention are key factors in the success of transformers and modern language models.

Q & A

  • In what kinds of models are transformers a key piece of technology?

    -Transformers are a key piece of technology in large language models and many other tools in the modern wave of AI.

  • Which well-known paper brought transformers to prominence?

    -Transformers first hit the scene in a now-famous 2017 paper titled 'Attention is All You Need'.

  • What is the goal of the model discussed in the video?

    -The model's goal is to take in text and predict which word comes next.

  • Which two matrices are used to produce the queries and keys for each word?

    -The query matrix and the key matrix: each is multiplied by every embedding to produce that token's query vector and key vector, respectively.

  • What is the attention mechanism, and how does it let the model process data?

    -The attention mechanism is the process by which the model takes into account the context a word appears in, so that it can pin down the word's meaning more precisely and predict the next word in the text.

  • What technical step is used to normalize the values in the attention grid?

    -A softmax function, applied column by column, normalizes the values so that each entry lies between 0 and 1 and each column sums to 1.

  • What is the process called in which entries corresponding to later words are forced to zero?

    -It is called masking, and it prevents later words from influencing earlier ones during training. The entries are set to negative infinity before softmax, so they become zero afterwards while the columns stay normalized.

  • What is used to update the word embeddings once the attention pattern has been computed?

    -The value matrix: it is multiplied by the word embeddings to produce value vectors, which are then used to update the embeddings.

  • What is the pair of matrices used to reduce and then restore the dimension of the value vectors called?

    -They are called the value down matrix and the value up matrix.

  • Which type of attention does the video mainly cover?

    -The video covers self-attention, as opposed to cross-attention, where the keys and queries can act on different data sets.

  • What mechanism lets the model apply different contextual updates in parallel?

    -Multi-headed attention lets the model apply different contextual updates in parallel, using many attention heads that each have their own key, query, and value matrices.

  • How many parameters in GPT-3 are devoted to attention heads?

    -In GPT-3, about 58 billion parameters are devoted to the attention heads.

Outlines

00:00

🤖 Introduction to transformers and the attention mechanism

Section 1 introduces transformers, a key piece of modern AI language models. It discusses the attention mechanism, introduced in the 2017 paper 'Attention is All You Need'. The model's goal is to predict the next word in a text. The section introduces tokens and embeddings, and explains why directions in the high-dimensional embedding space matter for semantic meaning.

05:04

📚 Mapping words to vectors and updating them

Section 2 describes how words are encoded as vectors and then updated to reflect context. It covers how positional information enters the vectors, how queries and keys are created to determine which words are relevant to each other, how dot products measure how well they match, and how softmax normalizes the resulting values.

10:07

🔍 Masking and updating embeddings with attention

Section 3 covers masking, which prevents later words from influencing earlier ones during training. It discusses technical details such as using the value matrix to update word embeddings, including splitting that matrix into two parts for efficiency.

15:07

🧠 Multi-headed attention and its parameters

Section 4 explains multi-headed attention, in which several heads process the input in parallel, each with its own key, query, and value matrices. It describes how each head's parameters are tuned for different kinds of contextual updates and how this increases the model's capacity to learn.

20:09

🚀 Scaling and parallelizing the attention mechanism

Section 5 stresses the importance of parallelization in the design of the attention mechanism, which allows a huge number of computations to run in a short time on GPUs. It discusses how scale and parallelism improve the performance of language models.

25:09

📚 Resources for learning more about attention

The final section points to additional resources on the attention mechanism and the history of large language models. It recommends material from Andrej Karpathy, Chris Olah, and Vivek, as well as a video on the history of language models from the channel The Art of the Problem.

Keywords

💡Transformers

Transformers are a key technology used in large language models and other modern AI tools. They were first introduced in the famous 2017 paper 'Attention is All You Need'. The video treats them as the core mechanism that lets models understand context and predict the next word in a text.

💡Attention Mechanism

The attention mechanism is a fundamental component of transformers that lets the model analyze context and determine how each word influences the others. The video explains it through examples such as the different meanings of the word 'mole' in different contexts.

💡Tokens

Tokens are small pieces of text, usually words or parts of words, that the model works with. In the video they are the units for which the model computes embedding vectors representing their meaning in context.

💡Embeddings

Embeddings are high-dimensional vectors that encode the meaning of each token. The video shows how these vectors progressively absorb context as they pass through the network, becoming richer and more precise reflections of meaning.

💡Matrix Multiplications

Matrix multiplications are the main computation in transformers. They are used to compute the key, query, and value vectors, which in turn determine how words influence each other.

💡Key Matrices

Key matrices are matrices of learned parameters used to compute key vectors. They help determine how each word may relate to other words in the context.

💡Query Matrices

Query matrices are used to produce query vectors, which help the model determine which words may be relevant to the current word in the context. They are a key element of the attention mechanism.

💡Value Matrices

Value matrices are used to compute value vectors, which are added to word embeddings to update their meaning in context. They are part of how transformers update embeddings.

💡Masking

Masking is a technique that prevents later words from influencing earlier ones during training. This matters because later words could otherwise give away the answer the model is supposed to predict.

💡Multi-Headed Attention

Multi-headed attention is a technique used in transformers to run many attention operations in parallel. It lets the model capture different aspects of context and improves the quality of next-word prediction.

💡Model Parameters

Model parameters are the weights the model learns from data. The video mentions that GPT-3 has about 175 billion parameters, a scale that gives the model its capacity to process a large amount of context.

Highlights

Transformers are a key technology in large language models and modern AI tools, introduced in the 2017 paper 'Attention is All You Need'.

The model's goal is to take in text and predict the next word, using tokens which are often words or pieces of words.

Each token is associated with a high-dimensional vector, or embedding, which encodes semantic meaning.

The attention mechanism in transformers adjusts embeddings to encode richer contextual meaning.

The attention mechanism can be confusing, but it's designed to enable certain behaviors like context-based word meaning adjustments.

The transformer's first step involves creating a token embedding, which initially has no context.

Subsequent steps allow surrounding embeddings to pass contextual information to a given token.

Well-trained attention blocks calculate necessary adjustments to generic embeddings based on context.

The transformer can update embeddings to reflect context, such as changing 'tower' to specifically represent 'Eiffel Tower'.

Attention blocks enable the model to move information from one embedding to another, even across long distances.

The final vector in a sequence is crucial for accurately predicting the next word in a text.

A simple example demonstrates how adjectives can adjust the meanings of their corresponding nouns through attention.

Each word's embedding also encodes its position in the context, which is important for the attention mechanism.

The attention process involves matrices (query, key, value) with tunable parameters that the model learns from data.

Multi-headed attention, used in transformers like GPT-3, runs many attention operations in parallel, each with unique matrices.

GPT-3 uses 96 attention heads in each block, each with its own parameters; across all blocks, the attention heads account for around 58 billion parameters.

Attention mechanism's success is partly due to its parallelizable nature, allowing for efficient computation on GPUs.

The attention mechanism is just one part of the transformer; other operations like multi-layer perceptrons are also crucial.

The total number of parameters in GPT-3 is 175 billion, with most parameters outside the attention mechanism.

Transcripts

play00:00

In the last chapter, you and I started to step

play00:02

through the internal workings of a transformer.

play00:04

This is one of the key pieces of technology inside large language models,

play00:07

and a lot of other tools in the modern wave of AI.

play00:10

It first hit the scene in a now-famous 2017 paper called Attention is All You Need,

play00:15

and in this chapter you and I will dig into what this attention mechanism is,

play00:19

visualizing how it processes data.

play00:26

As a quick recap, here's the important context I want you to have in mind.

play00:30

The goal of the model that you and I are studying is to

play00:33

take in a piece of text and predict what word comes next.

play00:36

The input text is broken up into little pieces that we call tokens,

play00:40

and these are very often words or pieces of words,

play00:43

but just to make the examples in this video easier for you and me to think about,

play00:47

let's simplify by pretending that tokens are always just words.

play00:51

The first step in a transformer is to associate each token

play00:54

with a high-dimensional vector, what we call its embedding.

play00:57

The most important idea I want you to have in mind is how directions in this

play01:02

high-dimensional space of all possible embeddings can correspond with semantic meaning.

play01:07

In the last chapter we saw an example for how direction can correspond to gender,

play01:11

in the sense that adding a certain step in this space can take you from the

play01:15

embedding of a masculine noun to the embedding of the corresponding feminine noun.

play01:20

That's just one example, but you could imagine how many other directions in this

play01:23

high-dimensional space could correspond to numerous other aspects of a word's meaning.

play01:28

The aim of a transformer is to progressively adjust these

play01:31

embeddings so that they don't merely encode an individual word,

play01:35

but instead they bake in some much, much richer contextual meaning.

play01:40

I should say up front that a lot of people find the attention mechanism,

play01:43

this key piece in a transformer, very confusing,

play01:46

so don't worry if it takes some time for things to sink in.

play01:49

I think that before we dive into the computational details and

play01:52

all the matrix multiplications, it's worth thinking about a couple

play01:55

examples for the kind of behavior that we want attention to enable.

play02:00

Consider the phrases American shrew mole, one mole of carbon dioxide,

play02:04

and take a biopsy of the mole.

play02:06

You and I know that the word mole has different meanings in each one of these,

play02:10

based on the context.

play02:11

But after the first step of a transformer, the one that breaks up the text

play02:15

and associates each token with a vector, the vector that's associated with

play02:18

mole would be the same in all of these cases, because this initial token

play02:22

embedding is effectively a lookup table with no reference to the context.
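To make the lookup-table point concrete, here is a minimal sketch in Python with NumPy. The tiny vocabulary, the embedding size of 8, and the random table values are all assumptions made for illustration; a trained model would have learned values and a far larger table.

```python
import numpy as np

# Hypothetical toy vocabulary and embedding size (assumed for illustration).
vocab = {"american": 0, "shrew": 1, "mole": 2, "one": 3,
         "of": 4, "carbon": 5, "dioxide": 6}
d_embed = 8

rng = np.random.default_rng(0)
# One row per token; random stand-ins for learned values.
embedding_table = rng.normal(size=(len(vocab), d_embed))

def embed(words):
    """Look up each word's row. Neighboring words play no role here."""
    return embedding_table[[vocab[w] for w in words]]

phrase_a = embed(["american", "shrew", "mole"])
phrase_b = embed(["one", "mole", "of", "carbon", "dioxide"])

# "mole" gets exactly the same vector in both phrases.
same_vector = np.array_equal(phrase_a[2], phrase_b[1])
print(same_vector)
```

Because the lookup ignores context, any disambiguation of "mole" has to come from the later attention steps.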

play02:26

It's only in the next step of the transformer that the surrounding

play02:30

embeddings have the chance to pass information into this one.

play02:33

The picture you might have in mind is that there are multiple distinct directions in

play02:38

this embedding space encoding the multiple distinct meanings of the word mole,

play02:42

and that a well-trained attention block calculates what you need to add to the generic

play02:47

embedding to move it to one of these specific directions, as a function of the context.

play02:53

To take another example, consider the embedding of the word tower.

play02:57

This is presumably some very generic, non-specific direction in the space,

play03:01

associated with lots of other large, tall nouns.

play03:04

If this word was immediately preceded by Eiffel,

play03:06

you could imagine wanting the mechanism to update this vector so that

play03:10

it points in a direction that more specifically encodes the Eiffel tower,

play03:14

maybe correlated with vectors associated with Paris and France and things made of steel.

play03:19

If it was also preceded by the word miniature,

play03:22

then the vector should be updated even further,

play03:24

so that it no longer correlates with large, tall things.

play03:29

More generally than just refining the meaning of a word,

play03:32

the attention block allows the model to move information encoded in

play03:35

one embedding to that of another, potentially ones that are quite far away,

play03:39

and potentially with information that's much richer than just a single word.

play03:43

What we saw in the last chapter was how after all of the vectors flow through the

play03:47

network, including many different attention blocks,

play03:50

the computation you perform to produce a prediction of the next token is entirely a

play03:55

function of the last vector in the sequence.

play03:59

Imagine, for example, that the text you input is most of an entire mystery novel,

play04:03

all the way up to a point near the end, which reads, therefore the murderer was.

play04:08

If the model is going to accurately predict the next word,

play04:11

that final vector in the sequence, which began its life simply embedding the word was,

play04:16

will have to have been updated by all of the attention blocks to represent much,

play04:20

much more than any individual word, somehow encoding all of the information

play04:24

from the full context window that's relevant to predicting the next word.

play04:29

To step through the computations, though, let's take a much simpler example.

play04:32

Imagine that the input includes the phrase, a

play04:35

fluffy blue creature roamed the verdant forest.

play04:38

And for the moment, suppose that the only type of update that we care about

play04:42

is having the adjectives adjust the meanings of their corresponding nouns.

play04:47

What I'm about to describe is what we would call a single head of attention,

play04:50

and later we will see how the attention block consists of many different heads run in

play04:54

parallel.

play04:56

Again, the initial embedding for each word is some high dimensional vector

play04:59

that only encodes the meaning of that particular word with no context.

play05:04

Actually, that's not quite true.

play05:05

They also encode the position of the word.

play05:07

There's a lot more to say about the way that positions are encoded, but right now,

play05:11

all you need to know is that the entries of this vector are enough to

play05:15

tell you both what the word is and where it exists in the context.

play05:19

Let's go ahead and denote these embeddings with the letter e.

play05:22

The goal is to have a series of computations produce a new refined

play05:26

set of embeddings where, for example, those corresponding to the

play05:29

nouns have ingested the meaning from their corresponding adjectives.

play05:33

And playing the deep learning game, we want most of the computations

play05:37

involved to look like matrix-vector products, where the matrices are

play05:40

full of tunable weights, things that the model will learn based on data.

play05:44

To be clear, I'm making up this example of adjectives updating nouns just to

play05:48

illustrate the type of behavior that you could imagine an attention head doing.

play05:52

As with so much deep learning, the true behavior is much harder to parse because it's

play05:57

based on tweaking and tuning a huge number of parameters to minimize some cost function.

play06:01

It's just that as we step through all of the different matrices filled with parameters

play06:05

that are involved in this process, I think it's really helpful to have an imagined

play06:09

example of something that it could be doing to help keep it all more concrete.

play06:14

For the first step of this process, you might imagine each noun, like creature,

play06:18

asking the question, hey, are there any adjectives sitting in front of me?

play06:22

And for the words fluffy and blue, to each be able to answer,

play06:25

yeah, I'm an adjective and I'm in that position.

play06:28

That question is somehow encoded as yet another vector,

play06:32

another list of numbers, which we call the query for this word.

play06:36

This query vector though has a much smaller dimension than the embedding vector, say 128.

play06:42

Computing this query looks like taking a certain matrix,

play06:46

which I'll label wq, and multiplying it by the embedding.

play06:50

Compressing things a bit, let's write that query vector as q,

play06:54

and then anytime you see me put a matrix next to an arrow like this one,

play06:58

it's meant to represent that multiplying this matrix by the vector at the arrow's start

play07:02

gives you the vector at the arrow's end.

play07:05

In this case, you multiply this matrix by all of the embeddings in the context,

play07:10

producing one query vector for each token.
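As a rough sketch of this step, the snippet below multiplies a query matrix by every embedding in an eight-word context. The sizes (64 and 16) are toy values assumed for readability, standing in for the much larger embedding dimension and the query dimension of 128 mentioned above; the random numbers stand in for learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_embed, d_query = 8, 64, 16   # toy sizes, assumed

E = rng.normal(size=(n_tokens, d_embed))   # one embedding per row
W_Q = rng.normal(size=(d_query, d_embed))  # maps embedding space to query space

# Matrix-vector product for a single token (index 3, e.g. "creature")...
q_single = W_Q @ E[3]

# ...and the same product applied to every token at once.
Q = E @ W_Q.T                               # shape (n_tokens, d_query)
print(Q.shape)
```

Stacking the embeddings as rows lets one matrix product compute every token's query in a single call.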

play07:13

The entries of this matrix are parameters of the model,

play07:16

which means the true behavior is learned from data, and in practice,

play07:19

what this matrix does in a particular attention head is challenging to parse.

play07:23

But for our sake, imagining an example that we might hope that it would learn,

play07:27

we'll suppose that this query matrix maps the embeddings of nouns to

play07:31

certain directions in this smaller query space that somehow encodes

play07:34

the notion of looking for adjectives in preceding positions.

play07:38

As to what it does to other embeddings, who knows?

play07:41

Maybe it simultaneously tries to accomplish some other goal with those.

play07:44

Right now, we're laser focused on the nouns.

play07:47

At the same time, associated with this is a second matrix called the key matrix,

play07:51

which you also multiply by every one of the embeddings.

play07:55

This produces a second sequence of vectors that we call the keys.

play07:59

Conceptually, you want to think of the keys as potentially answering the queries.

play08:03

This key matrix is also full of tunable parameters, and just like the query matrix,

play08:07

it maps the embedding vectors to that same smaller dimensional space.

play08:12

You think of the keys as matching the queries whenever they closely align with each other.

play08:17

In our example, you would imagine that the key matrix maps the adjectives like fluffy

play08:21

and blue to vectors that are closely aligned with the query produced by the word creature.

play08:27

To measure how well each key matches each query,

play08:30

you compute a dot product between each possible key-query pair.

play08:34

I like to visualize a grid full of a bunch of dots,

play08:37

where the bigger dots correspond to the larger dot products,

play08:40

the places where the keys and queries align.

play08:43

For our adjective noun example, that would look a little more like this,

play08:47

where if the keys produced by fluffy and blue really do align closely with the query

play08:52

produced by creature, then the dot products in these two spots would be some large

play08:57

positive numbers.

play08:59

In the lingo, machine learning people would say that this means the

play09:02

embeddings of fluffy and blue attend to the embedding of creature.

play09:06

By contrast, the dot product between the key for some other

play09:09

word like the and the query for creature would be some small

play09:12

or negative value that reflects that they are unrelated to each other.

play09:17

So we have this grid of values that can be any real number from

play09:21

negative infinity to infinity, giving us a score for how relevant

play09:25

each word is to updating the meaning of every other word.
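The grid of scores can be sketched as a single matrix product. This is an illustration under assumed toy sizes and random weights, laid out to match the picture described here: each row corresponds to a key token and each column to a query token.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d_embed, d_key = 8, 64, 16       # toy sizes, assumed

E = rng.normal(size=(n_tokens, d_embed))   # embeddings, one per row
W_Q = rng.normal(size=(d_key, d_embed))    # query matrix (random stand-in)
W_K = rng.normal(size=(d_key, d_embed))    # key matrix (random stand-in)

Q = E @ W_Q.T    # queries, one row per token
K = E @ W_K.T    # keys, one row per token

# scores[i, j] = key_i . query_j
# Row i is the key token, column j is the query token.
scores = K @ Q.T
print(scores.shape)
```

With random weights the scores are meaningless; in the imagined adjective-noun example, training would make the entries for (fluffy, creature) and (blue, creature) large and positive.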

play09:29

The way we're about to use these scores is to take a certain

play09:32

weighted sum along each column, weighted by the relevance.

play09:36

So instead of having values range from negative infinity to infinity,

play09:40

what we want is for the numbers in these columns to be between 0 and 1,

play09:44

and for each column to add up to 1, as if they were a probability distribution.

play09:49

If you're coming in from the last chapter, you know what we need to do then.

play09:52

We compute a softmax along each one of these columns to normalize the values.

play10:00

In our picture, after you apply softmax to all of the columns,

play10:03

we'll fill in the grid with these normalized values.

play10:06

At this point you're safe to think about each column as giving weights according

play10:10

to how relevant the word on the left is to the corresponding value at the top.

play10:15

We call this grid an attention pattern.
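A column-wise softmax can be sketched as follows. The 3-by-3 score values are made up for illustration, and subtracting each column's maximum before exponentiating is a common implementation habit that does not change the result.

```python
import numpy as np

def softmax_columns(scores):
    """Apply softmax independently to each column of a 2-D array."""
    # Subtracting the column maximum avoids overflow in exp
    # without changing the normalized result.
    shifted = scores - scores.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

# Made-up raw scores: rows are keys, columns are queries.
scores = np.array([[ 2.0, -1.0,  0.5],
                   [ 0.0,  3.0, -2.0],
                   [-1.0,  0.0,  4.0]])

pattern = softmax_columns(scores)
print(pattern.sum(axis=0))   # each column sums to 1 (up to rounding)
```

Each column of `pattern` can now be read as a set of weights over the words on the left.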

play10:18

Now if you look at the original transformer paper,

play10:20

there's a really compact way that they write this all down.

play10:23

Here the variables q and k represent the full arrays of query

play10:27

and key vectors respectively, those little vectors you get by

play10:31

multiplying the embeddings by the query and the key matrices.

play10:35

This expression up in the numerator is a really compact way to represent

play10:39

the grid of all possible dot products between pairs of keys and queries.

play10:44

A small technical detail that I didn't mention is that for numerical stability,

play10:48

it happens to be helpful to divide all of these values by the

play10:51

square root of the dimension in that key query space.

play10:54

Then this softmax that's wrapped around the full expression

play10:57

is meant to be understood to apply column by column.

play11:01

As to that v term, we'll talk about it in just a second.
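For reference, the compact expression from the original transformer paper can be written as below, where d_k is the dimension of the key-query space. One caveat on layout: the paper stacks queries as rows and applies softmax along rows, whereas the picture in this video is drawn transposed, which is why the normalization is described here as column by column.

```latex
\[
\operatorname{Attention}(Q, K, V)
  = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
```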

play11:05

Before that, there's one other technical detail that so far I've skipped.

play11:09

During the training process, when you run this model on a given text example,

play11:13

and all of the weights are slightly adjusted and tuned to either reward or punish it

play11:17

based on how high a probability it assigns to the true next word in the passage,

play11:21

it turns out to make the whole training process a lot more efficient if you

play11:25

simultaneously have it predict every possible next token following each initial

play11:29

subsequence of tokens in this passage.

play11:31

For example, with the phrase that we've been focusing on,

play11:34

it might also be predicting what words follow creature and what words follow the.

play11:39

This is really nice, because it means what would otherwise

play11:42

be a single training example effectively acts as many.

play11:46

For the purposes of our attention pattern, it means that you never

play11:49

want to allow later words to influence earlier words,

play11:52

since otherwise they could kind of give away the answer for what comes next.

play11:56

What this means is that we want all of these spots here,

play11:59

the ones representing later tokens influencing earlier ones,

play12:02

to somehow be forced to be zero.

play12:05

The simplest thing you might think to do is to set them equal to zero,

play12:08

but if you did that the columns wouldn't add up to one anymore,

play12:11

they wouldn't be normalized.

play12:13

So instead, a common way to do this is that before applying softmax,

play12:16

you set all of those entries to be negative infinity.

play12:19

If you do that, then after applying softmax, all of those get turned into zero,

play12:23

but the columns stay normalized.

play12:26

This process is called masking.
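Here is a minimal sketch of masking, keeping the layout used in this video's picture (rows are keys, columns are queries). Under that layout, "a later token influencing an earlier one" means the key index is greater than the query index, which is the part of the grid strictly below the diagonal. The random scores are placeholders.

```python
import numpy as np

def masked_attention_pattern(scores):
    """Column-wise softmax with later-key entries masked out.

    scores[i, j] is the score of key token i for query token j.
    """
    n = scores.shape[0]
    # True where key index i > query index j (strictly below the diagonal).
    later_key = np.tril(np.ones((n, n), dtype=bool), k=-1)
    masked = np.where(later_key, -np.inf, scores)
    # Every column keeps its diagonal entry, so the column max is finite.
    shifted = masked - masked.max(axis=0, keepdims=True)
    exp = np.exp(shifted)          # exp(-inf) evaluates to 0
    return exp / exp.sum(axis=0, keepdims=True)

rng = np.random.default_rng(2)
scores = rng.normal(size=(4, 4))   # placeholder scores
pattern = masked_attention_pattern(scores)
print(np.round(pattern, 3))
```

The masked entries come out as exact zeros, and each column still sums to 1, which is the reason for masking before softmax instead of zeroing afterwards.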

play12:27

There are versions of attention where you don't apply it, but in our GPT example,

play12:31

even though this is more relevant during the training phase than it would be,

play12:34

say, running it as a chatbot or something like that,

play12:37

you do always apply this masking to prevent later tokens from influencing earlier ones.

play12:42

Another fact that's worth reflecting on about this attention

play12:45

pattern is how its size is equal to the square of the context size.

play12:49

So this is why context size can be a really huge bottleneck for large language models,

play12:54

and scaling it up is non-trivial.
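The quadratic growth is easy to check with a few lines of arithmetic. The context sizes below are illustrative values chosen for this sketch, not figures from the transcript.

```python
def attention_pattern_entries(context_size: int) -> int:
    """Number of key-query scores in one head's attention pattern."""
    return context_size * context_size

# Illustrative context sizes (assumed).
for n in (1_024, 2_048, 4_096):
    entries = attention_pattern_entries(n)
    print(f"context {n:>5} -> {entries:>12,} entries per head")
```

Doubling the context size quadruples the number of entries, and that cost is paid by every attention head.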

play12:56

As you might imagine, motivated by a desire for bigger and bigger context windows,

play13:00

recent years have seen some variations to the attention mechanism aimed at making

play13:04

context more scalable, but right here, you and I are staying focused on the basics.

play13:10

Okay, great, computing this pattern lets the model

play13:12

deduce which words are relevant to which other words.

play13:16

Now you need to actually update the embeddings,

play13:18

allowing words to pass information to whichever other words they're relevant to.

play13:22

For example, you want the embedding of Fluffy to somehow cause a change

play13:26

to Creature that moves it to a different part of this 12,000-dimensional

play13:30

embedding space that more specifically encodes a Fluffy creature.

play13:35

What I'm going to do here is first show you the most straightforward

play13:38

way that you could do this, though there's a slight way that

play13:40

this gets modified in the context of multi-headed attention.

play13:44

This most straightforward way would be to use a third matrix,

play13:47

what we call the value matrix, which you multiply by the embedding of that first word,

play13:51

for example Fluffy.

play13:53

The result of this is what you would call a value vector,

play13:55

and this is something that you add to the embedding of the second word,

play13:59

in this case something you add to the embedding of Creature.

play14:02

So this value vector lives in the same very high-dimensional space as the embeddings.

play14:07

When you multiply this value matrix by the embedding of a word,

play14:10

you might think of it as saying, if this word is relevant to adjusting the meaning of

play14:15

something else, what exactly should be added to the embedding of that something else

play14:19

in order to reflect this?

play14:22

Looking back in our diagram, let's set aside all of the keys and the queries,

play14:26

since after you compute the attention pattern you're done with those,

play14:29

then you're going to take this value matrix and multiply it by every

play14:32

one of those embeddings to produce a sequence of value vectors.

play14:37

You might think of these value vectors as being

play14:39

kind of associated with the corresponding keys.

play14:42

For each column in this diagram, you multiply each of the

play14:45

value vectors by the corresponding weight in that column.

play14:50

For example here, under the embedding of Creature,

play14:52

you would be adding large proportions of the value vectors for Fluffy and Blue,

play14:57

while all of the other value vectors get zeroed out, or at least nearly zeroed out.

play15:02

And then finally, to actually update the embedding associated with this column,

play15:06

previously encoding some context-free meaning of Creature,

play15:09

you add together all of these rescaled values in the column,

play15:13

producing a change that you want to add, that I'll label delta-e,

play15:16

and then you add that to the original embedding.

play15:19

Hopefully what results is a more refined vector encoding the more

play15:23

contextually rich meaning, like that of a fluffy blue creature.

play15:27

And of course you don't just do this to one embedding,

play15:30

you apply the same weighted sum across all of the columns in this picture,

play15:34

producing a sequence of changes. Adding all of those changes to the corresponding

play15:38

embeddings produces a full sequence of more refined embeddings popping out

play15:42

of the attention block.
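
If it helps to see the value step written out, here is a minimal NumPy sketch of it. The toy sizes, the random matrices, and the names `E`, `W_V`, `A`, and `delta_E` are assumptions made for illustration, and the attention pattern is a random stand-in rather than one computed from real keys and queries.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_embed = 4, 8   # toy sizes (assumed); GPT-3 embeddings have 12,288 dimensions

# Column j of E is the embedding of token j, matching the columns of the diagram.
E = rng.normal(size=(d_embed, n_tokens))

# Hypothetical full-size value matrix: embedding space in, embedding space out.
W_V = rng.normal(size=(d_embed, d_embed))

# Stand-in attention pattern. Column j holds the weights used to update token j,
# and each column is normalized to sum to 1, as a softmax would do.
scores = rng.normal(size=(n_tokens, n_tokens))
A = np.exp(scores - scores.max(axis=0, keepdims=True))
A /= A.sum(axis=0, keepdims=True)

V = W_V @ E            # one value vector per token, stored as columns
delta_E = V @ A        # column j is sum_i A[i, j] * V[:, i]
E_refined = E + delta_E
```

Each column of `delta_E` is the weighted sum of value vectors described above, and `E_refined` is what would come out of this one head.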

play15:44

Zooming out, this whole process is what you would describe as a single head of attention.

play15:49

As I've described things so far, this process is parameterized by three distinct

play15:54

matrices, all filled with tunable parameters: the key, the query, and the value.

play15:59

I want to take a moment to continue what we started in the last chapter,

play16:02

with the scorekeeping where we count up the total number of model parameters using the

play16:07

numbers from GPT-3.

play16:09

These key and query matrices each have 12,288 columns, matching the embedding dimension,

play16:15

and 128 rows, matching the dimension of that smaller key-query space.

play16:20

This gives us an additional 1.5 million or so parameters for each one.

play16:24

If you look at that value matrix by contrast, the way I've described things so

play16:30

far would suggest that it's a square matrix that has 12,288 columns and 12,288 rows,

play16:35

since both its inputs and outputs live in this very large embedding space.

play16:41

If true, that would mean about 150 million added parameters.

play16:45

And to be clear, you could do that.

play16:47

You could devote orders of magnitude more parameters

play16:49

to the value map than to the key and query.

play16:52

But in practice, it is much more efficient if instead you make

play16:55

it so that the number of parameters devoted to this value map

play16:57

is the same as the number devoted to the key and the query.

play17:01

This is especially relevant in the setting of

play17:03

running multiple attention heads in parallel.

play17:06

The way this looks is that the value map is factored as a product of two smaller matrices.

play17:11

Conceptually, I would still encourage you to think about the overall linear map,

play17:15

one with inputs and outputs, both in this larger embedding space,

play17:18

for example taking the embedding of blue to this blueness direction that you would

play17:23

add to nouns.

play17:27

It's just that the first of these two matrices has a smaller number of rows,

play17:29

typically the same size as the key-query space.

play17:33

What this means is you can think of it as mapping the

play17:35

large embedding vectors down to a much smaller space.

play17:39

This is not the conventional naming, but I'm going to call this the value down matrix.

play17:43

The second matrix maps from this smaller space back up to the embedding space,

play17:47

producing the vectors that you use to make the actual updates.

play17:51

I'm going to call this one the value up matrix, which again is not conventional.

play17:55

The way that you would see this written in most papers looks a little different.

play17:58

I'll talk about it in a minute.

play17:59

In my opinion, it tends to make things a little more conceptually confusing.

play18:03

To throw in linear algebra jargon here, what we're basically doing

play18:06

is constraining the overall value map to be a low-rank transformation.
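
To make the low-rank point concrete, here is a small NumPy sketch under assumed toy sizes. The names `value_down` and `value_up` follow the non-standard labels used in this chapter, and the random entries are placeholders for learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_embed, d_head = 64, 4    # toy sizes (assumed); the GPT-3 figures are 12,288 and 128

value_down = rng.normal(size=(d_head, d_embed))   # embedding space -> small space
value_up = rng.normal(size=(d_embed, d_head))     # small space -> embedding space

# The overall value map is still a d_embed x d_embed linear map...
value_map = value_up @ value_down

# ...but its rank can never exceed the size of the small space in the middle.
rank = np.linalg.matrix_rank(value_map)

full_params = d_embed * d_embed                       # cost of an unconstrained square matrix
factored_params = value_down.size + value_up.size     # cost of the two factors

# Applying the two factors in sequence is the same as applying the product.
x = rng.normal(size=d_embed)
two_step = value_up @ (value_down @ x)
one_step = value_map @ x
```

With these toy sizes the factored form stores 512 numbers instead of 4,096, while still acting as a map from the embedding space to itself.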

play18:11

Turning back to the parameter count, all four of these matrices have the same size,

play18:16

and adding them all up we get about 6.3 million parameters for one attention head.
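
The arithmetic behind these figures is short enough to check directly. This sketch uses only the GPT-3 dimensions quoted above; the variable names are made up for illustration.

```python
d_embed = 12_288      # embedding dimension
d_key_query = 128     # dimension of the smaller key-query space

per_matrix = d_embed * d_key_query    # 1,572,864  (about 1.5 million)
square_value = d_embed * d_embed      # 150,994,944 (about 150 million, the unfactored value matrix)
per_head = 4 * per_matrix             # 6,291,456  (query, key, value down, value up)
```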

play18:22

As a quick side note, to be a little more accurate,

play18:24

everything described so far is what people would call a self-attention head,

play18:27

to distinguish it from a variation that comes up in other models that's

play18:30

called cross-attention.

play18:32

This isn't relevant to our GPT example, but if you're curious,

play18:35

cross-attention involves models that process two distinct types of data,

play18:39

like text in one language and text in another language that's part of an

play18:43

ongoing generation of a translation, or maybe audio input of speech and an

play18:48

ongoing transcription.

play18:50

A cross-attention head looks almost identical.

play18:52

The only difference is that the key and query maps act on different data sets.

play18:57

In a model doing translation, for example, the keys might come from one language,

play19:02

while the queries come from another, and the attention pattern could describe

play19:06

which words from one language correspond to which words in another.

play19:10

And in this setting there would typically be no masking,

play19:12

since there's not really any notion of later tokens affecting earlier ones.
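
As a rough illustration of that difference, here is a NumPy sketch of a cross-attention pattern. Which sequence supplies the keys and which supplies the queries is an assumption here, as are the toy sizes and all the names; the division by the square root of the key-query dimension and the column-wise softmax are the standard ingredients of an attention pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
d_embed, d_kq = 8, 3       # toy sizes (assumed)
n_src, n_tgt = 5, 4        # e.g. source-language tokens and translation tokens so far

E_src = rng.normal(size=(d_embed, n_src))   # embeddings of one kind of data
E_tgt = rng.normal(size=(d_embed, n_tgt))   # embeddings of the other kind

W_Q = rng.normal(size=(d_kq, d_embed))      # hypothetical query matrix
W_K = rng.normal(size=(d_kq, d_embed))      # hypothetical key matrix

Q = W_Q @ E_tgt    # queries come from one sequence
K = W_K @ E_src    # keys come from the other

scores = K.T @ Q / np.sqrt(d_kq)            # shape (n_src, n_tgt)

# Softmax down each column, with no masking applied.
weights = np.exp(scores - scores.max(axis=0, keepdims=True))
pattern = weights / weights.sum(axis=0, keepdims=True)
```

Note that `pattern` is not square in general, since the two sequences can have different lengths, and every entry is positive because nothing has been masked out.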

play19:17

Staying focused on self-attention though, if you understood everything so far,

play19:20

and if you were to stop here, you would come away with the essence of what attention

play19:24

really is.

play19:25

All that's really left to us is to lay out the

play19:28

sense in which you do this many many different times.

play19:32

In our central example we focused on adjectives updating nouns,

play19:35

but of course there are lots of different ways that context can influence the

play19:38

meaning of a word.

play19:40

If the words "they crashed the" preceded the word "car",

play19:43

it has implications for the shape and structure of that car.

play19:47

And a lot of associations might be less grammatical.

play19:49

If the word wizard is anywhere in the same passage as Harry,

play19:52

it suggests that this might be referring to Harry Potter,

play19:55

whereas if instead the words Queen, Sussex, and William were in that passage,

play20:00

then perhaps the embedding of Harry should instead be updated to refer to the prince.

play20:05

For every different type of contextual updating that you might imagine,

play20:08

the parameters of these key and query matrices would be different to

play20:11

capture the different attention patterns, and the parameters of our

play20:15

value map would be different based on what should be added to the embeddings.

play20:19

And again, in practice the true behavior of these maps is much more

play20:23

difficult to interpret, where the weights are set to do whatever the

play20:26

model needs them to do to best accomplish its goal of predicting the next token.

play20:31

As I said before, everything we described is a single head of attention,

play20:35

and a full attention block inside a transformer consists of what's

play20:38

called multi-headed attention, where you run a lot of these operations in parallel,

play20:43

each with its own distinct key, query, and value maps.

play20:47

GPT-3 for example uses 96 attention heads inside each block.

play20:52

Considering that each one is already a bit confusing,

play20:54

it's certainly a lot to hold in your head.

play20:56

Just to spell it all out very explicitly, this means you have 96

play21:00

distinct key and query matrices producing 96 distinct attention patterns.

play21:05

Then each head has its own distinct value matrices

play21:08

used to produce 96 sequences of value vectors.

play21:12

These are all added together using the corresponding attention patterns as weights.

play21:17

What this means is that for each position in the context, each token,

play21:21

every one of these heads produces a proposed change to be added to the embedding in

play21:26

that position.

play21:27

So what you do is you sum together all of those proposed changes,

play21:31

one for each head, and you add the result to the original embedding of that position.

play21:36

This entire sum here would be one slice of what's outputted from this multi-headed

play21:41

attention block, a single one of those refined embeddings that pops out the other end

play21:47

of it.
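
Here is one way the whole multi-headed update could be sketched in NumPy. It is a toy under stated assumptions, not how any particular model is implemented: the sizes are tiny, the weights are random, the helper names `softmax_columns` and `head_delta` are invented, and the causal masking and square-root scaling are the standard choices for a GPT-style self-attention head.

```python
import numpy as np

def softmax_columns(x):
    # Normalize each column so its entries are non-negative and sum to 1.
    x = x - x.max(axis=0, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=0, keepdims=True)

def head_delta(E, W_Q, W_K, W_down, W_up):
    # Proposed change to every embedding from one self-attention head.
    d_kq, n = W_Q.shape[0], E.shape[1]
    scores = (W_K @ E).T @ (W_Q @ E) / np.sqrt(d_kq)   # rows: keys, columns: queries
    scores[np.tril_indices(n, k=-1)] = -np.inf          # later tokens cannot affect earlier ones
    A = softmax_columns(scores)
    V = W_up @ (W_down @ E)                             # factored value map
    return V @ A

rng = np.random.default_rng(0)
d_embed, d_head, n_tokens, n_heads = 16, 4, 5, 3        # toy sizes; GPT-3 uses 96 heads

E = rng.normal(size=(d_embed, n_tokens))
heads = [
    (rng.normal(size=(d_head, d_embed)),   # query matrix
     rng.normal(size=(d_head, d_embed)),   # key matrix
     rng.normal(size=(d_head, d_embed)),   # value down matrix
     rng.normal(size=(d_embed, d_head)))   # value up matrix
    for _ in range(n_heads)
]

deltas = [head_delta(E, *h) for h in heads]   # one proposed change per head
E_out = E + sum(deltas)                       # sum the proposals, then add to the original
```

Each column of `E_out` is one of the refined embeddings coming out of the block.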

play21:48

Again, this is a lot to think about, so don't

play21:50

worry at all if it takes some time to sink in.

play21:52

The overall idea is that by running many distinct heads in parallel,

play21:56

you're giving the model the capacity to learn many distinct ways that context

play22:00

changes meaning.

play22:03

Pulling up our running tally for parameter count with 96 heads,

play22:07

each including its own variation of these four matrices,

play22:10

each block of multi-headed attention ends up with around 600 million parameters.

play22:16

There's one added slightly annoying thing that I should really

play22:19

mention for any of you who go on to read more about transformers.

play22:22

You remember how I said that the value map is factored out into these two

play22:25

distinct matrices, which I labeled as the value down and the value up matrices?

play22:29

The way that I framed things would suggest that you see this pair of matrices

play22:34

inside each attention head, and you could absolutely implement it this way.

play22:38

That would be a valid design.

play22:40

But the way that you see this written in papers and the way

play22:42

that it's implemented in practice looks a little different.

play22:45

All of these value up matrices for each head appear stapled together in one giant matrix

play22:50

that we call the output matrix, associated with the entire multi-headed attention block.

play22:56

And when you see people refer to the value matrix for a given attention head,

play23:00

they're typically only referring to this first step,

play23:03

the one that I was labeling as the value down projection into the smaller space.
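
The two ways of organizing this give the same result, which a short NumPy check can illustrate. The sizes, the random weights, the stand-in attention patterns, and the name `W_O` for the output matrix are all assumptions for the sake of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d_embed, d_head, n_tokens, n_heads = 16, 4, 5, 3    # toy sizes (assumed)

E = rng.normal(size=(d_embed, n_tokens))
value_down = [rng.normal(size=(d_head, d_embed)) for _ in range(n_heads)]
value_up = [rng.normal(size=(d_embed, d_head)) for _ in range(n_heads)]

# Stand-in attention patterns, one per head, with columns summing to 1.
patterns = []
for _ in range(n_heads):
    raw = rng.random((n_tokens, n_tokens))
    patterns.append(raw / raw.sum(axis=0, keepdims=True))

# Framing used in this chapter: each head applies its own value up matrix, then sum.
per_head = sum(
    up @ (down @ E @ A)
    for up, down, A in zip(value_up, value_down, patterns)
)

# Framing used in most papers: concatenate the small per-head outputs,
# then apply one output matrix made of the value up matrices side by side.
W_O = np.hstack(value_up)                                   # (d_embed, n_heads * d_head)
stacked = np.vstack([down @ E @ A for down, A in zip(value_down, patterns)])
combined = W_O @ stacked

same = np.allclose(per_head, combined)
```

The agreement is just block matrix multiplication: a matrix of side-by-side blocks times a stack of blocks equals the sum of the block products.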

play23:08

For the curious among you, I've left an on-screen note about it.

play23:11

It's one of those details that runs the risk of distracting

play23:13

from the main conceptual points, but I do want to call it out

play23:16

just so that you know if you read about this in other sources.

play23:19

Setting aside all the technical nuances, in the preview from the last chapter we saw

play23:23

how data flowing through a transformer doesn't just flow through a single attention block.

play23:28

For one thing, it also goes through these other operations called multi-layer perceptrons.

play23:33

We'll talk more about those in the next chapter.

play23:35

And then it repeatedly goes through many many copies of both of these operations.

play23:39

What this means is that after a given word imbibes some of its context,

play23:43

there are many more chances for this more nuanced embedding

play23:47

to be influenced by its more nuanced surroundings.

play23:50

The further down the network you go, with each embedding taking in more and more

play23:54

meaning from all the other embeddings, which themselves are getting more and more

play23:59

nuanced, the hope is that there's the capacity to encode higher level and more

play24:03

abstract ideas about a given input beyond just descriptors and grammatical structure.

play24:07

Things like sentiment and tone and whether it's a poem and what underlying

play24:11

scientific truths are relevant to the piece and things like that.

play24:16

Turning back one more time to our scorekeeping, GPT-3 includes 96 distinct layers,

play24:22

so the total number of key, query, and value parameters is multiplied by another 96,

play24:27

which brings the total sum to just under 58 billion distinct parameters

play24:32

devoted to all of the attention heads.

play24:34

That is a lot to be sure, but it's only about a third

play24:38

of the 175 billion that are in the network in total.
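
For anyone who wants to verify the running tally, the multiplication can be written out as follows. It uses only the numbers quoted in this chapter, and the variable names are illustrative.

```python
per_head = 4 * 12_288 * 128        # 6,291,456 parameters in one attention head
heads_per_block = 96
layers = 96

per_block = heads_per_block * per_head    # 603,979,776    (around 600 million)
attention_total = layers * per_block      # 57,982,058,496 (just under 58 billion)

share = attention_total / 175e9           # roughly 0.33 of the 175 billion total
```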

play24:41

So even though attention gets all of the attention,

play24:44

the majority of parameters come from the blocks sitting in between these steps.

play24:48

In the next chapter, you and I will talk more about those

play24:51

other blocks and also a lot more about the training process.

play24:54

A big part of the story for the success of the attention mechanism is not so much any

play24:58

specific kind of behavior that it enables, but the fact that it's extremely

play25:03

parallelizable, meaning that you can run a huge number of computations in a short time

play25:07

using GPUs.

play25:09

Given that one of the big lessons about deep learning in the last decade or two has

play25:13

been that scale alone seems to give huge qualitative improvements in model performance,

play25:17

there's a huge advantage to parallelizable architectures that let you do this.

play25:22

If you want to learn more about this stuff, I've left lots of links in the description.

play25:25

In particular, anything produced by Andrej Karpathy or Chris Olah tends to be pure gold.

play25:30

In this video, I wanted to just jump into attention in its current form,

play25:33

but if you're curious about more of the history for how we got here

play25:36

and how you might reinvent this idea for yourself,

play25:38

my friend Vivek just put up a couple videos giving a lot more of that motivation.

play25:43

Also, Brit Cruise from the channel Art of the Problem has

play25:45

a really nice video about the history of large language models.

play26:04

Thank you.

