Visualizing Attention, a Transformer's Heart | Chapter 6, Deep Learning

3Blue1Brown
7 Apr 2024 · 26:09

Summary

TLDR: This video script explains how the attention mechanism in transformers works, a key technology in modern AI language models. It describes how a model processes text and predicts the next word by turning tokens into high-dimensional vectors that absorb contextual meaning. The focus is on the mechanics of attention, which let the meaning of words be understood in context, and on visualizing how the data is processed. The script also highlights the parallelizability of the architecture, which allows a huge number of computations to be carried out quickly and matters greatly for the capability of AI models.

Takeaways

  • 🧠 The transformer is a key technology in modern AI language models, introduced in the famous 2017 paper 'Attention is All You Need'.
  • 🔍 The model's purpose is to take in text and predict the next word.
  • 📄 The input text is split into so-called tokens, which are often words or pieces of words.
  • 📊 The first step of a transformer is to associate each token with a high-dimensional vector embedding.
  • 🌐 In this high-dimensional space of embeddings, different directions can correspond to semantic meanings.
  • 🤔 The transformer aims to progressively adjust these embeddings so that they encode not just the meaning of an individual word but much richer contextual meaning.
  • 🤷‍♂️ Many people find the attention mechanism in transformers confusing, but it is what enables the model to understand context.
  • 🔄 The attention process involves several steps: query, key, and value vectors are computed to determine how relevant tokens are to one another.
  • 🔑 The key and query matrices, as well as the value matrix, are filled with parameters that the model learns from data.
  • 🎯 The attention pattern lets the model determine which words are relevant for updating the meaning of other words.
  • 🔗 Multi-headed attention in a transformer makes it possible to learn many different kinds of contextual meaning changes in parallel.

Q & A

  • What is the goal of the model studied in the script?

    -The goal of the model is to take in a piece of text and predict the next word.

  • What are 'tokens' in the context of the script?

    -Tokens are small pieces of the text, often words or pieces of words.

  • How is a token in a transformer associated with a high-dimensional vector, the so-called 'embedding'?

    -Each token is associated with an embedding by applying a matrix (the so-called embedding matrix) to the token.

  • What does 'high-dimensional space' mean in the context of embeddings?

    -The 'high-dimensional space' is the space in which all possible embeddings live, and different directions in this space can correspond to semantic meanings.

  • What is a transformer aiming for when it adjusts these embeddings?

    -A transformer aims to adjust the embeddings step by step so that they encode not just the meaning of an individual word but richer contextual meaning.

  • What is the job of the attention mechanism in a transformer?

    -The attention mechanism lets the model refine the meaning of a word based on its context and move information from one embedding into another.

  • How many kinds of matrices are involved in a single attention-head operation?

    -There are three kinds of matrices: the query matrix, the key matrix, and the value matrix.

  • What is an 'attention head', and what happens in a 'multi-headed attention' block?

    -An 'attention head' is one attention operation with its own key, query, and value matrices. A 'multi-headed attention' block runs many such operations in parallel to capture many different kinds of contextual meaning changes.

  • How many attention heads does GPT-3 have in each block?

    -GPT-3 has 96 attention heads in each block.

  • What additional operations does data pass through in a transformer after going through an attention block?

    -The data also passes through so-called multi-layer perceptrons (MLPs), and this whole process is then repeated many times to further refine the embeddings.

  • Which ability of the attention mechanism is so important for the performance of large language models?

    -The attention mechanism's ability to run a huge number of computations in parallel is crucial for the performance of large language models, since parallelization improves scalability and efficiency.

Outlines

00:00

🧠 Introduction to Transformers and the Attention Mechanism

This paragraph introduces how transformers and the attention mechanism work, a key component of modern AI tools and language models. It recalls the model's goal: to read in text and predict the next word. The units of the text, called tokens, are converted into high-dimensional vectors called embeddings. The emphasis is on the significance of directions in this space, which can reflect semantic meanings. The transformer is designed to progressively adjust these embeddings so that they take on richer contextual meaning.

05:04

🔍 The Attention Mechanism and How It Is Used

This paragraph explains how the attention mechanism works and how it is used in practice. Examples show how context shapes the understanding of words like 'mole'. The focus is on visualizing the data processing and on the goal of adjusting the embeddings so that they encode not just the meaning of an individual word but richer contextual meaning. It also notes the difficulty many people have in understanding attention mechanisms and gives examples of how adjectives can adjust the meaning of nouns.

10:07

🧮 A Detailed Walkthrough of the Attention Computations

This paragraph goes into the technical details of the attention computations. It describes how queries, keys, and values are used to determine how relevant words are for updating the meaning of other words. It explains how the queries and keys are produced from the embeddings using matrices and how dot products are used to measure how well they match. The notion of an attention pattern is introduced, and the softmax function is used to normalize the values into a distribution that behaves like probabilities.

15:07

🔄 Updating the Embeddings with the Attention Pattern

This paragraph describes how the attention pattern is used to update the embeddings and move information from relevant words into other words. It explains how the value matrix turns the embeddings into so-called value vectors, which are then added to the original embeddings to update their meanings. The process is described as part of one attention head, and it is emphasized that a transformer runs many such heads in parallel.

20:09

🌐 Multi-Headed Attention and Its Effects

This paragraph deepens the concept of multi-headed attention, in which many different attention operations run in parallel to capture different kinds of contextual updates. It explains how each head has its own key, query, and value matrices and how their outputs are combined to update the embeddings. The focus is on the capacity to learn many different ways that context changes meaning, and on the difficulty of interpreting the complex weights stored in these matrices.

25:09

🔄 Scalability and Parallelism of the Attention Architecture

The final paragraph emphasizes how parallelizable the attention architecture is and how this makes it possible to run a huge number of computations in a short time, which matters greatly for the performance of AI models. It also points to additional resources on the history and workings of attention mechanisms and language models, and to upcoming discussions of the other parts of the transformer architecture and the training process.

Keywords

💡Transformer

A transformer is a key concept in modern AI, used in language processing and other models. First introduced in a famous 2017 paper, it takes in text and predicts the next word. The video explains how a transformer creates internal vectors, called embeddings, for each text token and then folds in contextual meaning to improve the accuracy of its predictions.

💡Attention Mechanism

The attention mechanism is the core piece of a transformer that lets the model understand the meaning of words in context. The video explains how this mechanism adjusts the embedding vectors according to the surrounding words so that each word's contextual meaning is captured.

💡Embedding

An embedding is a high-dimensional vector that represents an entity such as a word. In the context of the video, the goal is for these vectors to encode not only the meaning of the individual word but also its contextual meaning in the text, by being influenced by the other embeddings around it.

💡Token

A token is a piece of text, usually a word or part of a word, that serves as the unit a transformer processes. The video describes how tokens are converted into embeddings to capture semantic meaning.

💡Context

Context refers to the surroundings in which a word or phrase appears and which shape its meaning. The video shows how the attention mechanism is used to understand the contextual meaning of words and to adjust the embeddings accordingly.

💡Query, Key, Value

In the attention mechanism, query, key, and value are the central concepts used to measure how relevant the tokens in a text are to one another. Queries and keys are compared to determine how well they match, while values carry the actual information that is used to update the embeddings.

💡Self-Attention

Self-attention is the kind of attention in which the model attends to its own input, determining the relationships between the words of the text itself. The video explains how self-attention helps capture the contextual meaning of the words.

💡Multi-Headed Attention

Multi-headed attention is a design in which many attention heads run in parallel to capture different aspects of context. The video describes how GPT-3 uses 96 attention heads per block to capture the complex semantics of a text.

💡Parameter

Parameters, in this context, are the weights the model learns in order to make the best prediction of the next word. The video discusses the enormous number of parameters in a transformer model, GPT-3 in particular, which help capture the complex relationships in a text.

💡Masking

Masking is the step in which certain entries of the attention pattern are switched off so that later words in the text cannot influence earlier ones. The video explains how masking is used so that the prediction of the next word is not given away.

Highlights

Transformers are a key technology in modern AI tools, introduced in the famous 2017 paper 'Attention is All You Need'.

The goal of a transformer is to read in a piece of text and predict the next word.

Text is broken up into so-called tokens, often words or pieces of words.

Tokens are associated with embedding vectors in a high-dimensional space whose directions can represent semantic meanings.

The attention mechanism is central to transformers; it can be confusing, but it is essential to understand.

Attention makes it possible to use context to refine the meaning of words, as in the example of 'mole' in different sentences.

The embedding vector for 'tower' could be updated by words like 'Eiffel' to encode a more specific meaning.

An attention block makes it possible to move information from one embedding into another, regardless of how far apart they are.

The model learns how to update the embeddings so that the prediction of the next token draws on the full context.

A simple example shows how adjectives can update the meaning of nouns; this is described as a 'single head of attention'.

The query, key, and value matrices are model parameters learned from data, and they determine the attention patterns.

The dot products between key and query vectors measure how relevant words are to one another in context.

Softmax normalizes the values so that they behave like a probability distribution.

Masking prevents later words from influencing earlier ones, which is important for training.

Value vectors are added to the embeddings to update them according to contextual relevance.

Multi-headed attention makes it possible to process many different kinds of contextual updates in parallel.

GPT-3 uses 96 attention heads per block, which greatly increases its capacity for parallel processing.

The ability to parallelize is crucial to the performance of large language models and leads to qualitative improvements.

Most of the parameters in the model come not from the attention heads but from the blocks sitting in between.

The transformer architecture enables highly parallel computation, which helps scale up model performance.

Transcripts

play00:00

In the last chapter, you and I started to step

play00:02

through the internal workings of a transformer.

play00:04

This is one of the key pieces of technology inside large language models,

play00:07

and a lot of other tools in the modern wave of AI.

play00:10

It first hit the scene in a now-famous 2017 paper called Attention is All You Need,

play00:15

and in this chapter you and I will dig into what this attention mechanism is,

play00:19

visualizing how it processes data.

play00:26

As a quick recap, here's the important context I want you to have in mind.

play00:30

The goal of the model that you and I are studying is to

play00:33

take in a piece of text and predict what word comes next.

play00:36

The input text is broken up into little pieces that we call tokens,

play00:40

and these are very often words or pieces of words,

play00:43

but just to make the examples in this video easier for you and me to think about,

play00:47

let's simplify by pretending that tokens are always just words.

play00:51

The first step in a transformer is to associate each token

play00:54

with a high-dimensional vector, what we call its embedding.
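As a minimal sketch of that lookup step, with a toy vocabulary and toy sizes standing in for the real thing:

```python
import numpy as np

# Toy setup: a 5-word vocabulary and 8-dimensional embeddings (illustrative sizes only).
vocab = ["a", "fluffy", "blue", "creature", "roamed"]
d_model = 8
rng = np.random.default_rng(0)

# The embedding matrix is a learned lookup table with one row per vocabulary entry;
# random numbers stand in for trained weights here.
W_E = rng.normal(size=(len(vocab), d_model))

# "Embedding" a token is just indexing into that table.
token_ids = [vocab.index(w) for w in ["a", "fluffy", "blue", "creature"]]
E = W_E[token_ids]   # shape (4, 8): one embedding vector per token
print(E.shape)
```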

play00:57

The most important idea I want you to have in mind is how directions in this

play01:02

high-dimensional space of all possible embeddings can correspond with semantic meaning.

play01:07

In the last chapter we saw an example for how direction can correspond to gender,

play01:11

in the sense that adding a certain step in this space can take you from the

play01:15

embedding of a masculine noun to the embedding of the corresponding feminine noun.
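As a purely illustrative bit of arithmetic in that spirit, with made-up three-dimensional vectors rather than real learned embeddings:

```python
import numpy as np

# Pretend the third coordinate encodes a masculine-to-feminine direction.
man   = np.array([0.9, 0.1, 0.0])
woman = np.array([0.9, 0.1, 1.0])
king  = np.array([0.2, 0.8, 0.0])

gender_step = woman - man            # the "certain step" described above
queen_estimate = king + gender_step  # moving another masculine noun along the same direction
print(queen_estimate)                # [0.2 0.8 1. ]
```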

play01:20

That's just one example; you could imagine how many other directions in this

play01:23

high-dimensional space could correspond to numerous other aspects of a word's meaning.

play01:28

The aim of a transformer is to progressively adjust these

play01:31

embeddings so that they don't merely encode an individual word,

play01:35

but instead they bake in some much, much richer contextual meaning.

play01:40

I should say up front that a lot of people find the attention mechanism,

play01:43

this key piece in a transformer, very confusing,

play01:46

so don't worry if it takes some time for things to sink in.

play01:49

I think that before we dive into the computational details and

play01:52

all the matrix multiplications, it's worth thinking about a couple

play01:55

examples for the kind of behavior that we want attention to enable.

play02:00

Consider the phrases American shrew mole, one mole of carbon dioxide,

play02:04

and take a biopsy of the mole.

play02:06

You and I know that the word mole has different meanings in each one of these,

play02:10

based on the context.

play02:11

But after the first step of a transformer, the one that breaks up the text

play02:15

and associates each token with a vector, the vector that's associated with

play02:18

mole would be the same in all of these cases, because this initial token

play02:22

embedding is effectively a lookup table with no reference to the context.

play02:26

It's only in the next step of the transformer that the surrounding

play02:30

embeddings have the chance to pass information into this one.

play02:33

The picture you might have in mind is that there are multiple distinct directions in

play02:38

this embedding space encoding the multiple distinct meanings of the word mole,

play02:42

and that a well-trained attention block calculates what you need to add to the generic

play02:47

embedding to move it to one of these specific directions, as a function of the context.

play02:53

To take another example, consider the embedding of the word tower.

play02:57

This is presumably some very generic, non-specific direction in the space,

play03:01

associated with lots of other large, tall nouns.

play03:04

If this word was immediately preceded by Eiffel,

play03:06

you could imagine wanting the mechanism to update this vector so that

play03:10

it points in a direction that more specifically encodes the Eiffel tower,

play03:14

maybe correlated with vectors associated with Paris and France and things made of steel.

play03:19

If it was also preceded by the word miniature,

play03:22

then the vector should be updated even further,

play03:24

so that it no longer correlates with large, tall things.

play03:29

More generally than just refining the meaning of a word,

play03:32

the attention block allows the model to move information encoded in

play03:35

one embedding to that of another, potentially ones that are quite far away,

play03:39

and potentially with information that's much richer than just a single word.

play03:43

What we saw in the last chapter was how after all of the vectors flow through the

play03:47

network, including many different attention blocks,

play03:50

the computation you perform to produce a prediction of the next token is entirely a

play03:55

function of the last vector in the sequence.

play03:59

Imagine, for example, that the text you input is most of an entire mystery novel,

play04:03

all the way up to a point near the end, which reads, therefore the murderer was.

play04:08

If the model is going to accurately predict the next word,

play04:11

that final vector in the sequence, which began its life simply embedding the word was,

play04:16

will have to have been updated by all of the attention blocks to represent much,

play04:20

much more than any individual word, somehow encoding all of the information

play04:24

from the full context window that's relevant to predicting the next word.

play04:29

To step through the computations, though, let's take a much simpler example.

play04:32

Imagine that the input includes the phrase, a

play04:35

fluffy blue creature roamed the verdant forest.

play04:38

And for the moment, suppose that the only type of update that we care about

play04:42

is having the adjectives adjust the meanings of their corresponding nouns.

play04:47

What I'm about to describe is what we would call a single head of attention,

play04:50

and later we will see how the attention block consists of many different heads run in

play04:54

parallel.

play04:56

Again, the initial embedding for each word is some high dimensional vector

play04:59

that only encodes the meaning of that particular word with no context.

play05:04

Actually, that's not quite true.

play05:05

They also encode the position of the word.

play05:07

There's a lot more to say about the way that positions are encoded, but right now,

play05:11

all you need to know is that the entries of this vector are enough to

play05:15

tell you both what the word is and where it exists in the context.

play05:19

Let's go ahead and denote these embeddings with the letter e.

play05:22

The goal is to have a series of computations produce a new refined

play05:26

set of embeddings where, for example, those corresponding to the

play05:29

nouns have ingested the meaning from their corresponding adjectives.

play05:33

And playing the deep learning game, we want most of the computations

play05:37

involved to look like matrix-vector products, where the matrices are

play05:40

full of tunable weights, things that the model will learn based on data.

play05:44

To be clear, I'm making up this example of adjectives updating nouns just to

play05:48

illustrate the type of behavior that you could imagine an attention head doing.

play05:52

As with so much deep learning, the true behavior is much harder to parse because it's

play05:57

based on tweaking and tuning a huge number of parameters to minimize some cost function.

play06:01

It's just that as we step through all of the different matrices filled with parameters

play06:05

that are involved in this process, I think it's really helpful to have an imagined

play06:09

example of something that it could be doing to help keep it all more concrete.

play06:14

For the first step of this process, you might imagine each noun, like creature,

play06:18

asking the question, hey, are there any adjectives sitting in front of me?

play06:22

And for the words fluffy and blue, to each be able to answer,

play06:25

yeah, I'm an adjective and I'm in that position.

play06:28

That question is somehow encoded as yet another vector,

play06:32

another list of numbers, which we call the query for this word.

play06:36

This query vector though has a much smaller dimension than the embedding vector, say 128.

play06:42

Computing this query looks like taking a certain matrix,

play06:46

which I'll label wq, and multiplying it by the embedding.

play06:50

Compressing things a bit, let's write that query vector as q,

play06:54

and then anytime you see me put a matrix next to an arrow like this one,

play06:58

it's meant to represent that multiplying this matrix by the vector at the arrow's start

play07:02

gives you the vector at the arrow's end.

play07:05

In this case, you multiply this matrix by all of the embeddings in the context,

play07:10

producing one query vector for each token.

play07:13

The entries of this matrix are parameters of the model,

play07:16

which means the true behavior is learned from data, and in practice,

play07:19

what this matrix does in a particular attention head is challenging to parse.

play07:23

But for our sake, imagining an example that we might hope that it would learn,

play07:27

we'll suppose that this query matrix maps the embeddings of nouns to

play07:31

certain directions in this smaller query space that somehow encodes

play07:34

the notion of looking for adjectives in preceding positions.

play07:38

As to what it does to other embeddings, who knows?

play07:41

Maybe it simultaneously tries to accomplish some other goal with those.

play07:44

Right now, we're laser focused on the nouns.

play07:47

At the same time, associated with this is a second matrix called the key matrix,

play07:51

which you also multiply by every one of the embeddings.

play07:55

This produces a second sequence of vectors that we call the keys.
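A small sketch of these two projections, using a row-vector convention (embedding times matrix) and toy sizes in place of GPT-3's 12,288-dimensional embeddings and 128-dimensional key/query space; the random W_Q and W_K stand in for learned weights:

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d_model, d_qk = 4, 8, 4   # toy sizes

E = rng.normal(size=(n_tokens, d_model))   # one embedding per token, as rows

W_Q = rng.normal(size=(d_model, d_qk))     # query matrix (learned in a real model)
W_K = rng.normal(size=(d_model, d_qk))     # key matrix   (learned in a real model)

Q = E @ W_Q   # one query vector per token, shape (n_tokens, d_qk)
K = E @ W_K   # one key vector per token,   shape (n_tokens, d_qk)
print(Q.shape, K.shape)
```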

play07:59

Conceptually, you want to think of the keys as potentially answering the queries.

play08:03

This key matrix is also full of tunable parameters, and just like the query matrix,

play08:07

it maps the embedding vectors to that same smaller dimensional space.

play08:12

You think of the keys as matching the queries whenever they closely align with each other.

play08:17

In our example, you would imagine that the key matrix maps the adjectives like fluffy

play08:21

and blue to vectors that are closely aligned with the query produced by the word creature.

play08:27

To measure how well each key matches each query,

play08:30

you compute a dot product between each possible key-query pair.

play08:34

I like to visualize a grid full of a bunch of dots,

play08:37

where the bigger dots correspond to the larger dot products,

play08:40

the places where the keys and queries align.

play08:43

For our adjective noun example, that would look a little more like this,

play08:47

where if the keys produced by fluffy and blue really do align closely with the query

play08:52

produced by creature, then the dot products in these two spots would be some large

play08:57

positive numbers.

play08:59

In the lingo, machine learning people would say that this means the

play09:02

embeddings of fluffy and blue attend to the embedding of creature.

play09:06

By contrast, the dot product between the key for some other

play09:09

word like the and the query for creature would be some small

play09:12

or negative value that reflects that the two words are unrelated to each other.

play09:17

So we have this grid of values that can be any real number from

play09:21

negative infinity to infinity, giving us a score for how relevant

play09:25

each word is to updating the meaning of every other word.

play09:29

The way we're about to use these scores is to take a certain

play09:32

weighted sum along each column, weighted by the relevance.

play09:36

So instead of having values range from negative infinity to infinity,

play09:40

what we want is for the numbers in these columns to be between 0 and 1,

play09:44

and for each column to add up to 1, as if they were a probability distribution.

play09:49

If you're coming in from the last chapter, you know what we need to do then.

play09:52

We compute a softmax along each one of these columns to normalize the values.

play10:00

In our picture, after you apply softmax to all of the columns,

play10:03

we'll fill in the grid with these normalized values.

play10:06

At this point you're safe to think about each column as giving weights according

play10:10

to how relevant the word on the left is to the corresponding value at the top.

play10:15

We call this grid an attention pattern.
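A sketch of how that grid and the column-wise softmax could be computed, repeating the toy setup from the earlier sketch so this block runs on its own (the division by the square root of the key/query dimension is the detail mentioned a little further on):

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d_model, d_qk = 4, 8, 4
E = rng.normal(size=(n_tokens, d_model))
W_Q, W_K = rng.normal(size=(d_model, d_qk)), rng.normal(size=(d_model, d_qk))
Q, K = E @ W_Q, E @ W_K

# Grid of dot products: entry [i, j] scores how well key i matches query j.
scores = K @ Q.T / np.sqrt(d_qk)

# Softmax down each column so every column sums to 1: the attention pattern.
scores -= scores.max(axis=0, keepdims=True)   # subtract the column max for stability
pattern = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
print(pattern.sum(axis=0))                    # [1. 1. 1. 1.]
```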

play10:18

Now if you look at the original transformer paper,

play10:20

there's a really compact way that they write this all down.

play10:23

Here the variables q and k represent the full arrays of query

play10:27

and key vectors respectively, those little vectors you get by

play10:31

multiplying the embeddings by the query and the key matrices.

play10:35

This expression up in the numerator is a really compact way to represent

play10:39

the grid of all possible dot products between pairs of keys and queries.

play10:44

A small technical detail that I didn't mention is that for numerical stability,

play10:48

it happens to be helpful to divide all of these values by the

play10:51

square root of the dimension in that key query space.

play10:54

Then this softmax that's wrapped around the full expression

play10:57

is meant to be understood to apply column by column.
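Written out, that compact expression from the paper is the following, where d_k is the key/query dimension; the grid drawn in this video is the transpose of the softmax argument, which is why the softmax is described as acting column by column here:

```latex
\text{Attention}(Q, K, V) \;=\; \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```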

play11:01

As to that v term, we'll talk about it in just a second.

play11:05

Before that, there's one other technical detail that so far I've skipped.

play11:09

During the training process, when you run this model on a given text example,

play11:13

and all of the weights are slightly adjusted and tuned to either reward or punish it

play11:17

based on how high a probability it assigns to the true next word in the passage,

play11:21

it turns out to make the whole training process a lot more efficient if you

play11:25

simultaneously have it predict every possible next token following each initial

play11:29

subsequence of tokens in this passage.

play11:31

For example, with the phrase that we've been focusing on,

play11:34

it might also be predicting what words follow creature and what words follow the.

play11:39

This is really nice, because it means what would otherwise

play11:42

be a single training example effectively acts as many.

play11:46

For the purposes of our attention pattern, it means that you never

play11:49

want to allow later words to influence earlier words,

play11:52

since otherwise they could kind of give away the answer for what comes next.

play11:56

What this means is that we want all of these spots here,

play11:59

the ones representing later tokens influencing earlier ones,

play12:02

to somehow be forced to be zero.

play12:05

The simplest thing you might think to do is to set them equal to zero,

play12:08

but if you did that the columns wouldn't add up to one anymore,

play12:11

they wouldn't be normalized.

play12:13

So instead, a common way to do this is that before applying softmax,

play12:16

you set all of those entries to be negative infinity.

play12:19

If you do that, then after applying softmax, all of those get turned into zero,

play12:23

but the columns stay normalized.

play12:26

This process is called masking.
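A sketch of that masking step on a toy grid, keeping the convention from the earlier sketches that rows are keys and columns are queries, so the entries to suppress are the ones strictly below the diagonal:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
scores = rng.normal(size=(n, n))   # toy key-query dot products: rows = keys, columns = queries

# A later word (key index i) must not influence an earlier word (query index j),
# so every entry with i > j is set to -infinity before the softmax.
mask = np.tril(np.ones((n, n), dtype=bool), k=-1)   # strictly below the diagonal
scores[mask] = -np.inf

scores -= scores.max(axis=0, keepdims=True)
pattern = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
print(np.round(pattern, 2))   # masked entries are exactly zero, columns still sum to 1
```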

play12:27

There are versions of attention where you don't apply it, but in our GPT example,

play12:31

even though this is more relevant during the training phase than it would be,

play12:34

say, running it as a chatbot or something like that,

play12:37

you do always apply this masking to prevent later tokens from influencing earlier ones.

play12:42

Another fact that's worth reflecting on about this attention

play12:45

pattern is how its size is equal to the square of the context size.

play12:49

So this is why context size can be a really huge bottleneck for large language models,

play12:54

and scaling it up is non-trivial.

play12:56

As you imagine, motivated by a desire for bigger and bigger context windows,

play13:00

recent years have seen some variations to the attention mechanism aimed at making

play13:04

context more scalable, but right here, you and I are staying focused on the basics.

play13:10

Okay, great, computing this pattern lets the model

play13:12

deduce which words are relevant to which other words.

play13:16

Now you need to actually update the embeddings,

play13:18

allowing words to pass information to whichever other words they're relevant to.

play13:22

For example, you want the embedding of Fluffy to somehow cause a change

play13:26

to Creature that moves it to a different part of this 12,000-dimensional

play13:30

embedding space that more specifically encodes a Fluffy creature.

play13:35

What I'm going to do here is first show you the most straightforward

play13:38

way that you could do this, though there's a slight way that

play13:40

this gets modified in the context of multi-headed attention.

play13:44

This most straightforward way would be to use a third matrix,

play13:47

what we call the value matrix, which you multiply by the embedding of that first word,

play13:51

for example Fluffy.

play13:53

The result of this is what you would call a value vector,

play13:55

and this is something that you add to the embedding of the second word,

play13:59

in this case something you add to the embedding of Creature.

play14:02

So this value vector lives in the same very high-dimensional space as the embeddings.

play14:07

When you multiply this value matrix by the embedding of a word,

play14:10

you might think of it as saying, if this word is relevant to adjusting the meaning of

play14:15

something else, what exactly should be added to the embedding of that something else

play14:19

in order to reflect this?

play14:22

Looking back in our diagram, let's set aside all of the keys and the queries,

play14:26

since after you compute the attention pattern you're done with those,

play14:29

then you're going to take this value matrix and multiply it by every

play14:32

one of those embeddings to produce a sequence of value vectors.

play14:37

You might think of these value vectors as being

play14:39

kind of associated with the corresponding keys.

play14:42

For each column in this diagram, you multiply each of the

play14:45

value vectors by the corresponding weight in that column.

play14:50

For example here, under the embedding of Creature,

play14:52

you would be adding large proportions of the value vectors for Fluffy and Blue,

play14:57

while all of the other value vectors get zeroed out, or at least nearly zeroed out.

play15:02

And then finally, the way to actually update the embedding associated with this column,

play15:06

previously encoding some context-free meaning of Creature,

play15:09

you add together all of these rescaled values in the column,

play15:13

producing a change that you want to add, that I'll label delta-e,

play15:16

and then you add that to the original embedding.

play15:19

Hopefully what results is a more refined vector encoding the more

play15:23

contextually rich meaning, like that of a fluffy blue creature.

play15:27

And of course you don't just do this to one embedding,

play15:30

you apply the same weighted sum across all of the columns in this picture,

play15:34

producing a sequence of changes, adding all of those changes to the corresponding

play15:38

embeddings, produces a full sequence of more refined embeddings popping out

play15:42

of the attention block.

play15:44

Zooming out, this whole process is what you would describe as a single head of attention.
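Putting the pieces together, here is a minimal single-head sketch under the same toy assumptions as before, with random stand-ins for the learned matrices and a full-size value matrix rather than the factored version discussed a bit later:

```python
import numpy as np

rng = np.random.default_rng(3)
n_tokens, d_model, d_qk = 4, 8, 4
E = rng.normal(size=(n_tokens, d_model))     # embeddings, one row per token

W_Q = rng.normal(size=(d_model, d_qk))
W_K = rng.normal(size=(d_model, d_qk))
W_V = rng.normal(size=(d_model, d_model))    # "full" value map, for simplicity

# Attention pattern (rows = keys, columns = queries), with causal masking.
scores = (E @ W_K) @ (E @ W_Q).T / np.sqrt(d_qk)
scores[np.tril(np.ones((n_tokens, n_tokens), dtype=bool), k=-1)] = -np.inf
scores -= scores.max(axis=0, keepdims=True)
pattern = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)

# Value vectors, one per token, then a weighted sum down each column.
V = E @ W_V                     # shape (n_tokens, d_model)
delta_E = pattern.T @ V         # row j is the change to add to token j's embedding
E_refined = E + delta_E         # the refined embeddings leaving this head
print(E_refined.shape)          # (4, 8)
```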

play15:49

As I've described things so far, this process is parameterized by three distinct

play15:54

matrices, all filled with tunable parameters, the key, the query, and the value.

play15:59

I want to take a moment to continue what we started in the last chapter,

play16:02

with the scorekeeping where we count up the total number of model parameters using the

play16:07

numbers from GPT-3.

play16:09

These key and query matrices each have 12,288 columns, matching the embedding dimension,

play16:15

and 128 rows, matching the dimension of that smaller key query space.

play16:20

This gives us an additional 1.5 million or so parameters for each one.

play16:24

If you look at that value matrix by contrast, the way I've described things so

play16:30

far would suggest that it's a square matrix that has 12,288 columns and 12,288 rows,

play16:35

since both its inputs and outputs live in this very large embedding space.

play16:41

If true, that would mean about 150 million added parameters.

play16:45

And to be clear, you could do that.

play16:47

You could devote orders of magnitude more parameters

play16:49

to the value map than to the key and query.

play16:52

But in practice, it is much more efficient if instead you make

play16:55

it so that the number of parameters devoted to this value map

play16:57

is the same as the number devoted to the key and the query.

play17:01

This is especially relevant in the setting of

play17:03

running multiple attention heads in parallel.

play17:06

The way this looks is that the value map is factored as a product of two smaller matrices.

play17:11

Conceptually, I would still encourage you to think about the overall linear map,

play17:15

one with inputs and outputs, both in this larger embedding space,

play17:18

for example taking the embedding of blue to this blueness direction that you would

play17:23

add to nouns.

play17:27

It's just that it's a smaller number of rows,

play17:29

typically the same size as the key query space.

play17:33

What this means is you can think of it as mapping the

play17:35

large embedding vectors down to a much smaller space.

play17:39

This is not the conventional naming, but I'm going to call this the value down matrix.

play17:43

The second matrix maps from this smaller space back up to the embedding space,

play17:47

producing the vectors that you use to make the actual updates.

play17:51

I'm going to call this one the value up matrix, which again is not conventional.

play17:55

The way that you would see this written in most papers looks a little different.

play17:58

I'll talk about it in a minute.

play17:59

In my opinion, it tends to make things a little more conceptually confusing.

play18:03

To throw in linear algebra jargon here, what we're basically doing

play18:06

is constraining the overall value map to be a low rank transformation.
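A sketch of that factorization with toy sizes; value_down and value_up follow this video's non-standard naming:

```python
import numpy as np

rng = np.random.default_rng(4)
d_model, d_value = 8, 4   # toy stand-ins for 12,288 and 128

# Instead of one d_model x d_model value map, use two thin matrices.
W_value_down = rng.normal(size=(d_model, d_value))   # embedding space -> small space
W_value_up   = rng.normal(size=(d_value, d_model))   # small space -> embedding space

# Their product is the overall value map, constrained to rank at most d_value.
W_V_effective = W_value_down @ W_value_up
print(W_V_effective.shape, np.linalg.matrix_rank(W_V_effective))   # (8, 8) 4
```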

play18:11

Turning back to the parameter count, all four of these matrices have the same size,

play18:16

and adding them all up we get about 6.3 million parameters for one attention head.
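The arithmetic behind those counts, using the GPT-3 sizes quoted here:

```python
d_embed, d_key_query = 12_288, 128

per_matrix = d_embed * d_key_query   # key, query, value-down, and value-up are all this size
per_head = 4 * per_matrix

print(f"{per_matrix:,}")   # 1,572,864  -> the "1.5 million or so" per key or query matrix
print(f"{per_head:,}")     # 6,291,456  -> the "about 6.3 million" per attention head
```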

play18:22

As a quick side note, to be a little more accurate,

play18:24

everything described so far is what people would call a self-attention head,

play18:27

to distinguish it from a variation that comes up in other models that's

play18:30

called cross-attention.

play18:32

This isn't relevant to our GPT example, but if you're curious,

play18:35

cross-attention involves models that process two distinct types of data,

play18:39

like text in one language and text in another language that's part of an

play18:43

ongoing generation of a translation, or maybe audio input of speech and an

play18:48

ongoing transcription.

play18:50

A cross-attention head looks almost identical.

play18:52

The only difference is that the key and query maps act on different data sets.

play18:57

In a model doing translation, for example, the keys might come from one language,

play19:02

while the queries come from another, and the attention pattern could describe

play19:06

which words from one language correspond to which words in another.

play19:10

And in this setting there would typically be no masking,

play19:12

since there's not really any notion of later tokens affecting earlier ones.

play19:17

Staying focused on self-attention though, if you understood everything so far,

play19:20

and if you were to stop here, you would come away with the essence of what attention

play19:24

really is.

play19:25

All that's really left to us is to lay out the

play19:28

sense in which you do this many many different times.

play19:32

In our central example we focused on adjectives updating nouns,

play19:35

but of course there are lots of different ways that context can influence the

play19:38

meaning of a word.

play19:40

If the words "they crashed the" preceded the word car,

play19:43

it has implications for the shape and structure of that car.

play19:47

And a lot of associations might be less grammatical.

play19:49

If the word wizard is anywhere in the same passage as Harry,

play19:52

it suggests that this might be referring to Harry Potter,

play19:55

whereas if instead the words Queen, Sussex, and William were in that passage,

play20:00

then perhaps the embedding of Harry should instead be updated to refer to the prince.

play20:05

For every different type of contextual updating that you might imagine,

play20:08

the parameters of these key and query matrices would be different to

play20:11

capture the different attention patterns, and the parameters of our

play20:15

value map would be different based on what should be added to the embeddings.

play20:19

And again, in practice the true behavior of these maps is much more

play20:23

difficult to interpret, where the weights are set to do whatever the

play20:26

model needs them to do to best accomplish its goal of predicting the next token.

play20:31

As I said before, everything we described is a single head of attention,

play20:35

and a full attention block inside a transformer consists of what's

play20:38

called multi-headed attention, where you run a lot of these operations in parallel,

play20:43

each with its own distinct key query and value maps.

play20:47

GPT-3 for example uses 96 attention heads inside each block.

play20:52

Considering that each one is already a bit confusing,

play20:54

it's certainly a lot to hold in your head.

play20:56

Just to spell it all out very explicitly, this means you have 96

play21:00

distinct key and query matrices producing 96 distinct attention patterns.

play21:05

Then each head has its own distinct value matrices

play21:08

used to produce 96 sequences of value vectors.

play21:12

These are all added together using the corresponding attention patterns as weights.

play21:17

What this means is that for each position in the context, each token,

play21:21

every one of these heads produces a proposed change to be added to the embedding in

play21:26

that position.

play21:27

So what you do is you sum together all of those proposed changes,

play21:31

one for each head, and you add the result to the original embedding of that position.

play21:36

This entire sum here would be one slice of what's outputted from this multi-headed

play21:41

attention block, a single one of those refined embeddings that pops out the other end

play21:47

of it.
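A sketch of that per-position sum over heads, with a handful of toy heads standing in for GPT-3's 96 and random weights standing in for the learned maps:

```python
import numpy as np

rng = np.random.default_rng(5)
n_tokens, d_model, d_head, n_heads = 4, 8, 2, 3   # toy sizes; GPT-3 uses 12,288 / 128 / 96

E = rng.normal(size=(n_tokens, d_model))
causal_mask = np.tril(np.ones((n_tokens, n_tokens), dtype=bool), k=-1)

def one_head(E):
    """One attention head with its own (randomly initialized) key, query, and value maps."""
    W_Q, W_K = rng.normal(size=(d_model, d_head)), rng.normal(size=(d_model, d_head))
    W_down, W_up = rng.normal(size=(d_model, d_head)), rng.normal(size=(d_head, d_model))
    scores = (E @ W_K) @ (E @ W_Q).T / np.sqrt(d_head)
    scores[causal_mask] = -np.inf
    scores -= scores.max(axis=0, keepdims=True)
    pattern = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
    return pattern.T @ (E @ W_down @ W_up)   # this head's proposed change for every position

# Each head proposes a change; the changes are summed and added to the original embeddings.
delta_E = sum(one_head(E) for _ in range(n_heads))
E_out = E + delta_E
print(E_out.shape)   # (4, 8)
```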

play21:48

Again, this is a lot to think about, so don't

play21:50

worry at all if it takes some time to sink in.

play21:52

The overall idea is that by running many distinct heads in parallel,

play21:56

you're giving the model the capacity to learn many distinct ways that context

play22:00

changes meaning.

play22:03

Pulling up our running tally for parameter count with 96 heads,

play22:07

each including its own variation of these four matrices,

play22:10

each block of multi-headed attention ends up with around 600 million parameters.

play22:16

There's one added slightly annoying thing that I should really

play22:19

mention for any of you who go on to read more about transformers.

play22:22

You remember how I said that the value map is factored out into these two

play22:25

distinct matrices, which I labeled as the value down and the value up matrices.

play22:29

The way that I framed things would suggest that you see this pair of matrices

play22:34

inside each attention head, and you could absolutely implement it this way.

play22:38

That would be a valid design.

play22:40

But the way that you see this written in papers and the way

play22:42

that it's implemented in practice looks a little different.

play22:45

All of these value up matrices for each head appear stapled together in one giant matrix

play22:50

that we call the output matrix, associated with the entire multi-headed attention block.

play22:56

And when you see people refer to the value matrix for a given attention head,

play23:00

they're typically only referring to this first step,

play23:03

the one that I was labeling as the value down projection into the smaller space.
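A small check of why the two framings agree: summing each head's value-up output is the same as concatenating the heads' low-dimensional outputs and multiplying by one stacked output matrix (toy sizes, random stand-in data):

```python
import numpy as np

rng = np.random.default_rng(6)
n_tokens, d_model, d_head, n_heads = 4, 8, 2, 3

# Per-head low-dimensional outputs (what the pattern-weighted value-down step would produce).
head_outputs = [rng.normal(size=(n_tokens, d_head)) for _ in range(n_heads)]
# Per-head value-up matrices.
W_ups = [rng.normal(size=(d_head, d_model)) for _ in range(n_heads)]

# Framing 1: each head projects back up on its own, and the results are summed.
summed = sum(h @ W for h, W in zip(head_outputs, W_ups))

# Framing 2: concatenate the head outputs and multiply by one stacked "output matrix".
W_O = np.concatenate(W_ups, axis=0)                   # shape (n_heads * d_head, d_model)
concatenated = np.concatenate(head_outputs, axis=1)   # shape (n_tokens, n_heads * d_head)
via_output_matrix = concatenated @ W_O

print(np.allclose(summed, via_output_matrix))   # True
```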

play23:08

For the curious among you, I've left an on-screen note about it.

play23:11

It's one of those details that runs the risk of distracting

play23:13

from the main conceptual points, but I do want to call it out

play23:16

just so that you know if you read about this in other sources.

play23:19

Setting aside all the technical nuances, in the preview from the last chapter we saw

play23:23

how data flowing through a transformer doesn't just flow through a single attention block.

play23:28

For one thing, it also goes through these other operations called multi-layer perceptrons.

play23:33

We'll talk more about those in the next chapter.

play23:35

And then it repeatedly goes through many many copies of both of these operations.

play23:39

What this means is that after a given word imbibes some of its context,

play23:43

there are many more chances for this more nuanced embedding

play23:47

to be influenced by its more nuanced surroundings.

play23:50

The further down the network you go, with each embedding taking in more and more

play23:54

meaning from all the other embeddings, which themselves are getting more and more

play23:59

nuanced, the hope is that there's the capacity to encode higher level and more

play24:03

abstract ideas about a given input beyond just descriptors and grammatical structure.

play24:07

Things like sentiment and tone and whether it's a poem and what underlying

play24:11

scientific truths are relevant to the piece and things like that.

play24:16

Turning back one more time to our scorekeeping, GPT-3 includes 96 distinct layers,

play24:22

so the total number of key query and value parameters is multiplied by another 96,

play24:27

which brings the total sum to just under 58 billion distinct parameters

play24:32

devoted to all of the attention heads.
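And the scaling arithmetic behind those figures:

```python
d_embed, d_key_query, heads_per_block, n_layers = 12_288, 128, 96, 96

per_head = 4 * d_embed * d_key_query   # key, query, value-down, value-up
per_block = heads_per_block * per_head
total_attention = n_layers * per_block

print(f"{per_block:,}")         # 603,979,776    -> the "around 600 million" per block
print(f"{total_attention:,}")   # 57,982,058,496 -> the "just under 58 billion" in total
```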

play24:34

That is a lot to be sure, but it's only about a third

play24:38

of the 175 billion that are in the network in total.

play24:41

So even though attention gets all of the attention,

play24:44

the majority of parameters come from the blocks sitting in between these steps.

play24:48

In the next chapter, you and I will talk more about those

play24:51

other blocks and also a lot more about the training process.

play24:54

A big part of the story for the success of the attention mechanism is not so much any

play24:58

specific kind of behavior that it enables, but the fact that it's extremely

play25:03

parallelizable, meaning that you can run a huge number of computations in a short time

play25:07

using GPUs.

play25:09

Given that one of the big lessons about deep learning in the last decade or two has

play25:13

been that scale alone seems to give huge qualitative improvements in model performance,

play25:17

there's a huge advantage to parallelizable architectures that let you do this.

play25:22

If you want to learn more about this stuff, I've left lots of links in the description.

play25:25

In particular, anything produced by Andrej Karpathy or Chris Olah tends to be pure gold.

play25:30

In this video, I wanted to just jump into attention in its current form,

play25:33

but if you're curious about more of the history for how we got here

play25:36

and how you might reinvent this idea for yourself,

play25:38

my friend Vivek just put up a couple videos giving a lot more of that motivation.

play25:43

Also, Britt Cruz from the channel The Art of the Problem has

play25:45

a really nice video about the history of large language models.

play26:04

Thank you.


Related Tags
Artificial Intelligence, Machine Learning, Deep Learning, NLP, Transformer Model, Attention Mechanism, Language Processing, AI Technology, Data Science, Model Training