Transformers - Part 7 - Decoder (2): masked self-attention

Lennart Svensson
18 Nov 2020 · 08:37

Summary

TL;DR: This video explains masked self-attention in the transformer decoder, the component that makes it possible to parallelize calculations during training. It describes how the decoder maintains the shape of its input, with one embedded vector per element of the output sequence and vectors of the same length as the output embeddings. The key focus is the masked multi-head self-attention layer, which prevents the network from 'cheating' by accessing future words in the sequence, so that it learns to predict unseen words. The video then walks through the construction of this layer, from computing queries and keys to applying a mask that zeroes out the weights for subsequent words, culminating in new embeddings that depend only on the preceding input vectors.

Takeaways

  • 🧠 Masked self-attention is the mechanism that lets the decoder parallelize calculations during training.
  • 🔄 The decoder consists of 'N' decoder blocks with the same structure but different parameters, similar to the encoder.
  • 📏 The input and output matrices in the decoder maintain the same shape as the embedded output sequence.
  • 🔢 The number of vectors in the decoder is generally different from the number processed by the encoder.
  • 🚀 The goal is to compute all next word prediction probabilities in parallel, which speeds up training.
  • 🚫 The decoder should not 'cheat' by having access to the target word when predicting the next word in the sequence.
  • 🔑 Queries, keys, and values are computed for each input token, which are essential for the self-attention mechanism.
  • 🎭 Masking is applied to unnormalized weights to ensure that later words do not influence the current word's embedding.
  • 📊 The softmax operation is used to normalize weights, but with a mask to prevent future words from influencing the current word.
  • 🔍 The masked self-attention layer ensures that the embedding for a word only depends on the preceding words.
  • 🔄 Masked multi-head self-attention combines the outputs of several masked self-attention heads into a single output with the same dimension as the input.

Q & A

  • What is the main purpose of the masked self-attention mechanism in the decoder?

    -The main purpose of the masked self-attention mechanism in the decoder is to enable parallelization of calculations during training and to ensure that the prediction of each word in the output sequence does not have access to future words in the sequence.

  • How does the decoder maintain the shape of the input?

    -The decoder maintains the shape of the input by ensuring that the matrices passed between the layers have the same shape as the embedded version of the output sequence, with the number of vectors being the same as the number of elements in the output sequence.

  • What is the significance of the number of vectors in the decoder being generally different from the encoder?

    -The number of vectors in the decoder equals the length of the output (target) sequence, which is generally different from the length of the source sequence processed by the encoder, so the decoder's internal shapes are tied to the translation being generated rather than to the input sentence.

  • Why is it important to feed the entire output sequence into the decoder at once during training?

    -Feeding the entire output sequence into the decoder at once during training allows for the computation of all next word prediction probabilities in parallel, which speeds up the training process.

  • How does the decoder prevent the network from 'cheating' during the computation of next word probabilities?

    -The decoder prevents cheating by designing the network such that when computing the probabilities of a word in the sequence, it does not have access to that word or any subsequent words.

  • What role does the start of sequence token play in the decoder's computation of word probabilities?

    -The start of sequence token serves as the initial input for the decoder, and it is used along with the output from the encoder to compute the probabilities for the first word in the output sequence.

  • Can you explain how the masked multi-head self-attention layer is constructed using an example?

    -The masked multi-head self-attention layer is constructed by first computing queries, keys, and values for each input token. The unnormalized weights are then masked so that later words in the sequence cannot influence the embeddings of earlier words. After normalization, the new embeddings are computed as a weighted average of the value vectors, so each word's embedding depends only on the preceding input vectors.

  • Why is it unnecessary to compute z43 and z53 when focusing on computing y3?

    -It is unnecessary to compute z43 and z53 when focusing on y3 because the masked self-attention mechanism ensures that the embedding for the third word (y3) should only depend on the first three input words (x1, x2, and x3), and not on any subsequent words.

  • How is the masked self-attention expressed in matrix form?

    -In matrix form, the layer first computes queries, keys, values, and the unnormalized weights (z values) using the weight matrices. A mask then sets the weights for all later words to zero in each column, and the remaining weights are normalized so that every column sums to one. The new embeddings are obtained as the product of the value matrix and this masked, normalized weight matrix (a small numerical sketch of the masking step follows this list).

  • What is the final step in the computation of the masked multi-head self-attention layer?

    -The final step is to concatenate the different Y matrices computed by the different heads and then multiply this tall matrix with a weight matrix (W_o) to obtain the output Y, which has the same dimension as the input X.

  • How does the order of input vectors affect the masked self-attention mechanism?

    -The order of input vectors is important in masked self-attention because it determines the dependencies between words in the sequence. The mechanism ensures that each word embedding only depends on the preceding input vectors, reflecting the sequential nature of language.
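
To make the masking step described in these answers concrete, here is a minimal NumPy sketch; the sequence length, random scores, and variable names are illustrative, not taken from the video. It builds the mask for a five-token sequence and normalizes each column, so column j ends up with exactly j non-zero weights.

```python
import numpy as np

n = 5                                     # number of tokens (illustrative)
rng = np.random.default_rng(0)
Z = rng.normal(size=(n, n))               # stand-in for unnormalized weights k_i . q_j / sqrt(d)

# Keep entry (i, j) only when i <= j, i.e. key i is not a future word for query j.
allowed = np.triu(np.ones((n, n), dtype=bool))

# Masked softmax per column: disallowed entries get weight zero,
# and the remaining entries in each column sum to one.
Z_masked = np.where(allowed, Z, -np.inf)
W = np.exp(Z_masked - Z_masked.max(axis=0, keepdims=True))
W = W / W.sum(axis=0, keepdims=True)

print(np.round(W, 2))                     # column j has j non-zero entries
```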

Outlines

00:00

🤖 Masked Self-Attention in Decoders

This paragraph introduces the concept of masked self-attention within the decoder, a crucial component that enables parallelization of calculations during training. The decoder is composed of 'N' identically structured blocks with unique parameters, maintaining the input shape while ensuring that the matrices passed between layers match the shape of the embedded output sequence. The goal is to understand the masked multi-head self-attention layer, which is the first of two self-attention layers in the decoder. The paragraph emphasizes the need for the decoder to compute all next word prediction probabilities in parallel during training, without 'cheating' by accessing future words in the sequence. This is illustrated with an example of how the network should only consider the output from the encoder and preceding words when predicting the next word, ensuring the network learns to predict unseen words for effective translation.

05:01

🔍 Constructing the Masked Multi-Head Self-Attention Layer

The second paragraph delves into the construction of the masked multi-head self-attention layer using an example. It explains the process of computing new embeddings for each word in the translation, ensuring that later words do not influence earlier ones. The paragraph details the computation of queries, keys, values, and z-values, and the application of a mask to prevent future words from affecting the current word's embedding. The normalization of weights to sum to one and the computation of the new embeddings as a weighted average of value vectors are also discussed. The paragraph concludes with the expression of the masked self-attention layer in matrix form, highlighting the use of weight matrices and softmax operations to ensure that each word's embedding depends only on the preceding input vectors, aligning with the desired property of the decoder.

Keywords

💡Decoder

The decoder is a component of a neural network; in the context of this video it is part of a transformer model used for tasks such as translation. It is designed to read the output from the encoder and produce an output sequence, typically a translation of the input text. In the video, the decoder is composed of multiple blocks with the same structure but different parameters, and it maintains the shape of its input: the matrices passed between layers have the same shape as the embedded output sequence.

💡Masked Self-Attention

Masked self-attention is a mechanism that allows the decoder to focus only on the relevant parts of the input sequence when making predictions. This is crucial for parallelizing calculations during training, as it prevents the decoder from 'cheating' by looking ahead in the sequence. The script describes how this is achieved by masking future tokens' influence on the current token's computation, ensuring that each token's embedding only depends on the preceding tokens.

💡Parallelization

Parallelization refers to the process of performing multiple calculations simultaneously, which is a technique used in training neural networks to speed up the process. In the context of the video, the goal is to compute all next word prediction probabilities in parallel by feeding the entire output sequence into the decoder at once during training. This is an important optimization that makes training more efficient.

💡Multi-Head Self-Attention

Multi-head self-attention is an extension of the self-attention mechanism where the input is split into multiple 'heads' that each compute a different representation of the input. This allows the model to jointly attend to information at different positions from different representational spaces. The script explains that the decoder includes a masked multi-head self-attention layer as its first self-attention layer, which is crucial for the model's ability to handle sequences effectively.

💡Embedding

In the context of neural networks, an embedding is a numerical representation of a word or a phrase that captures its semantic meaning. The video script mentions that the decoder maintains the shape of the input and that the embedded vectors have the same length as the output embeddings. The embeddings are the numerical inputs that the decoder uses to generate the output sequence.

💡Softmax

Softmax is a function often used in neural networks to convert a vector of values into a probability distribution. In the script, it is mentioned that the softmax function is used to obtain weights from the z values, which are then used to compute the new embeddings in the self-attention mechanism. The masked softmax is a variant used to ensure that certain weights are zeroed out, preventing future tokens from influencing the current computation.

💡Query, Key, Value

In the self-attention mechanism, queries, keys, and values are derived from the input embeddings. Queries are used to probe the sequence represented by keys, and the resulting scores are used to weight the values. The script explains that these are computed for each input token and are essential for determining how much attention each word in the sequence should receive.
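
As a small, purely illustrative sketch (the dimensions and weight matrices below are made up, not taken from the video), the three vectors for one token can be obtained from its embedding with three learned linear maps:

```python
import numpy as np

d_model, d_k = 8, 4                      # illustrative embedding and projection sizes
rng = np.random.default_rng(1)

W_q = rng.normal(size=(d_k, d_model))    # learned projection matrices (random stand-ins)
W_k = rng.normal(size=(d_k, d_model))
W_v = rng.normal(size=(d_k, d_model))

x_i = rng.normal(size=d_model)           # embedding of one input token

q_i = W_q @ x_i                          # query: probes the other positions
k_i = W_k @ x_i                          # key: what this position offers to queries
v_i = W_v @ x_i                          # value: the content mixed into the outputs
```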

💡Z Values

Z values, as mentioned in the script, are the result of taking the inner product between key and query vectors. They represent the unnormalized attention scores between different tokens in the sequence. These scores are then used to compute the attention weights through a softmax operation, with masking applied to ensure future tokens do not influence the current token's computation.

💡Normalization

Normalization in this context refers to adjusting the attention weights so that each column in the weight matrix sums to 1. This is important for ensuring that the final embeddings are a proper weighted average of the value vectors, as explained in the script. Normalization is thus the step that turns the masked scores into valid attention weights.

💡Weighted Average

A weighted average is a mathematical operation that combines different values according to their weights. In the script, it is mentioned that the new embeddings for each word are computed as a weighted average over the value vectors, where the weights are determined by the attention mechanism. This operation ensures that the embeddings reflect the importance of each word in the sequence relative to the current word being processed.

💡Start of Sequence Token

The start of sequence token is a special token used in sequence processing tasks to indicate the beginning of a sequence. In the script, it is mentioned that when computing the probabilities for the second word in the sequence, the network should only have access to the output from the encoder and the start of sequence token, which serves as the initial context for the decoder.

Highlights

This video studies the decoder's use of masked self-attention, which is what makes it possible to parallelize calculations during training.

The decoder contains 'N' decoder blocks with the same structure but different parameters, similar to the encoder.

The decoder maintains the shape of the input and the matrices passed between layers have the same shape as the embedded output sequence.

The number of vectors in the decoder is generally different from the number processed by the encoder.

The goal is to learn about the masked multi-head self-attention layer, the first of two self-attention layers in the decoder.

Computing all next word prediction probabilities in parallel during training is a desired property for speeding up training.

Feeding the entire output sequence into the decoder at once is only possible during training.

The network must not have access to the target word when computing probabilities to avoid 'cheating' and ensure learning to predict unseen words.

Masked multi-head self-attention is constructed to ensure dependencies are only on preceding words, not subsequent ones.

The query, key, and value vectors are computed for each input token in the masked self-attention layer.

Unnormalized weights are masked to prevent influence from future tokens before normalization.

Masking followed by normalization corresponds to a softmax over the z values of the preceding tokens only, so the z values for later tokens never need to be computed.

The new embedding for a word is computed as a weighted average over different value vectors, with future words' weights set to zero.

The complete expression for a masked self-attention layer is presented in matrix form, emphasizing the dependency only on preceding input vectors.

A mask is applied to ensure that weights for later words are zero when computing the weights for a given word.

The multi-head attention layer concatenates outputs from different heads and multiplies with a matrix to obtain an output with the same dimension as the input.

Outputs from different self-attention heads are computed using masked self-attention to prevent embeddings from depending on later input words.

Transcripts

In this second video about the decoder, we study masked self-attention, which is what enables us to parallelize calculations during training.

Here is an illustration of the decoder, which contains capital N decoder blocks that all have the same structure but different parameters. Similarly to the encoder, the decoder maintains the shape of the input. However, it is important to note that the matrices that are passed between the layers in the decoder all have the same shape as the embedded version of the output sequence: the number of vectors is always the same as the number of elements in the output sequence, and the embedded vectors all have the same length as the output embeddings fed into the first self-attention layer. We note specifically that the number of vectors is generally different from the number of vectors processed by the encoder.

The objective in this video is to learn about the masked multi-head self-attention layer, which is the first of the two self-attention layers in the decoder. However, let us first reason about one of the properties that we would like the decoder to have.

To speed up training, it would be good if we could compute all the next-word prediction probabilities in parallel. We will then feed the entire output sequence into the decoder at once; I am here using x to denote the desired translation, since that is what is used as input to the decoder. We note that this is only possible during training, since we obviously do not have access to the target translation when we want to use the network to translate a sentence. So what we describe here only speeds things up during training, but that is of course also very important.

Finally, in order for this to work, we need to design the network such that it cannot cheat: when computing the probabilities of the next word in the sequence, it obviously should not have access to that word. For instance, when computing the probabilities of x2, the network should only have access to the output from the encoder and x1, which in this case is the start-of-sequence token. If we also told the network that x2 is "I", the task would be trivial and the network would not learn to predict unseen words, which is the ability needed in order to later produce translations.

Similarly, when computing the probabilities for x4, the network should only have access to the output from the encoder as well as x1, x2, and x3, and we should not tell the network what the value of x4 is. As you can see, to keep the notation simple I have only written the probability of x4 here, which means that I have omitted the variables that we condition on, which in this case should be the encoder output and x1, x2, and x3. We have also omitted the variables that we condition on in the other expressions on this slide.

Let us now illustrate how the masked multi-head self-attention layer is constructed using an example. This layer receives one input vector for each word in the translation, as illustrated in the figure. If we focus on how we compute the new embedding y3, we realize that it should only depend on x1, x2, and x3, since the embedding for the third word in our sequence will eventually be used to predict x4.

As a first step, we proceed as usual and compute queries, keys, and values for each input token. We can also compute the z values by taking the inner products between keys and queries. Since we are focusing on how to compute y3, it is actually sufficient to compute the query vector q3 and the z values z13, z23, up until z53. For instance, to compute z13 we take the inner product between the key vector k1 and the query vector q3, and we then divide by the square root of d, where d is the length of the query and key vectors.
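
The step just described can be sketched in a few lines of NumPy; the sizes and random inputs below are made up for illustration, and only q3 and the scores z13 through z53 are formed, mirroring the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d = 5, 8, 4                  # five tokens; illustrative vector lengths

X = rng.normal(size=(d_model, n))        # columns x1..x5: inputs to this layer
W_q = rng.normal(size=(d, d_model))      # learned projections (random stand-ins)
W_k = rng.normal(size=(d, d_model))

q3 = W_q @ X[:, 2]                       # query for the third word
K = W_k @ X                              # keys k1..k5 as columns

z = K.T @ q3 / np.sqrt(d)                # z[i-1] = z_i3 = (k_i . q3) / sqrt(d)
```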

We would normally feed these values into a softmax to obtain the weights, but here we want to ensure that x4 and x5 do not influence y3, and we therefore first mask the unnormalized weights by setting the weights for the fourth and the fifth token to zero. We then simply normalize the weights so that they sum to one. However, if you look closely, it is easy to see that the operation we perform here actually corresponds to taking a softmax with respect to z13 to z33. We can therefore simply write this as follows, where the first three elements are given by the softmax, whereas the final two elements are zero. As you can see, we never actually use z43 and z53, and it is therefore unnecessary to compute them. Finally, we compute y3 by taking a weighted average over the different value vectors. Since the weights for the fourth and the fifth value vectors are zero, y3 does not depend on x4 and x5, which is what we wanted to achieve.
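
Continuing the toy sketch above (same X, W_q, W_k, and z), this is roughly what masking, normalizing, and forming the weighted average look like; the first three weights are just a softmax over z13, z23, z33, and the last two weights are zero.

```python
W_v = rng.normal(size=(d, d_model))      # values were not needed in the previous step
V = W_v @ X                              # v1..v5 as columns

w = np.exp(z[:3] - z[:3].max())          # softmax over z13, z23, z33 only
w = w / w.sum()
weights = np.concatenate([w, np.zeros(2)])   # weights for tokens 4 and 5 are zero

y3 = V @ weights                         # a weighted average of v1, v2, v3 only
```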

The complete expression for a masked self-attention layer is easy to express in matrix form. We first compute queries, keys, values, and z values using the weight matrices W_q, W_k, and W_v and the standard expressions. In terms of the unnormalized weights, we then apply a mask that sets many of the weights to zero; specifically, this ensures that when computing the weights for word number i, the weights for all later words are zero. These weights then need to be normalized such that each column in the matrix sums to one, and the result is a matrix with normalized weights in each column. We can also express this in terms of softmax operations. To fit at least a few terms on the same slide, I have here introduced the shorthand notation sm for softmax, and I am using a subindex to refer to different elements in the output of the softmax operation. For instance, the first column only has one non-zero element, and for the column to sum to one that element has to be one. In the second column, the first two elements are non-zero and are given by taking a softmax of z12 and z22. In general, column i has i non-zero elements, which we can compute by taking a softmax with respect to z1i to zii. Finally, we compute the new embeddings by taking capital V times capital W. This implies that yi is a weighted sum of the value vectors v1 to vi, and we therefore conclude that the i-th word embedding only depends on the first i input vectors to this layer.
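
Here is a sketch of the whole layer in matrix form, following the column convention used in the video's description (tokens as columns, each column of the weight matrix summing to one). The function name, shapes, and random parameters are illustrative.

```python
import numpy as np

def masked_self_attention(X, W_q, W_k, W_v):
    """X has one column per token; column i of the result depends only on columns 1..i of X."""
    d = W_q.shape[0]
    Q, K, V = W_q @ X, W_k @ X, W_v @ X
    Z = K.T @ Q / np.sqrt(d)                          # Z[i, j] = k_i . q_j / sqrt(d)

    n = X.shape[1]
    allowed = np.triu(np.ones((n, n), dtype=bool))    # keep i <= j: no future keys
    Z = np.where(allowed, Z, -np.inf)

    W = np.exp(Z - Z.max(axis=0, keepdims=True))      # masked softmax, column by column
    W = W / W.sum(axis=0, keepdims=True)
    return V @ W                                      # y_j = sum over i <= j of w_ij * v_i

# Illustrative usage
rng = np.random.default_rng(0)
d_model, d, n = 8, 4, 5
X = rng.normal(size=(d_model, n))
W_q, W_k, W_v = [rng.normal(size=(d, d_model)) for _ in range(3)]
Y = masked_self_attention(X, W_q, W_k, W_v)           # Y is d x n
```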

This is clearly the property that we desired. However, we also note that the order of the input vectors is important when we use masked self-attention, and we no longer have a mapping from one set to another.

So far we have learned about masked self-attention, and the decoder combines h of these into the masked multi-head self-attention layer. The overall structure of the multi-head attention layer is the same as in the encoder: we first concatenate the different Y matrices computed by the different heads before multiplying this tall matrix with a matrix W_o to obtain an output Y, which has the same dimension as the input X. The only difference is that the outputs Y_i from the different self-attention heads are computed using masked self-attention, to ensure that the embeddings never depend on later input words.
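
A sketch of how h masked heads could be combined, reusing masked_self_attention from the previous sketch; the head count, shapes, and random parameters are again illustrative.

```python
import numpy as np
# Assumes masked_self_attention() from the sketch above is already defined.

def masked_multi_head(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) tuples, one per head."""
    # Every head uses masked self-attention, so no output column depends on later inputs.
    Ys = [masked_self_attention(X, W_q, W_k, W_v) for (W_q, W_k, W_v) in heads]
    Y_tall = np.concatenate(Ys, axis=0)               # stack the h head outputs vertically
    return W_o @ Y_tall                               # back to the same dimension as X

# Illustrative usage
rng = np.random.default_rng(0)
d_model, d, n, h = 8, 4, 5, 2
X = rng.normal(size=(d_model, n))
heads = [tuple(rng.normal(size=(d, d_model)) for _ in range(3)) for _ in range(h)]
W_o = rng.normal(size=(d_model, h * d))
Y = masked_multi_head(X, heads, W_o)                  # same shape as X: d_model x n
```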
