Self-Attention

IIT Madras - B.S. Degree Programme
10 Aug 2023 | 21:31

Summary

TL;DR: This video script delves into the mechanics of the self-attention mechanism in neural networks, particularly within the context of the Transformer model. It explains how a single word embedding can be transformed into three separate vectors through linear transformations using matrices. These vectors, the query (Q), key (K), and value (V), play crucial roles in the attention calculation. The script walks through the process of computing attention scores, generating attention weights using softmax, and then calculating the output vector Z as a weighted sum of the value vectors. The highlight is the ability to perform these computations in parallel, contrasting with sequential models like RNNs, and introducing the concept of scaled dot-product attention.

Takeaways

  • Linear transformations convert each word vector into three vectors: query (Q), key (K), and value (V).
  • The Q, K, and V vectors are generated by multiplying the input embedding with learnable matrices (WQ, WK, WV).
  • In the attention computation for a given word, its query vector stays fixed while the key vectors of all words are used to compute scores.
  • The scoring function is the dot product between the query and each key vector, producing unnormalized attention scores.
  • The attention scores are normalized using the softmax function to give the importance of each word relative to the query word.
  • Z, the final output representation, is computed by taking a weighted sum of the V vectors using the attention weights.
  • All Q, K, and V computations can be parallelized, so the representations of all words are computed simultaneously.
  • Matrix multiplications allow for the parallel computation of Q, K, and V, leading to faster and more efficient calculations compared to sequential approaches like RNNs.
  • The product of the transposed query matrix and the key matrix forms a T x T attention matrix, which drives the rest of the self-attention computation.
  • The final contextual representation Z is obtained through a few matrix multiplications and a softmax over the scaled scores, a scheme known as scaled dot-product attention (a minimal end-to-end sketch follows this list).
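
To make these takeaways concrete, here is a minimal NumPy sketch (not the lecture's own code) that follows the same conventions: random placeholder embeddings, D = 64 and T = 5 as in the video's example, and H stored as a D x T matrix with one column per word.

```python
# Minimal end-to-end sketch of self-attention, following the lecture's
# conventions: H is D x T (one column per word), W_Q/W_K/W_V are D x D,
# and Z comes out as T x D (one D-dimensional row per word).
import numpy as np

rng = np.random.default_rng(0)
D, T = 64, 5                          # embedding dimension, number of words

H = rng.standard_normal((D, T))       # input embeddings h_1 ... h_T as columns
W_Q = rng.standard_normal((D, D))     # learnable query transformation
W_K = rng.standard_normal((D, D))     # learnable key transformation
W_V = rng.standard_normal((D, D))     # learnable value transformation

Q = W_Q @ H                           # D x T: all query vectors at once
K = W_K @ H                           # D x T: all key vectors at once
V = W_V @ H                           # D x T: all value vectors at once

scores = Q.T @ K / np.sqrt(D)         # T x T scaled dot-product scores
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A = A / A.sum(axis=1, keepdims=True)  # row-wise softmax -> attention weights
Z = A @ V.T                           # T x D: contextual representation per word

print(Z.shape)                        # (5, 64)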

Q & A

  • What is the purpose of the linear transformation in the context of the script?

    -The purpose of the linear transformations is to generate three different vectors (query, key, and value) from a single input embedding. This is done using the learnable matrices WQ, WK, and WV, each of which maps the input vector to the corresponding query, key, or value vector.
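
A minimal sketch of this step, assuming NumPy and random placeholder weights; the names h1, W_Q, W_K, W_V mirror the lecture's notation, but the values are illustrative only.

```python
# One embedding, three linear transformations: q1, k1, v1 from h1.
import numpy as np

rng = np.random.default_rng(1)
D = 64
h1 = rng.standard_normal(D)           # embedding of the first word (placeholder)

W_Q = rng.standard_normal((D, D))     # query transformation
W_K = rng.standard_normal((D, D))     # key transformation
W_V = rng.standard_normal((D, D))     # value transformation

q1 = W_Q @ h1                         # query vector for word 1
k1 = W_K @ h1                         # key vector for word 1
v1 = W_V @ h1                         # value vector for word 1
print(q1.shape, k1.shape, v1.shape)   # (64,) (64,) (64,)
```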

  • What role do the query, key, and value vectors play in the attention mechanism?

    -In the attention mechanism described in the script, the query vector of a word is used to compute the importance of every word in the sequence (including itself) with respect to that word. The key vectors are paired with the query to calculate the attention scores, and the value vectors are combined in the weighted sum that forms the output representation.

  • How are the attention scores computed between the query and key vectors?

    -The attention scores between the query and key vectors are computed using the dot product of the query vector with each of the key vectors. This results in a score for each key vector with respect to the query vector.
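
A small sketch of the scoring step under the same assumptions (NumPy, random placeholder vectors): the query of the first word stays fixed while the key index runs over all T words.

```python
# Unnormalized scores e_{1j} = q1 . k_j for j = 1 ... T.
import numpy as np

rng = np.random.default_rng(2)
D, T = 64, 5
q1 = rng.standard_normal(D)       # query vector of the first word
K = rng.standard_normal((D, T))   # key vectors k_1 ... k_T as columns

e1 = K.T @ q1                     # T scores: e_{1,1} ... e_{1,T}
print(e1.shape)                   # (5,)
```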

  • What is the significance of the softmax function in the attention computation?

    -The softmax function is used to normalize the attention scores into a probability distribution, which represents the importance of each word with respect to the query word. This allows the model to focus more on the relevant words and less on the irrelevant ones.
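
A sketch of the normalization step; the softmax helper and the example score values are illustrative, not taken from the video (subtracting the maximum is a standard numerical-stability trick).

```python
# Turning raw scores into attention weights that sum to 1.
import numpy as np

def softmax(e):
    """Normalize a score vector into a probability distribution."""
    exp_e = np.exp(e - e.max())            # subtract max for numerical stability
    return exp_e / exp_e.sum()

e1 = np.array([2.0, 0.5, -1.0, 3.0, 0.0])  # example scores e_{1,1} ... e_{1,5}
alpha1 = softmax(e1)                       # attention weights for word 1
print(alpha1, alpha1.sum())                # weights, and 1.0
```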

  • How is the final output vector (Z) computed in the self-attention mechanism?

    -The final output vector (Z) in the self-attention mechanism is computed as a weighted sum of the value vectors, where the weights are the attention scores after the softmax has been applied.
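
A sketch of the weighted sum, assuming NumPy; the attention weights here are made-up numbers that sum to 1, standing in for the softmax output.

```python
# z1 as a weighted sum of the value vectors v_1 ... v_T.
import numpy as np

rng = np.random.default_rng(3)
D, T = 64, 5
V = rng.standard_normal((D, T))               # value vectors as columns
alpha1 = np.array([0.1, 0.2, 0.4, 0.2, 0.1])  # attention weights for word 1

z1 = V @ alpha1                               # D-dimensional contextual vector
print(z1.shape)                               # (64,)
```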

  • Why is it beneficial to compute the attention mechanism in parallel rather than sequentially?

    -Computing the attention mechanism in parallel allows much faster processing of the input sequence. Unlike RNNs, which produce their outputs sequentially, the self-attention mechanism can produce all output vectors simultaneously, which is more efficient and makes far better use of parallel hardware as sequences grow longer.
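
A sketch (NumPy, random placeholder matrices) contrasting sequential, word-by-word computation with the fully batched version. Both give the same Z, but the batched form is a few matrix products with no dependence between output positions, unlike an RNN's step-by-step recurrence.

```python
# Sequential loop vs. batched matrix form of self-attention outputs.
import numpy as np

rng = np.random.default_rng(4)
D, T = 64, 5
Q, K, V = (rng.standard_normal((D, T)) for _ in range(3))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Sequential: compute one output vector z_i at a time.
Z_loop = np.stack([V @ softmax(K.T @ Q[:, i] / np.sqrt(D)) for i in range(T)])

# Batched: all T outputs from one chain of matrix multiplications.
Z_batch = softmax(Q.T @ K / np.sqrt(D), axis=1) @ V.T

assert np.allclose(Z_loop, Z_batch)   # identical results, computed in parallel
```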

  • What is the term used to describe the dot-product attention mechanism when the scores are scaled by the square root of the dimension?

    -When the dot products are divided by the square root of the key dimension before the softmax, the mechanism is referred to as 'scaled dot-product attention'. The scaling keeps the scores in a range where the softmax does not saturate, which helps stabilize the gradients during training.
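
A small numerical illustration (entries drawn from a standard normal, which is the usual setting for this argument rather than something shown in the video) of why the dot products are divided by sqrt(d): without scaling, their spread grows like sqrt(d), pushing the softmax towards saturation.

```python
# Spread of unscaled vs. scaled dot products for d-dimensional random vectors.
import numpy as np

rng = np.random.default_rng(5)
d = 64
q = rng.standard_normal((10_000, d))
k = rng.standard_normal((10_000, d))

raw = (q * k).sum(axis=1)          # 10,000 unscaled dot products
scaled = raw / np.sqrt(d)          # the same dot products after scaling

print(raw.std(), scaled.std())     # roughly sqrt(64) = 8 vs. roughly 1
```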

  • How does the script demonstrate that the entire attention computation can be vectorized?

    -The script shows that the query, key, and value vectors for all words can be computed in parallel using matrix multiplications, that the full T x T attention matrix can be obtained from a single product of the query and key matrices, and that applying the softmax followed by one more multiplication with the value matrix yields all output vectors at once.
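
A sketch of the equivalence argued in the video, assuming NumPy and random placeholder matrices: the second row of the batched result is compared against the hand-built weighted sum for word 2, mirroring the row-by-row check done on the board.

```python
# Row 2 of softmax(Q^T K / sqrt(d)) V^T equals the manual weighted sum for z2.
import numpy as np

rng = np.random.default_rng(8)
D, T = 64, 5
Q, K, V = (rng.standard_normal((D, T)) for _ in range(3))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

Z = softmax(Q.T @ K / np.sqrt(D), axis=1) @ V.T          # all rows at once

alpha2 = softmax(K.T @ Q[:, 1] / np.sqrt(D))             # weights for word 2
z2_manual = sum(alpha2[j] * V[:, j] for j in range(T))   # weighted sum of v_j

assert np.allclose(Z[1], z2_manual)
```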

  • What is the term 'self-attention' referring to in the context of the script?

    -In the context of the script, 'self-attention' refers to the mechanism where the input sequence attends to itself. This means that each word in the sequence is used as a query to attend to all words in the sequence, including itself, to compute the contextual representation.

  • What are the dimensions of the matrices involved in the self-attention computation?

    -The input matrix is D x T, where D is the dimensionality of the input embeddings and T is the number of words in the sequence. The transformation matrices WQ, WK, and WV are D x D, so the resulting Q, K, and V matrices are each D x T. The attention matrix Q^T K is T x T, and the output Z is T x D, one D-dimensional contextual vector per word.
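
A shape check for these conventions, using zero matrices purely as placeholders and the D = 64, T = 5 values from the lecture's example.

```python
# Assert the shapes of every intermediate in the self-attention computation.
import numpy as np

D, T = 64, 5
H = np.zeros((D, T))                        # input embeddings, one column per word
W_Q = np.zeros((D, D))                      # placeholder transformation matrices
W_K = np.zeros((D, D))
W_V = np.zeros((D, D))

Q, K, V = W_Q @ H, W_K @ H, W_V @ H
assert Q.shape == K.shape == V.shape == (D, T)
assert (Q.T @ K).shape == (T, T)            # attention matrix
assert ((Q.T @ K) @ V.T).shape == (T, D)    # one D-dimensional z per word
```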

Outlines

00:00

Understanding the Self-Attention Mechanism

This paragraph introduces the concept of self-attention in neural networks, specifically focusing on the transformation of input embeddings into query, key, and value vectors through linear transformations using learnable matrices. The process involves generating three vectors for each word in the input sequence, which are then utilized in the attention mechanism. The paragraph emphasizes the parallelizability of these transformations, contrasting it with the sequential nature of RNNs, and sets the stage for explaining how attention scores are computed.

05:01

Computing Attention Weights and Contextual Representations

The second paragraph delves into the computation of attention weights using the query and key vectors. It describes how scores are calculated through dot products between the query vector and all key vectors, followed by applying a softmax function to obtain the attention weights. These weights are then used to create a weighted sum of the value vectors, resulting in a contextual representation for each word. The summary also touches on the self-attention mechanism's ability to assess the importance of all words relative to a given word, highlighting the parallel computation of these representations.

10:03

Parallelizing Attention Computations

This paragraph discusses the parallelization of the attention mechanism, allowing for the simultaneous computation of query, key, and value vectors for all words in the input. It explains how matrix multiplications can be used to compute these vectors in bulk, rather than sequentially, which is a significant advantage over RNNs. The paragraph illustrates how the entire output of the self-attention layer can be generated in one go, emphasizing the efficiency and speed of this approach.

15:03

Matrix Multiplication in Self-Attention

The fourth paragraph focuses on the matrix multiplications involved in the self-attention mechanism. It describes how the attention matrix, a T x T matrix obtained by multiplying the transposed query matrix with the key matrix, is used to weight the value matrix. This results in the computation of the output matrix Z, which contains the contextual representations for all input words. The explanation includes the concept of scaled dot-product attention, where the dot products are divided by the square root of the dimension to help stabilize the gradients.
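
A compact function form of this block, sketched under the same assumptions as the earlier snippets (NumPy, the lecture's D x T layout); the function name is mine, not any library's API, and the linear transformations are assumed to have been applied already.

```python
# Scaled dot-product attention packaged as a single function.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Given D x T query, key, and value matrices, return the T x D matrix Z."""
    d = Q.shape[0]
    scores = Q.T @ K / np.sqrt(d)                    # T x T scaled scores
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)                # row-wise softmax
    return A @ V.T                                   # T x D contextual outputs

rng = np.random.default_rng(7)
D, T = 64, 5
Q, K, V = (rng.standard_normal((D, T)) for _ in range(3))
Z = scaled_dot_product_attention(Q, K, V)
print(Z.shape)                                       # (5, 64)
```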

20:04

Overview of the Self-Attention Layer

The final paragraph provides a comprehensive overview of the self-attention layer, summarizing the key steps involved in the process. It reiterates the initial linear transformations of the input embeddings, the subsequent matrix multiplications for computing the scaled dot product attention, and the final matrix multiplication to obtain the output vectors. The paragraph concludes by emphasizing the ability of the self-attention mechanism to compute contextual representations in parallel, which is a fundamental aspect of the Transformer model's efficiency.

Keywords

Embedding

Embedding in the context of the video refers to a numerical representation of words or phrases in a continuous vector space. It's a fundamental concept in natural language processing where words are transformed into vectors of real numbers. In the video, embeddings are used as inputs to the attention mechanism, and the script mentions 'H1' as an example of an embedding, which undergoes linear transformations to produce different vectors for the attention computation.

Linear Transformation

A linear transformation is a function that maps vectors to vectors while preserving the operations of vector addition and scalar multiplication. In the video, linear transformations are performed using matrices to convert the initial word embeddings into different vector representations necessary for the attention mechanism. The script describes multiplying an embedding by a matrix to obtain a transformed vector, which is then used in subsequent computations.

Attention Mechanism

The attention mechanism is a technique used in neural networks to weigh the importance of different parts of the input data. In the video, the attention mechanism is central to understanding how the model processes sequences of data, such as words in a sentence. The script explains how the attention mechanism uses query, key, and value vectors to compute the importance of each word relative to the others.

Query, Key, Value Vectors

In the context of the attention mechanism, query, key, and value vectors are generated from the input embeddings. The query vector represents the word currently being processed, while the key vectors represent all words in the sequence, including that word itself. The value vectors are what gets combined in the weighted sum that forms the output. The script describes how these vectors are computed through linear transformations with learnable matrices and are essential for calculating the attention scores.

Softmax

Softmax is a function often used in machine learning to convert a vector of real numbers into a probability distribution consisting of probabilities proportional to the exponentials of the input numbers. In the video, softmax is applied to the attention scores to obtain a set of attention weights, which are then used to compute the weighted sum of the value vectors. The script mentions using softmax to compute the alphas from the dot products of query and key vectors.

Contextual Representation

A contextual representation in natural language processing refers to a representation of a word that is influenced by the context in which it appears. In the video, the contextual representation is the output of the self-attention layer, which takes into account the relationships between words to produce a more meaningful representation. The script describes how 'Z1' is computed as the contextual representation for a word by considering the importance of all other words relative to it.
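
A sketch illustrating what "contextual" means in practice, assuming NumPy and untrained random matrices: the same embedding h1 produces a different z1 when the other columns of the input change, because z1 is a weighted sum over the whole sequence.

```python
# Same word embedding, different context -> different contextual vector z1.
import numpy as np

rng = np.random.default_rng(6)
D, T = 64, 5
W_Q, W_K, W_V = (rng.standard_normal((D, D)) for _ in range(3))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H):
    Q, K, V = W_Q @ H, W_K @ H, W_V @ H
    return softmax(Q.T @ K / np.sqrt(D), axis=1) @ V.T   # T x D

h1 = rng.standard_normal((D, 1))                 # embedding of the word of interest
context_a = rng.standard_normal((D, T - 1))      # one set of surrounding words
context_b = rng.standard_normal((D, T - 1))      # a different set of surrounding words

z1_a = self_attention(np.hstack([h1, context_a]))[0]
z1_b = self_attention(np.hstack([h1, context_b]))[0]
print(np.allclose(z1_a, z1_b))                   # False: same word, different context
```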

Parallel Computation

Parallel computation refers to the ability to process multiple computations simultaneously. The video emphasizes the advantage of the attention mechanism over sequential models like RNNs, as it allows for parallel computation of the output vectors. The script explains how the computation of query, key, and value vectors for all words can be done in parallel, which is a significant efficiency improvement over sequential processing.

Matrix Multiplication

Matrix multiplication is the operation that combines two matrices into a new one, and it is what makes the attention computation parallelizable. In the video, matrix multiplication is used extensively to compute the query, key, and value vectors for all words at once, and then to combine the attention weights with the value vectors. The script provides examples of how stacking the input embeddings as columns of a matrix lets all of these vectors be computed simultaneously.

Scaled Dot Product Attention

Scaled dot product attention is a specific type of attention mechanism where the dot product of query and key vectors is scaled by the square root of the dimension of the vectors. This scaling helps prevent the gradients from becoming too small during training. The video script describes this process, emphasizing the scaling of the dot product by the dimension before applying the softmax function.

Self-Attention Layer

The self-attention layer, also known as intra-attention, is a mechanism where the sequence attends to itself. This means that each word in the sequence is related to all other words in the sequence. In the video, the self-attention layer is a core component of the Transformer model, which the script describes as computing a contextual representation for each word by considering its relationship with all other words in the sequence.

Highlights

Introduction to the concept of word embeddings and their transformation into different vector spaces.

Explanation of generating three vectors from one word embedding using linear transformations.

The role of the three vectors in the attention equation within the self-attention mechanism.

Details on the creation of query, key, and value vectors from word embeddings.

The importance of learnable parameters in the transformation matrices WQ, WV, and WK.

Computing the contextual representation Z1 for the word 'I' through the self-attention layer.

Parallel computation of key, query, and value vectors for all words in the input.

Understanding the computation of attention scores between query and key vectors.

The use of dot product as the scoring function in the attention mechanism.

Transformation of raw attention scores into normalized attention weights using softmax.

Derivation of the final representation Z as a weighted sum of value vectors.

Parallelization of the entire self-attention computation process.

Matrix multiplication techniques to compute all query, key, and value vectors simultaneously.

Scaled dot product attention as a method for efficient computation in self-attention.

Overview of the self-attention layer's role in providing contextual representations for input vectors.

The significance of parallel computation in self-attention compared to sequential processing in RNNs.

Introduction to the concept of multi-headed attention and its role in the Transformer Network.

Transcripts

So this is what is happening here: you had this one word embedding, let's call it h1. You did a linear transformation on it. Say it was a d-dimensional embedding; then you could multiply it by a d x d matrix and again get a d-dimensional vector, and you could do the same thing at all three places. So from one vector you have been able to generate three vectors, and you needed these three vectors because three vectors were participating in your attention equation earlier.

Now what do you do with these three vectors? Just look at their names again. These were all the h_i's you had, and these were the learnable transformation matrices: W_Q for query, W_V for value, and W_K for key. What you get out is the query vector Q, the value vector V, and the key vector K. So for each word you have now been able to create three vectors from it. How you use these vectors and how you compute the attention is something that will come later, but for now we are doing something simple: we take one vector and compute three vectors from it, and these three vectors are computed with the help of linear transformations coming from matrices. These are learnable parameters, so you have already introduced some parameters into the mix. I'm just repeating what I had said on the slide, and these are called the respective transformations.

Now let's focus on computing the output for the first input. You have h1, h2, all the way up to h5, and let's see how z1 gets computed, which is the contextual representation for the word "I", through the self-attention layer, and how the key, query, and value vectors that I showed play a role in this. Let's zoom into this; I'll just clear the annotations.

You had this single embedding, h1, and from there you can see three arrows coming out, so you would have realized that I'm going to compute three different values from this one vector. First I do a linear transformation with W_Q to get a new vector, which I'm going to call the query vector q1. Similarly I do a linear transformation with W_K and call the result k1, and a linear transformation with W_V and call the result v1. How I'm going to use q1, k1, v1 doesn't matter yet, but at least pictorially you know what is happening: I had one vector, I did three linear transformations to get three different vectors from it, and I call them q1, k1, v1. I'll continue the same for all the words in the input; all of this can be done in parallel, of course. Taking the last word as an example, I take h5, pass it through W_Q, and get q5; similarly I get k5 and v5. So if I had T words in the input, I have computed 3T vectors using these linear transformations.

Now what next? Earlier I was computing a score between s_{t-1} and h_j, and this was used to give me the importance of the j-th word at the t-th time step. So again I'll have some score, and my indices will be i and j, the i-th word and the j-th word. But what goes inside the score function? The function I'm going to use computes the score between q1 and k_j. This q1 is my query vector: I'm interested in knowing the weights of all the words with respect to the first one, and that's why it is the query. The query is the first word, and I want to understand the importance of all the words with respect to that word, so I'll pass different keys k1, k2, k3, k4, k5 through it and compute the score; I'm going to compute five such scores. Earlier, s_{t-1} was fixed, because it was the state of the decoder at time step t-1 and it did not change, so it was in some sense my query: at time step t I was interested in knowing the weights of all the input words. Now, with respect to my query, my first word, my word of focus, I want to know the weights of all the other words, so that word remains fixed. I'm going to have q1 and compute T such values, each telling me the importance of the first word, second word, third word, fourth word, fifth word. Notice that I'm also computing the importance of the word with respect to itself: q1 with respect to k1, k2, k3, k4, k5. These five scores I'm going to compute, and I'll call them the e's.

The function I'm going to use is just the dot product. What is the scoring function? My scoring function is just the dot product, so my e's, which are the unnormalized attention weights, are going to be the dot products between q1 and k1, q1 and k2, q1 and k3, all the way up to q1 and k5. As I mentioned, q1 remains fixed and I just keep changing the key, that is, the word or the representation with which I want to compute the attention. So I'll get some dot products here, which will be real values, and the way I'll compute the alphas from them is by doing a softmax on this vector. That's what this equation is saying: to compute alpha, this is e_{1,1}, this is e_{1,2}, all the way up to e_{1,5}, so to compute alpha_{1,2} I take e raised to e_{1,2} divided by the summation of e raised to all the e's. I'm just taking the softmax here.

Once I have taken the softmax, that gives me the alphas. But how am I going to compute my z? My z is going to be a weighted sum of the inputs, and the vectors that I am considering as inputs are now the v's; these are known as the value vectors. So q and k participated in computing the alphas, and once I have the alphas, my new representation z is going to be a weighted sum of the v's. You have all three vectors participating in this computation: q and k participate in computing the alphas, and then there is a v, just as you had a v sitting outside in the earlier attention equation, if you remember: you had that v getting multiplied by everything that was happening inside, where s_{t-1} and h_j were participating; you got a vector and then multiplied it by this. Similarly, now the v_j's participate in the final computation. So z1 is going to be computed this way.

Now how do you compute z2? The same story, everything remains the same: now you would have q2 as the query, so your e_2's would be the dot products between q2 and all the key vectors, then you compute the alphas as the softmax, and once you have computed the alphas (these are all alpha_2's) you compute z2 as a weighted sum of your five value vectors, where the weights come from the alphas. So it is very easy to understand now: you had h1 to h5, and from each of these you computed three vectors using the query transformation, key transformation, and value transformation. The query vector is used to find the importance of all the words with respect to this word: this is fixed, and the vector of scores contains the dot products between your query and all the keys that you have. Once you have computed the importance, you need to take a weighted sum of all the words, and for that weighted sum you look at the value vectors. That's how these three vectors get used in this computation. It should be clear from the diagram and the equation how the z_i's will be computed; you just do this for z1, z2, and all the rest.

Now let's see if we can vectorize all of these computations, that is, can we compute z1 to z_T in one go? Here I first told you how to compute z1, then z2, and similarly z3 up to z_T, but my whole point was that I don't want to compute these outputs sequentially. That's why I didn't like RNNs: if this was the RNN block, these outputs were coming one by one, and at the end of all this I don't want to do the same thing again. I want all of these to come out in parallel, otherwise it doesn't help me; I might as well have stayed with RNNs.

So can I do this in parallel? Let's see how I am computing q1: q1 was the matrix W_Q multiplied by h1. How would I compute q2? It would be W_Q multiplied by h2. How would I compute q3? W_Q multiplied by h3, and so on all the way up to h_T, where W_Q multiplied by h_T gives me q_T. So I can just write this as a matrix operation: you put all of these vectors inside a matrix, and if you multiply this matrix by that matrix, you get this matrix as an output. All the q's you can compute at one go; you don't need to do it sequentially. All the query vectors can be computed at one go by just multiplying these two matrices: W_Q multiplied by the matrix containing all the inputs as columns, and then you get the queries in one shot.

Now what is the dimension of Q? Let's look at that. Say the input dimension was d, and I'll take d as 64 for the purpose of explanation, so the input is a 64 x T matrix: each of these columns is 64-dimensional and you have T such entries. This gets multiplied by a 64 x 64 matrix; you could have thought of this as a d x d matrix times a d x T matrix, so these two matrices multiply and I get a d x T output. In my case I have taken d as 64, so this Q is of the same size as the input representation, but you could have different sizes also: if you had chosen this as d1, then your output would be d1 x T, and d1 could be either bigger or smaller than d, depending on whether you want to project to a smaller space or a higher space. But that is not the main point. The main point is that you had T inputs and you have T outputs; this T is not changing. You had T input representations and now you have T query representations computed from those T input representations. What you choose for the size d1 is up to you; in this example I have chosen both d1 and d equal to 64, so if my input was 64 x T, my output is also 64 x T. What is important is this T: you had T inputs and you got T outputs, and you did the entire computation in parallel.

Similarly, your key vectors can be computed in parallel: W_K multiplied by h1 gives you k1, W_K multiplied by h2 gives you k2, and so on, so you might as well stack them up in a matrix and do these two multiplications, and you get k1 to k_T in parallel. Again, the key thing is that you had T inputs and you got T outputs in parallel, so your K matrix is also going to be something x T. And lastly, the same holds for the value matrix: you get these T value vectors in parallel. So now you have parallelized the computation of K, Q, and V; at least that I don't need to do sequentially. Now, can I do the rest of it in parallel as well? That's the question.

So this is how I can compute the entire output in parallel, that is, all the z's in parallel. This is what your Z matrix would look like: it has these T outputs, and all of them can be computed in parallel by using this equation. How does that make sense? If you remember, I had started with this wish list with the words "I am enjoying ...", and this was a T x T matrix, so I had to compute this T x T attention matrix. Is that what is happening here? Indeed: if you remember, Q was 64 x T, so Q^T would be T x 64, and K was 64 x T, so when you multiply these two matrices you get a T x T matrix, which is essentially the attention matrix. Once you have the attention weights, you multiply them by the value matrix, which contains v1, v2, up to v_T. If you take the product of these two matrices, you again get T outputs: this is V^T, which is T x d, and the attention weights form a T x T matrix, so the two multiply to give you a T x d output. You get T z's, each of which is d-dimensional, so you can do the entire computation in parallel.

And you can check this. What I would like to show is that if you do this large matrix multiplication, z2 indeed comes out to be the same as what we had seen in the figure. Let's start from scratch: I had this T x T matrix, which was Q^T K. The rows here were q1^T, q2^T, and so on, and the columns were k1, k2, and so on, so the (i, j)-th entry of this matrix is just q_i^T k_j. In particular, the second row of this matrix is q2^T k1, q2^T k2, and so on up to q2^T k_T. Now this matrix Q^T K gets multiplied by V^T, which has v1 as its first row, v2 as its second row, and so on. If I multiply these two, what is the second row of the result? It is a linear combination of all the rows of V^T, where the weights are those scores, and that's exactly what I wanted: I wanted z2 to be q2^T k1, which was alpha_{2,1}, multiplied by v1, plus alpha_{2,2} multiplied by v2, and so on. That's exactly the computation happening here, which is why this entire thing can be produced as a matrix-matrix multiplication, and all of it can be done in parallel. Just to recap: your V's can be computed in parallel, your K's can be computed in parallel, your Q's can be computed in parallel, and once you have those, you get all the z's in parallel in one shot.

So we are able to parallelize the entire computation of Z. And what actually is Z, just in case you have forgotten: you have the input as h1, h2, all the way up to h_T, and Z is the output of this network. My main goal was that this output should be computed in parallel, as compared to RNNs where I was getting output 1, then 2, then 3, and then 4. Now I have shown you that all of this can be done in parallel: you just need to execute this matrix multiplication, which in turn contains many matrix multiplications. V itself comes from a matrix multiplication, K comes from a matrix multiplication, Q comes from a matrix multiplication; once you have those, you do these matrix multiplications and you get Z in one go. All of this is parallelizable: I don't need to wait for z_{T-1} before I compute z_T. That's the main takeaway here.

And you see something else here: you're scaling the scores by the square root of the dimension, and this d was 64 in the examples that I had done. There is some justification for why you need to do that which I'll not go into, but for all practical purposes you're taking the dot product and then scaling it by some value. This is what the dimensions look like, as I had already explained, so you get all the T outputs, and since you're taking the dot product and then scaling it, this is known as scaled dot-product attention. This is how you should look at the series of operations happening here: you had Q, K, and V; you did a matrix multiplication to get Q^T K; then you did a scaling to get the scaled output; then you did a softmax on that; and whatever you got from the softmax, which is again a T x T matrix, gets multiplied by a T x 64 matrix, and that's the matmul I'm talking about here. Finally you get a T x 64 output. So this is the attention that you get; this is the self-attention layer block, and this is what the full block looks like: it starts with the linear transformations, then the matrix multiplication, the softmax, and then again a matrix multiplication to give you the capital Z at the output, which in turn contains z1, z2, all the way up to z_T. So we have been able to see the self-attention layer, which does a parallel computation to give you a contextual representation for the input vectors that you had provided. I'll stop here, and in the next lecture we'll look at multi-headed attention, and then I'll talk about a few other components of the Transformer network. Thank you.


Related Tags
Self-Attention, Transformer Model, Machine Learning, Parallel Computing, Deep Learning, Neural Networks, Attention Mechanism, Matrix Multiplication, AI Technology, Data Science