Self-Attention
Summary
TLDR: This video script delves into the mechanics of the self-attention mechanism in neural networks, particularly within the context of the Transformer model. It explains how a single word embedding can be transformed into three separate vectors through linear transformations using learnable matrices. These vectors, the query (Q), key (K), and value (V), play crucial roles in the attention calculation. The script walks through computing attention scores, generating attention weights with softmax, and then calculating the output vector Z as a weighted sum of the value vectors. The highlight is the ability to perform these computations in parallel, in contrast with sequential models like RNNs, and the introduction of scaled dot-product attention.
Takeaways
- 🔢 Linear transformations are used to convert one word vector into three vectors: query (Q), key (K), and value (V).
- 🧠 The Q, K, and V vectors are generated using learnable matrices (WQ, WK, WV) and linear transformations.
- 🔄 These three vectors are used in attention computations, where Q remains fixed and K varies to compute scores.
- 📊 The scoring function is a dot product between the Q and K vectors; these raw scores are later normalized into attention weights.
- ⚖️ The attention weights are normalized using the softmax function to compute the importance of each word relative to the query.
- 🧮 Z, the final output representation, is computed by taking a weighted sum of the V vectors using the attention weights.
- ⚡ All Q, K, and V computations can be parallelized, ensuring that each word's attention is computed simultaneously.
- 📐 Matrix multiplications allow for the parallel computation of Q, K, and V, leading to faster and more efficient calculations compared to sequential approaches like RNNs.
- 🔀 The dot product between Q and K forms a T x T attention matrix, which is used in further computations for self-attention.
- 🧩 The final contextual representation Z is obtained through a chain of matrix multiplications and a softmax, with the dot products divided by the square root of the dimension; this is known as scaled dot-product attention (a runnable sketch follows this list).
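The takeaways above can be condensed into a short, runnable sketch. This is a minimal NumPy illustration, not the video's own code; all names and the column-per-word layout are assumptions chosen to match the lecture's D x T convention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    H: (d, T) input embeddings, one column per word (the lecture's convention).
    W_q, W_k, W_v: (d, d) learnable projection matrices.
    Returns Z of shape (T, d): one contextual representation per word.
    """
    d = H.shape[0]
    Q = W_q @ H                      # (d, T) query vectors as columns
    K = W_k @ H                      # (d, T) key vectors as columns
    V = W_v @ H                      # (d, T) value vectors as columns
    scores = Q.T @ K / np.sqrt(d)    # (T, T) scaled dot-product scores
    A = softmax(scores, axis=-1)     # (T, T) attention weights, rows sum to 1
    Z = A @ V.T                      # (T, d) weighted sums of the value vectors
    return Z

# Example: T = 5 words, d = 64 dimensions, as in the lecture.
rng = np.random.default_rng(0)
d, T = 64, 5
H = rng.standard_normal((d, T))
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
Z = self_attention(H, W_q, W_k, W_v)
print(Z.shape)  # (5, 64)
```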
Q & A
What is the purpose of the linear transformation in the context of the script?
-The purpose of the linear transformation in the script is to generate three different vectors (query, key, and value) from a single input vector (embedding). This is done using learnable matrices WQ, WK, and WV, which are used to transform the input vector into the respective query, key, and value vectors.
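As an illustration (not from the video), the projection step for a single embedding might look like this in NumPy, with an assumed dimension d = 64:

```python
import numpy as np

d = 64
h1 = np.random.randn(d)        # one word embedding (the lecture's h1)
W_q = np.random.randn(d, d)    # learnable query transformation
W_k = np.random.randn(d, d)    # learnable key transformation
W_v = np.random.randn(d, d)    # learnable value transformation

q1 = W_q @ h1  # query vector, shape (d,)
k1 = W_k @ h1  # key vector, shape (d,)
v1 = W_v @ h1  # value vector, shape (d,)
```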
What role do the query, key, and value vectors play in the attention mechanism?
-In the attention mechanism described in the script, the query vector is used to compute the importance of all other words with respect to a particular word. The key vectors are used to calculate the attention scores with the query vector, and the value vectors are used to compute the weighted sum that forms the output representation.
How are the attention scores computed between the query and key vectors?
-The attention scores between the query and key vectors are computed using the dot product of the query vector with each of the key vectors. This results in a score for each key vector with respect to the query vector.
What is the significance of the softmax function in the attention computation?
-The softmax function is used to normalize the attention scores into a probability distribution, which represents the importance of each word with respect to the query word. This allows the model to focus more on the relevant words and less on the irrelevant ones.
How is the final output vector (Z) computed in the self-attention mechanism?
-The final output vector (Z) in the self-attention mechanism is computed as a weighted sum of the value vectors. The weights are the attention scores after applying the softmax function.
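A sketch tying the last few answers together for a single query word; `q1`, `K`, and `V` are assumed to come from projections like those above, and the square-root scaling discussed below is omitted because the script introduces it later:

```python
import numpy as np

T, d = 5, 64
q1 = np.random.randn(d)         # query vector for word 1
K = np.random.randn(T, d)       # one key vector per row (k1 ... kT)
V = np.random.randn(T, d)       # one value vector per row (v1 ... vT)

e1 = K @ q1                     # raw scores e_{1,j} = q1 . k_j, shape (T,)
alpha1 = np.exp(e1 - e1.max())  # softmax, stabilized by subtracting the max
alpha1 /= alpha1.sum()          # attention weights, sum to 1
z1 = alpha1 @ V                 # Z1 = sum_j alpha_{1,j} * v_j, shape (d,)
```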
Why is it beneficial to compute the attention mechanism in parallel rather than sequentially?
-Computing the attention mechanism in parallel is beneficial because it allows for faster processing of the input sequence. Unlike RNNs, which produce outputs sequentially, the self-attention mechanism can produce all output vectors simultaneously, which is far more efficient on parallel hardware and scales better to long sequences.
What is the term used to describe the dot-product attention mechanism when the scores are scaled by the square root of the dimension?
-When the dot products are divided by the square root of the dimension, the mechanism is referred to as 'scaled dot-product attention'. This scaling keeps the scores from growing with the dimension and helps stabilize the gradients during training.
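A small numeric sketch (illustrative, not from the video) of why this helps: dot products of random d-dimensional vectors grow in magnitude with d, and an unscaled softmax over them saturates to a near one-hot vector, starving training of gradient signal.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, T = 512, 5
q = rng.standard_normal(d)
K = rng.standard_normal((T, d))

scores = K @ q                       # typical magnitude ~ sqrt(d)
print(softmax(scores))               # nearly one-hot: tiny gradients
print(softmax(scores / np.sqrt(d)))  # smoother, better-behaved weights
```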
How does the script demonstrate that the entire attention computation can be vectorized?
-The script demonstrates that the entire attention computation can be vectorized by showing that the computation of the query, key, and value vectors can be done in parallel using matrix multiplications. This includes the computation of the attention matrix, the application of the softmax function, and the final weighted sum to get the output vectors.
What is the term 'self-attention' referring to in the context of the script?
-In the context of the script, 'self-attention' refers to the mechanism where the input sequence attends to itself. This means that each word in the sequence is used as a query to attend to all words in the sequence, including itself, to compute the contextual representation.
What are the dimensions of the matrices involved in the self-attention computation?
-The dimensions of the matrices involved in the self-attention computation are as follows: the input matrix is D x T, where D is the dimensionality of the input embeddings and T is the number of words in the sequence. The transformation matrices WQ, WK, and WV are D x D, and the resulting Q, K, and V matrices are D x T. (A different projection size D1 is also possible, giving D1 x T outputs.)
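The same bookkeeping, checked in a short NumPy sketch with assumed values d = 64, T = 5:

```python
import numpy as np

d, T = 64, 5          # embedding size and sequence length (illustrative values)
H = np.zeros((d, T))  # input matrix: one d-dimensional column per word
W_q = np.zeros((d, d))
W_k = np.zeros((d, d))
W_v = np.zeros((d, d))

Q, K, V = W_q @ H, W_k @ H, W_v @ H
assert Q.shape == K.shape == V.shape == (d, T)  # T vectors of size d each
assert (Q.T @ K).shape == (T, T)                # the attention matrix is T x T
```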
Outlines
🧠 Understanding Self-Attention Mechanism
This paragraph introduces the concept of self-attention in neural networks, specifically focusing on the transformation of input embeddings into query, key, and value vectors through linear transformations using learnable matrices. The process involves generating three vectors for each word in the input sequence, which are then utilized in the attention mechanism. The paragraph emphasizes the parallelizability of these transformations, contrasting it with the sequential nature of RNNs, and sets the stage for explaining how attention scores are computed.
🔍 Computing Attention Weights and Contextual Representations
The second paragraph delves into the computation of attention weights using the query and key vectors. It describes how scores are calculated through dot products between the query vector and all key vectors, followed by applying a softmax function to obtain the attention weights. These weights are then used to create a weighted sum of the value vectors, resulting in a contextual representation for each word. The summary also touches on the self-attention mechanism's ability to assess the importance of all words relative to a given word, highlighting the parallel computation of these representations.
📚 Parallelizing Attention Computations
This paragraph discusses the parallelization of the attention mechanism, allowing for the simultaneous computation of query, key, and value vectors for all words in the input. It explains how matrix multiplications can be used to compute these vectors in bulk, rather than sequentially, which is a significant advantage over RNNs. The paragraph illustrates how the entire output of the self-attention layer can be generated in one go, emphasizing the efficiency and speed of this approach.
🔄 Matrix Multiplication in Self-Attention
The fourth paragraph focuses on the matrix multiplications involved in the self-attention mechanism. It describes how the attention matrix, a T x T matrix resulting from multiplying the transpose of the query matrix with the key matrix, is used to weight the value matrix. This results in the output matrix Z, which contains the contextual representations for all input words. The explanation includes the concept of scaled dot-product attention, where the dot products are scaled by the square root of the dimension to help stabilize the gradients.
🚀 Overview of the Self-Attention Layer
The final paragraph provides a comprehensive overview of the self-attention layer, summarizing the key steps involved in the process. It reiterates the initial linear transformations of the input embeddings, the subsequent matrix multiplications for computing the scaled dot product attention, and the final matrix multiplication to obtain the output vectors. The paragraph concludes by emphasizing the ability of the self-attention mechanism to compute contextual representations in parallel, which is a fundamental aspect of the Transformer model's efficiency.
Keywords
💡Embedding
💡Linear Transformation
💡Attention Mechanism
💡Query, Key, Value Vectors
💡Softmax
💡Contextual Representation
💡Parallel Computation
💡Matrix Multiplication
💡Scaled Dot Product Attention
💡Self-Attention Layer
Highlights
Introduction to the concept of word embeddings and their transformation into different vector spaces.
Explanation of generating three vectors from one word embedding using linear transformations.
The role of the three vectors in the attention equation within the self-attention mechanism.
Details on the creation of query, key, and value vectors from word embeddings.
The importance of learnable parameters in the transformation matrices WQ, WV, and WK.
Computing the contextual representation Z1 for the word 'I' through the self-attention layer.
Parallel computation of key, query, and value vectors for all words in the input.
Understanding the computation of attention scores between query and key vectors.
The use of dot product as the scoring function in the attention mechanism.
Transformation of raw attention scores into normalized attention weights using softmax.
Derivation of the final representation Z as a weighted sum of value vectors.
Parallelization of the entire self-attention computation process.
Matrix multiplication techniques to compute all query, key, and value vectors simultaneously.
Scaled dot-product attention: dividing the dot-product scores by the square root of the dimension for stable training.
Overview of the self-attention layer's role in providing contextual representations for input vectors.
The significance of parallel computation in self-attention compared to sequential processing in RNNs.
Introduction to the concept of multi-headed attention and its role in the Transformer Network.
Transcripts
So this is what is happening here: you had this one word embedding, call it h1, and you did a linear transformation on it. If it was a d-dimensional embedding, you could multiply it by a d x d matrix and again get a d-dimensional vector, and you could do the same thing at all three places. So from one vector you have been able to generate three vectors, and you needed these three vectors because three vectors were participating in your attention equation earlier.

Now what do you do with these three vectors? Just look at their names again: these were all the h_i's that you had, and these were the learnable transformation matrices, W_Q for query, W_V for value, and W_K for key. What you get out is the query vector Q, the value vector V, and the key vector K. So for each word, you have now been able to create three vectors from it. How you use these vectors and how you compute the attention is something that will come later on, but for now what we're doing is simple: for whatever reason, we are taking one vector and computing three vectors from it, and these three vectors are computed with the help of linear transformations coming from a matrix. These are learnable parameters, so you have already introduced some parameters into the mix. I'm just repeating what I had said on the slide, and these are called the respective transformations.
Now let's focus on computing the output for the first input. You have h1, h2, all the way up to h5, and let's see how Z1 gets computed, which is the contextual representation for the word 'I', through the self-attention layer, and how the key, query, and value vectors that I showed play a role in this. Let's zoom into this; I'll just clear the annotations.

You had this single embedding, which was h1, and you can see there are three arrows coming out, so you would have realized that I'm going to compute three different values from this one vector. First I'll do a linear transformation with W_Q to get a new vector, which I'm going to call the query vector q1. Similarly I'll do a linear transformation with W_K and call it k1, and similarly a linear transformation with W_V, which I'm going to call v1. How I'm going to use q1, k1, v1 doesn't matter yet, but at least pictorially you get to know what is happening here: I had one vector, I did three linear transformations to get three different vectors, and I'm calling them q1, k1, v1. I'll continue the same for all the words in the input, and all of this can be done in parallel, of course. I'll take the last word as an example: I'll take h5, pass it through W_Q, and get q5, and similarly I'll get k5 and v5. So if I had T words in the input, I have computed 3T vectors using these linear transformations. Okay, now what next?
What am I going to do next? Earlier I was computing a score; let me just flash the equation. Earlier I was computing the score between s_{t-1} and h_j, and this gave me the importance of the j-th word at the t-th time step. Now I'll again have some score, and my indices will be i and j, the i-th word and the j-th word. But what goes inside? The function that I'm going to use computes the score between q1 and k_j. So this is my query vector: I'm interested in knowing the weights of all the other words with respect to the first word, so the query is the first word, and I'll pass different keys, k1, k2, k3, k4, k5, through it and compute the score. So I'm going to compute five such scores.

Earlier, s_{t-1} was fixed, because that was the state of the decoder at time step t-1 and it did not change, so in some sense that was my query: at time step t I was interested in knowing the weights of all the input words with respect to it. Now, with respect to my query, my word of focus, I want to know the weights of all the other words, so that word is going to remain fixed: I'm going to have q1 and compute capital-T such values, each telling me the importance of the first word, second word, third word, fourth word, fifth word. Notice that I'm also computing the importance of the word with respect to itself. So q1 is scored against k1, k2, k3, k4, k5, and these five scores I'm going to call the e's.

The function that I'm going to use is just the dot product: my scoring function is the dot product. So my e's, which are the unnormalized attention weights, are just the dot products between q1 and k1, q1 and k2, q1 and k3, all the way up to q1 and k5. As I mentioned, q1 remains fixed and I just keep changing the key, that is, the representation with which I want to compute the attention. So I'll get some dot products, which will be real values, and the way I'll compute the alphas from that is that I'll just do a softmax on this vector. That's what this equation is saying: these are e_{1,1}, e_{1,2}, all the way up to e_{1,5}, and to compute alpha_{1,2} I'll take e raised to e_{1,2} divided by the summation over all the e's. So I'm just going to take the softmax here.
Now once I have taken the softmax, that gives me the alphas, but how am I going to compute my Z? My Z is going to be a weighted sum of the inputs, and the vectors that I am considering as inputs are the v's, which are known as the value vectors. So q and k participated in computing the alphas, and once I have the alphas, my new representation z is going to be a weighted sum of the v's. You have all three vectors participating in this computation: q and k participate in computing the alphas, and then there is the v, just as you had a V sitting outside earlier, if you remember. You had that V_a transpose multiplied by the tanh of everything happening inside, where s_{t-1} and h_j were participating, and finally you got a vector and multiplied it by V_a. Similarly, the v_j's here are participating in the final computation. So that's how Z1 is going to be computed.

Now how do you compute Z2? Same story, everything remains the same, except now you would have q2 as the query: your e_2's would be the dot products between q2 and all the key vectors, then you compute the alphas as the softmax (these are all alpha_2's), and you compute Z2 as a weighted sum of your five input value vectors, where the weights come from the alphas. So it's very easy to understand now. You had h1 to h5; from each of these you computed three vectors using the query transformation, the key transformation, and the value transformation. The query vector is used to find the importance of all the words with respect to this word: the query is fixed, and the score vector contains the dot products between your query and all the keys that you have. Once you have computed the importances, you take a weighted sum of all the words, and for that weighted sum you look at the value vectors. That's how these three vectors get used in this computation, and I think it should be clear from the diagram and the equation how the z_i's will be computed. You'll just do this for Z1, Z2, for all the cells.

Now let's see if we can vectorize all of these computations; that is, can we compute Z1 to Z_T in one go? Here I first told you how to compute Z1, then Z2, and then similarly Z3 up to Z_T, but my whole point was that I don't want to compute these outputs sequentially. That's why I didn't like RNNs: if this was the RNN block, these outputs were coming one by one, and at the end of all this I don't want to do the same thing again. I want all of these to come out in parallel; otherwise it doesn't help me, and I might as well have stayed with RNNs.
So can I do this in parallel? Let's see how I am computing q1: it was the matrix W_Q multiplied by h1. How would I compute q2? It would be W_Q multiplied by h2. How would I compute q3? W_Q multiplied by h3, and so on all the way up to h_T, where W_Q multiplied by h_T gives me q_T. So now I can just write this as a matrix operation: you can put all of these vectors inside a matrix, and if you multiply W_Q by that matrix, you get all the q's as the output. All the query vectors can be computed in one go by just multiplying these two matrices, W_Q multiplied by the matrix containing all the inputs as columns.

Now what is the dimension of Q? Let's look at that. Say the input dimension was d, and I'll just take d as 64 for the purpose of explanation, so the input is a 64 x T matrix: each of these columns is 64-dimensional and you have T such entries. Now this gets multiplied by a 64 x 64 matrix, so you could have thought of this as d x d multiplied by d x T, and these two matrices multiply to give a d x T output; in my case I have just taken d as 64. So this Q will be the same size as your input representation, but you could have different sizes also: if you had chosen the projection dimension as d1, then your output would be d1 x T, and d1 could be either bigger than d or smaller than d, based on whether you want to project to a higher or a smaller space. But that is not the main point here. The main point is that you had T inputs and you have T outputs; this capital T remains the same. You had T input representations and now you have got T query representations from those T input representations. What you choose as the size d1 is up to you; in this example I have chosen d1 = d = 64, so if my input was 64 x T, my output is also 64 x T. What is important is the T: you had T inputs and you got T outputs in parallel; you did the entire computation in parallel.

Similarly, you can compute your key vectors in parallel: W_K multiplied by h1 gives you k1, W_K multiplied by h2 gives you k2, and so on, so you might as well stack them up in a matrix and do these multiplications, and you'll get k1 to k_T in parallel. Again, the key thing is that you had T inputs and you got T outputs in parallel, so your K matrix is also going to be something x T. And lastly, the same holds for the value matrix: you'll get the T value vectors in parallel. So now you have already parallelized the computation of the K, Q, and V; at least that I don't need to do sequentially.
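As a hedged illustration of this point (names and values are assumed, not from the video), the per-word loop and the single matrix product give identical query vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 64, 5
H = rng.standard_normal((d, T))      # columns h1 ... hT
W_q = rng.standard_normal((d, d))

# Sequential: one query vector at a time.
Q_loop = np.stack([W_q @ H[:, t] for t in range(T)], axis=1)

# Parallel: all T query vectors in a single matrix multiplication.
Q_mat = W_q @ H                      # (d, T)

print(np.allclose(Q_loop, Q_mat))    # True
```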
Now, can I do the rest of it in parallel as well? That's the question. Let's see how I can compute the entire output in parallel; that means, can I compute all the Z's in parallel? And yes, you can. This is what your Z matrix would look like: it again has these capital-T outputs, and all of them can be computed in parallel by using this equation. How does that make sense? If you remember, I had started with this wish list, with these words 'I am enjoying ...', and I had to compute a T x T attention matrix. Is that what is happening here? Indeed. Q was 64 x T, so Q transpose would be T x 64, and K was 64 x T, so when you multiply these two matrices you get a T x T matrix, which is essentially the attention matrix. And now, once you have the attention weights, you multiply them by the value matrix, which contains v1, v2, up to v_T. If you take the product of these two matrices, this is V transpose, which is T x d, and the attention matrix is T x T, so the two multiply to give you a T x d output. You'll get capital-T z's, each of which is d-dimensional. So you can do the entire computation in parallel, and you can go and check this; in fact, we can just do it right away.
What I would like to show is that if you do this large matrix multiplication, Z2 indeed comes out to be the same as what we had seen in the figure. So let's try to do that from scratch. I had this T x T matrix, which was Q transpose K. The rows of Q transpose are q1 transpose, q2 transpose, and so on, and the columns of K are k1, k2, and so on. So the (i, j)-th entry of this matrix is just q_i transpose k_j, and in particular, the second row of this matrix is going to be q2 transpose k1, q2 transpose k2, and so on up to q2 transpose k_T. Now this Q transpose K matrix is going to get multiplied by V transpose, which has v1 transpose here, v2 transpose here, and so on. If I multiply these two, what is the second row of the product going to be? It is going to be a linear combination of all the rows of V transpose, where the weights come from the second row of Q transpose K. And that's exactly what I wanted: I wanted Z2 to be q2 transpose k1 (which, after normalization, is alpha_{2,1}) multiplied by v1, plus alpha_{2,2} multiplied by v2, and so on. That's exactly the computation happening here, and that's why this entire thing can be produced as a matrix-matrix multiplication and all of it can be done in parallel. Just to recap: your v's can be computed in parallel, your k's can be computed in parallel, your q's can be computed in parallel, and once you have that, you get all the z's in parallel in one shot. That's what is happening here.
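The script's verification can also be checked numerically. This sketch assumes the lecture's column-per-word convention and random matrices in place of learned ones:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, T = 64, 5
Q = rng.standard_normal((d, T))
K = rng.standard_normal((d, T))
V = rng.standard_normal((d, T))

# Matrix form: all rows of Z at once.
Z = softmax(Q.T @ K / np.sqrt(d)) @ V.T          # (T, d)

# Per-word form for Z2 (index 1): scores, softmax, weighted sum of v's.
e2 = np.array([Q[:, 1] @ K[:, j] for j in range(T)]) / np.sqrt(d)
alpha2 = softmax(e2)
z2 = sum(alpha2[j] * V[:, j] for j in range(T))

print(np.allclose(Z[1], z2))  # True: the matrix row equals the hand computation
```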
So you are able to parallelize the entire computation of your Z. And what actually is Z, just in case you have forgotten: you have the input h1, h2, all the way up to h_T, and Z is the output of this network. My main goal was that this output should be computed in parallel, as compared to RNNs, where I was getting output 1, then 2, then 3, and then 4. But now I have shown you that all of this can be done in parallel: you just need to execute this matrix multiplication, which in turn contains many matrix multiplications. V itself comes from a matrix multiplication, K comes from a matrix multiplication, Q comes from a matrix multiplication, and once you have those, you do these matrix multiplications and you get Z in one go. All of this is parallelizable; I don't need to wait to compute Z_{T-1} before computing Z_T. That's the main takeaway here.

And you see something else here: you're scaling by the square root of the dimension, and this d was 64 in the examples that I had done. There is some justification for why you need to do that, which I'll not go into, but for all practical purposes you're taking the dot product and then scaling it by some value. This is what the dimensions look like, as I had already explained, so you get all the capital-T outputs, and since you're taking the dot product and then scaling it, this is known as scaled dot-product attention.

And this is how you should look at the series of operations happening here: you had Q, and you had K and V; you did a matrix multiplication to get Q transpose K; then you did a scaling to get the scaled output; then you did a softmax on that; and whatever you got from the softmax, which is a matrix, again T x T, is going to get multiplied by a T x 64 matrix, and that's the matmul that I'm talking about here. Then finally you get a T x 64 output. So this is the self-attention layer block that you have; this is what the full block looks like. It starts with these linear transformations, then the matrix multiplication, the softmax, and then again a matrix multiplication to give you the capital Z at the output, which in turn contains Z1, Z2, all the way up to Z_T.

So we have been able to see the self-attention layer, which does a parallel computation to give you a contextual representation for the input vectors that you had provided. I'll stop here, and in the next lecture we'll look at multi-headed attention, and then I'll talk about a few other components of the Transformer network. Thank you.