Illustrated Guide to Transformers Neural Network: A step by step explanation

The AI Hacker
27 Apr 2020 · 15:01

Summary

TLDR: The video explains how transformer models use an attention mechanism to achieve state-of-the-art natural language processing, overcoming the limitations of recurrent neural networks. It builds an intuitive understanding of the attention mechanism using a text-generation example. The transformer architecture contains encoder and decoder modules built from multi-headed self-attention and point-wise feed-forward layers. Attention allows the model to focus on the relevant words in the context. The video walks through how the encoder maps the input into an abstract representation and how the decoder generates the output step by step, using attention weights and masking to prevent the decoder from seeing future words.

Takeaways

  • 😲 Transformers are attention-based encoder-decoder neural networks that outperform RNNs for sequence tasks.
  • 👀 The attention mechanism allows the model to focus on relevant words when encoding and decoding.
  • 📝 Transformer encoder maps input to an abstract representation with positional encodings.
  • 🔁 Encoder layers have multi-headed self-attention and feedforward layers.
  • ⏭ Decoder generates output sequences auto-regressively.
  • 🤝 Decoder attends to encoder output to focus on relevant input.
  • 😎 Masking prevents decoder from conditioning on future tokens.
  • 🧠 Multiple stacked encoder and decoder layers boost representational power.
  • 🤖 Transformers achieve state-of-the-art results for tasks like translation and text generation.
  • ⬆️ The attention mechanism is key to the transformer architecture.

Q & A

  • How do transformers work compared to RNNs like GRUs and LSTMs?

    -Transformers use an attention mechanism that lets them learn dependencies between words regardless of their distance in a sequence. This gives them a potentially infinite context window, whereas RNNs like GRUs and LSTMs have a fixed-size hidden state and struggle with longer-term dependencies.

  • What is the encoder-decoder architecture in transformers?

    -The encoder maps an input sequence to a continuous representation with all learned information. The decoder then takes this representation to generate an output sequence step-by-step while attending to the input sequence via the encoder output.

  • What are multi-headed self attention and masking in transformers?

    -Multi-headed self-attention allows the model to associate each word in the input with the other words in the input. Masking prevents the decoder from attending to future words that have not yet been generated.

  • How does the decoder generate the output sequence?

    -The decoder is auto-regressive, using its own previous outputs as additional inputs. It attends to the relevant encoder output and its own previous outputs to predict the next word probabilistically.

  • What is the purpose of having multiple stacked encoder and decoder layers?

    -Stacking layers allows the model to learn different combinations of attention, increasing its representational power and predictive abilities.

  • Why use residual connections and layer normalization?

    -Residual connections help gradients flow directly through the network during training. Layer normalization stabilizes the network, substantially reducing the training time needed.

  • What are the key components of the transformer architecture?

    -The key components are the encoder-decoder structure, multi-headed self attention, positional encodings, residual connections, layer normalization and masking.

  • How are transformers used in NLP applications?

    -Transformers have led to state-of-the-art results in tasks like machine translation, question answering, summarization and many others.

  • Why were transformers introduced?

    -Transformers were introduced to overcome the limitations of RNN models like GRUs and LSTMs in learning long-range dependencies, replacing recurrence with the self-attention mechanism.

  • What was the breakthrough application of transformers?

    -The paper 'Attention is All You Need' introduced transformers for neural machine translation, massively outperforming prior sequence-to-sequence models.

Outlines

00:00

😃 Introducing Transformers

The first paragraph introduces transformers, describing them as powerful natural language processing models that are breaking records and pushing the state of the art in applications like machine translation, chatbots, and search engines. It notes that transformers are popular in deep learning today.

05:03

😊 Explaining Attention Mechanism

The second paragraph explains the attention mechanism, which is key to understanding transformers. It gives an intuitive example of a text-generation model writing a sci-fi story and shows how the model can reference relevant words while generating new text using attention. The power of attention is that, in theory, it has an infinite window to reference.

10:04

🤓 Walkthrough of Transformer Model

The third paragraph provides a walkthrough explaining the transformer model mechanics on a conversational chatbot example. It steps through the encoder and decoder layers, explaining embedding layers, multi-headed self-attention, masking, residual connections, and more.

Keywords

💡Transformers

Transformers are a novel neural network architecture introduced in the paper 'Attention Is All You Need'. They rely entirely on attention mechanisms and have no recurrence, unlike previous models like RNNs and LSTMs. Transformers have revolutionized natural language processing by achieving state-of-the-art results on tasks like machine translation. The video explains how transformers work by walking through an example of using them to build a conversational chatbot.

💡Attention mechanism

The attention mechanism is the key innovation that enables transformers to achieve superior performance. It allows the model to learn what words or parts of the input sequence to 'attend' to when generating each output. For example, when outputting the word 'fine' in the response 'I am fine', the model can learn to heavily attend to the input word 'you' to determine the appropriate response. Attention allows transformers to incorporate context from the entire input sequence.

💡Encoder

The encoder is the component of a transformer that processes the input sequence. It uses stacked encoder layers, each consisting of a multi-headed self-attention module followed by a feedforward network. The encoder maps the input into a continuous vector representation containing information about all words and their relationships, as determined by the attention mechanism.

💡Decoder

The decoder is the component that uses the encoder output to generate the target sequence. Like the encoder, it stacks decoder layers containing multi-headed attention. The decoder attends to the encoder output as well as its own previous outputs. It is auto-regressive, generating the output sequence one word at a time while preventing attention to future words.
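
As a schematic of this auto-regressive loop, here is a minimal Python sketch; `decoder_forward` is a hypothetical stand-in for a trained decoder stack (plus the encoder output it attends to), not a real API:

```python
import numpy as np

def greedy_decode(decoder_forward, encoder_output, start_id, end_id, max_len=50):
    """Generate one token at a time, feeding previous outputs back in as inputs."""
    generated = [start_id]
    for _ in range(max_len):
        # Hypothetical call: returns a probability distribution over the vocabulary
        # for the next token, given the encoder output and the tokens so far.
        probs = decoder_forward(encoder_output, generated)
        next_id = int(np.argmax(probs))   # take the index of the highest probability
        generated.append(next_id)
        if next_id == end_id:             # stop once the end token is predicted
            break
    return generated

# Toy usage with a fake decoder that always predicts token id 1 (the end token here).
fake_decoder = lambda enc_out, tokens: np.eye(10)[1]   # one-hot "distribution" over 10 words
print(greedy_decode(fake_decoder, encoder_output=None, start_id=0, end_id=1))  # [0, 1]
```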

💡Multi-headed attention

Multi-headed attention is used in both the encoder and decoder. It allows the model to focus on different aspects of the sequence by splitting the input into multiple 'heads' and applying scaled dot-product attention to each. For example, one head may focus on positional relationships while another attends to semantic similarity. The multiple outputs are concatenated, allowing the model to integrate different types of relevant information.

💡Residual connections

Residual connections are used throughout the transformer, where the output of each sub-layer is added to its input and then normalized. These connections allow information and gradients to flow directly across stacks of layers, avoiding the vanishing gradient problem and enabling effective training of very deep models.

💡Positional encoding

Since transformers have no recurrence, positional encoding is used to inject information about the order of sequence elements. For example, sine and cosine functions are used to encode the position of each word. This allows the model to incorporate sequential relationships learned during training.
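
A short NumPy sketch of the sinusoidal encoding, following the sine/cosine formulation from 'Attention Is All You Need' (sine on even embedding indices, cosine on odd ones); the sequence length and dimension here are arbitrary:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position vectors that get added to the word embeddings."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(d_model)[None, :]               # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                 # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])            # even indices: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])            # odd indices: cosine
    return pe

# Example: add position information to the embeddings of a 4-word input, d_model = 8.
embeddings = np.zeros((4, 8))
print((embeddings + positional_encoding(4, 8)).shape)  # (4, 8)
```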

💡Masking

Masking prevents the decoder from attending to future sequence positions that it has not yet generated. For example, when generating the word 'am', the decoder multi-headed attention is masked to prevent attention to the subsequent word 'fine'. This forces a uni-directional generation order.
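
As a rough illustration, a look-ahead mask for a three-token target such as 'I am fine' can be built and applied like this; the uniform scores are just a stand-in for real scaled attention scores:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def look_ahead_mask(size):
    """Zeros on and below the diagonal, -inf above it (the future positions)."""
    return np.triu(np.full((size, size), -np.inf), k=1)

scores = np.ones((3, 3))              # stand-in for the scaled attention scores
masked = scores + look_ahead_mask(3)  # the mask is added after scaling, before the softmax
print(np.round(softmax(masked), 2))
# The row for 'am' (index 1) keeps weight on 'I' and 'am' only; 'fine' gets weight 0.
```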

💡Backpropagation

Backpropagation is the algorithm used to train neural networks, including transformers. The loss from the output is propagated backwards through the network layers to update the model weights and minimize the loss. Many passes of backpropagation over large datasets enable transformers to learn the complex language relationships captured by their attention mechanisms.

💡Recurrent neural networks

Recurrent neural networks like LSTMs were previously state-of-the-art for sequence tasks like language modeling. The video explains how transformers have surpassed RNNs in performance. RNNs suffer from short-term memory, while transformers use attention to incorporate long-range context. This gives transformers superior ability to process long sequences.

Highlights

Transformers are taking the NLP world by storm with their incredible performance

Transformers use an attention mechanism to achieve state-of-the-art results by associating words with other relevant words

Attention allows models to reference the entire context while RNNs have a limited short-term memory

The Transformer introduced a novel attention-based encoder-decoder architecture

The encoder maps inputs to an abstract representation; the decoder generates outputs from that representation

Multi-headed self attention allows the model to associate each word with every other word

Queries, keys and values are computed to determine which words to focus attention on

Residual connections allow gradients to flow directly through the network layers

The decoder generates text autoregressively, attending only to past generated words

Masking prevents the decoder from looking ahead at future words

The decoder's second attention layer matches the encoder's outputs to the decoder's own previous outputs

Multiple encoder and decoder layers allow learning different attention representations

The Transformer outperforms RNNs on longer sequences because it does not suffer from their short-term memory limits

Transformers now enable unprecedented NLP achievements across many applications

The attention mechanism is key to the Transformer's state-of-the-art performance

Transcripts

Transformers are taking the natural language processing world by storm. These incredible models are breaking multiple NLP records and pushing the state of the art. They are used in many applications like machine translation and conversational chatbots, and they even power better search engines. Transformers are all the rage in deep learning nowadays, but how do they work, and why have they outperformed the previous kings of sequence problems, recurrent neural networks like GRUs and LSTMs? You've probably heard of famous transformer models like BERT, GPT, and GPT-2. In this video, we'll focus on the one paper that started it all: 'Attention Is All You Need'.

To understand transformers, we first must understand the attention mechanism. To get an intuitive understanding of it, let's start with a fun text-generation model that's capable of writing its own sci-fi novel. We'll prime the model with an arbitrary input, and the model will generate the rest. Okay, let's make the story interesting: 'As aliens entered our planet and began to colonize Earth, a certain group of extraterrestrials began to manipulate our society through their influence of a certain number of the elite of the country, to keep an iron grip over the populace.' By the way, I didn't just make this up; this was actually generated by OpenAI's GPT-2 transformer model. Shout out to Hugging Face for an awesome interface to play with; I'll provide a link in the description. Okay, so the model is a little dark, but what's interesting is how it works. As the model generates text word by word, it has the ability to reference, or attend to, words that are relevant to the word being generated. How the model knows which words to attend to is all learned during training with backpropagation. RNNs are also capable of looking at previous inputs, but the power of the attention mechanism is that it doesn't suffer from short-term memory. RNNs have a shorter window to reference from, so when the story gets longer, they can't access words generated earlier in the sequence. This is still true for GRUs and LSTMs, although they do have a bigger capacity to achieve longer-term memory and therefore a longer window to reference from. The attention mechanism, in theory and given enough compute resources, has an infinite window to reference from, and is therefore capable of using the entire context of the story while generating the text.

This power was demonstrated in the paper 'Attention Is All You Need', where the authors introduced a novel neural network called the transformer, an attention-based encoder-decoder architecture. On a high level, the encoder maps an input sequence into an abstract continuous representation that holds all the learned information of that input. The decoder then takes that continuous representation and, step by step, generates a single output while also being fed its previous outputs. Let's walk through an example.

The 'Attention Is All You Need' paper applied the transformer model to a neural machine translation problem. Our demonstration of the transformer model will be a conversational chatbot: the example will take the input text 'Hi, how are you' and generate the response 'I am fine'. Let's break down the mechanics of the network step by step. The first step is feeding our input into a word embedding layer. A word embedding layer can be thought of as a lookup table that grabs a learned vector representation of each word. Neural networks learn through numbers, so each word maps to a vector with continuous values to represent that word.
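
As a concrete illustration of the lookup-table idea, here is a minimal NumPy sketch; the toy vocabulary, the embedding size, and the random matrix are all made up, since a real transformer learns these values during training:

```python
import numpy as np

# Toy vocabulary for the chatbot example (hypothetical token ids).
vocab = {"<start>": 0, "<end>": 1, "hi": 2, "how": 3, "are": 4, "you": 5, "i": 6, "am": 7, "fine": 8}
d_model = 8  # embedding dimension (the paper uses 512)

# In a trained model these values are learned; here they are random placeholders.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), d_model))

def embed(tokens):
    """Look up the learned vector for each token id."""
    ids = [vocab[t] for t in tokens]
    return embedding_matrix[ids]          # shape: (seq_len, d_model)

x = embed(["hi", "how", "are", "you"])
print(x.shape)  # (4, 8)
```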

The next step is to inject positional information into the embeddings. Because the transformer encoder has no recurrence like recurrent neural networks, we must add information about the positions into the input embeddings. This is done using positional encoding, and the authors came up with a clever trick using sine and cosine functions. We won't go into the mathematical details of the positional encodings in this video, but here are the basics: for every odd time step, create a vector using the cosine function; for every even time step, create a vector using the sine function; then add those vectors to their corresponding embedding vectors. This successfully gives the network information on the position of each vector. The sine and cosine functions were chosen in tandem because they have linear properties the model can easily learn to attend to.

Now we have the encoder layer. The encoder layer's job is to map the input sequence into an abstract continuous representation that holds the learned information for the entire sequence. It contains two sub-modules: multi-headed attention, followed by a fully connected network. There are also residual connections around each of the two sub-modules, followed by a layer normalization.

To break this down, let's look at the multi-headed attention module. Multi-headed attention in the encoder applies a specific attention mechanism called self-attention. Self-attention allows the model to associate each individual word in the input with the other words in the input. So in our example, it's possible that our model can learn to associate the word 'you' with 'how' and 'are'. It's also possible that the model learns that words structured in this pattern are typically a question, so it responds appropriately. To achieve self-attention, we feed the input into three distinct fully connected layers to create the query, key, and value vectors. What are these vectors exactly? I found a good explanation on Stack Exchange stating that the query, key, and value concept comes from retrieval systems. For example, when you type a query to search for some video on YouTube, the search engine maps your query against a set of keys (video title, description, etc.) associated with candidate videos in the database, then presents you with the best-matched videos. Let's see how this relates to self-attention. The queries and keys undergo a dot-product matrix multiplication to produce a score matrix. The score matrix determines how much focus a word should put on the other words: each word has a score corresponding to every other word in the time step, and the higher the score, the more the focus. This is how queries are mapped to keys. Then the scores get scaled down by dividing by the square root of the dimension of the queries and keys. This allows for more stable gradients, as multiplying values can have exploding effects. Next, you take the softmax of the scaled scores to get the attention weights, which gives you probability values between 0 and 1. By doing the softmax, the higher scores get heightened and the lower scores are depressed; this allows the model to be more confident about which words to attend to. Then you take the attention weights and multiply them by your value vectors to get an output vector. The higher softmax scores keep the values of the words the model learns are more important, while the lower scores drown out the irrelevant words. Finally, you feed the output vector into a linear layer for further processing.
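
Written out as code, the computation just described looks roughly like the following sketch (the sizes and the random inputs are purely illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V: score, scale, softmax, then weight the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # score matrix: queries mapped against keys
    weights = softmax(scores, axis=-1)  # attention weights between 0 and 1
    return weights @ V, weights         # weighted sum of the value vectors

# Illustrative input: 4 tokens ('hi', 'how', 'are', 'you') with 8-dimensional vectors.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # (4, 8) (4, 4)
```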

To make this a multi-headed attention computation, you need to split the query, key, and value into N vectors before applying self-attention. The split vectors then go through the same self-attention process individually; each self-attention process is called a head. Each head produces an output vector, and these get concatenated into a single vector before going through a final linear layer. In theory, each head learns something different, therefore giving the encoder model more representational power. Okay, so that's multi-headed attention. To sum it up, multi-headed attention is a module in the transformer network that computes the attention weights for the input and produces an output vector with encoded information on how each word should attend to all the other words in the sequence.
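
A rough, self-contained sketch of the split-and-concatenate step; the projection matrices stand in for the learned linear layers, and the per-head attention repeats the scaled dot-product computation from above:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """Split Q, K, V into heads, attend per head, concatenate, then project."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                      # three distinct linear layers
    # Reshape (seq_len, d_model) -> (num_heads, seq_len, d_head)
    split = lambda M: M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # per-head score matrices
    heads = softmax(scores) @ Vh                           # per-head output vectors
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate the heads
    return concat @ Wo                                     # final linear layer

# Illustrative sizes: 4 tokens, d_model = 8, 2 heads; weights would normally be learned.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv, Wo = (rng.normal(size=(8, 8)) for _ in range(4))
print(multi_head_self_attention(x, Wq, Wk, Wv, Wo, num_heads=2).shape)  # (4, 8)
```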

Next, the multi-headed attention output vector is added to the original input; this is called a residual connection. The output of the residual connection goes through a layer normalization. The normalized residual output then gets fed into a point-wise feed-forward network for further processing. The point-wise feed-forward network is a couple of linear layers with a ReLU activation in between. The output of that is again added to the input of the point-wise feed-forward network and further normalized. The residual connections help the network train by allowing gradients to flow through the network directly. The layer normalizations are used to stabilize the network, which substantially reduces the training time needed. And the point-wise feed-forward layers are used to further process the attention output, potentially giving it a richer representation.
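
The 'add and normalize' pattern and the point-wise feed-forward network can be sketched like this (random weights for illustration; `attention_output` is a stand-in for the multi-headed attention result):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance (learned scale/shift omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def point_wise_feed_forward(x, W1, b1, W2, b2):
    """Two linear layers with a ReLU activation in between, applied at each position."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# Illustrative sizes: 4 tokens, d_model = 8, inner dimension 32 (the paper uses 512 and 2048).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                 # embeddings entering the encoder layer
attention_output = rng.normal(size=(4, 8))  # stand-in for the multi-headed attention output

h = layer_norm(x + attention_output)        # residual connection, then layer normalization
W1, b1 = rng.normal(size=(8, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)
out = layer_norm(h + point_wise_feed_forward(h, W1, b1, W2, b2))  # second residual + norm
print(out.shape)  # (4, 8)
```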

And that wraps up the encoder layer. All of these operations are for the purpose of encoding the input into a continuous representation with attention information. This will help the decoder focus on the appropriate words in the input during the decoding process. You can stack the encoder N times to further encode the information, where each layer has the opportunity to learn different attention representations, therefore potentially boosting the predictive power of the transformer network.

Now we move on to the decoder. The decoder's job is to generate text sequences. The decoder has sub-layers similar to the encoder's: two multi-headed attention layers and a point-wise feed-forward layer, with residual connections and layer normalization after each sub-layer. These sub-layers behave similarly to the layers in the encoder, but each multi-headed attention layer has a different job. The decoder is capped off with a linear layer that acts like a classifier, and a softmax to get the word probabilities. The decoder is autoregressive: it takes in the list of previous outputs as input, as well as the encoder outputs that contain the attention information from the input. The decoder stops decoding when it generates an end token as its output. Let's walk through the decoding steps.

The input goes through an embedding layer and a positional encoding layer to get positional embeddings. The positional embeddings get fed into the first multi-headed attention layer, which computes the attention scores for the decoder's input. This multi-headed attention layer operates slightly differently. Since the decoder is autoregressive and generates the sequence word by word, you need to prevent it from conditioning on future tokens. For example, when computing attention scores on the word 'am', you should not have access to the word 'fine', because that word is a future word that was generated afterwards. The word 'am' should only have access to itself and the words before it. This is true for all other words: they can only attend to previous words. We need a method to prevent computing attention scores for future words, and this method is called masking. To prevent the decoder from looking at future tokens, you apply a look-ahead mask. The mask is added before calculating the softmax and after scaling the scores. Let's take a look at how this works. The mask is a matrix that's the same size as the attention scores, filled with zeros and negative infinities. When you add the mask to the scaled attention scores, you get a matrix of scores with the top-right triangle filled with negative infinities. The reason for this is that once you take the softmax of the masked scores, the negative infinities get zeroed out, leaving zero attention scores for future tokens. As you can see, the attention scores for 'am' have values for itself and all the words before it, but zero for the word 'fine'. This essentially tells the model to put no focus on those words. This masking is the only difference in how the attention scores are calculated in the first multi-headed attention layer. This layer still has multiple heads that the masks are applied to, before the heads get concatenated and fed through a linear layer for further processing. The output of the first multi-headed attention is a masked output vector with information on how the model should attend to the decoder's inputs.

Now on to the second multi-headed attention layer. For this layer, the first multi-headed attention layer's outputs are the queries, and the encoder's outputs are the keys and values. This process matches the encoder's input to the decoder's input, allowing the decoder to decide which encoder input is relevant to focus on. The output of the second multi-headed attention goes through a point-wise feed-forward layer for further processing. The output of the final point-wise feed-forward layer goes through a final linear layer that acts as a classifier. The classifier is as big as the number of classes you have; for example, if you have 10,000 classes for 10,000 words, the output of that classifier will be of size 10,000. The output of the classifier then gets fed into a softmax layer, which produces probability scores between 0 and 1 for each class. We take the index of the highest probability score, and that equals our predicted word. The decoder then takes that output, adds it to the list of decoder inputs, and continues decoding again until an end token is predicted. In our case, the highest-probability prediction is the final class, which is assigned to the end token. This is how the decoder generates the output. The decoder can be stacked N layers high, each layer taking in inputs from the encoder and from the layers before it. By stacking layers, the model can learn to extract and focus on different combinations of attention from its attention heads, potentially boosting its predictive power.

And that's it: that's the mechanics of the transformer. Transformers leverage the power of the attention mechanism to make better predictions. Recurrent neural networks try to achieve similar things, but because they suffer from short-term memory, transformers are usually better, especially if you want to encode or generate longer sequences. Thanks to the transformer architecture, the natural language processing industry can now achieve unprecedented results. If you found this helpful, hit that like and subscribe button, and let me know in the comments what you'd like to see next. Until next time, thanks for watching.