Illustrated Guide to Transformers Neural Network: A step by step explanation

The AI Hacker
27 Apr 2020 · 15:01

Summary

TL;DR: This video gives a detailed walkthrough of how the transformer model works. The model relies on the attention mechanism, which lets it relate every word in the input sequence to every other word. Using a chatbot example, the video steps through the transformer's encoder and decoder and shows how attention helps the model make better predictions. Overall, it gives viewers a deep understanding of the power of this attention-based model.

Takeaways

  • 😀 Transformers achieved a breakthrough on NLP tasks through the attention mechanism
  • 😊 Transformers can reference very long context, unlike RNNs, which suffer from short-term memory
  • 🤔 Multi-head attention lets the model learn associations between different words in the input
  • 😯 Positional encoding injects order information into the input, since transformers have no recurrence
  • 😀 Self-attention applies the query, key, and value attention mechanism to the input itself
  • 🤔 The encoder encodes the input sequence with self-attention and a feed-forward network
  • 😊 The decoder masks future tokens so that it only attends to the past
  • 🧐 Feed-forward networks and residual connections help the network train better
  • 😀 Stacking encoder and decoder layers can boost the model's predictive power
  • 🥳 The transformer architecture has produced unprecedented results in NLP

Q & A

  • What is the transformer's core mechanism?

    - The transformer's core mechanism is attention, which lets the model learn to focus on different parts of the input sequence and therefore make better predictions.

  • What is multi-head attention?

    - Multi-head attention runs several attention layers, or "heads", in parallel; each head independently learns a different attention representation, and the heads are then combined to produce a richer representation.

  • Why is positional encoding needed?

    - Because the transformer encoder has no recurrence the way an RNN does, positional encodings are added to the embedding vectors to inject position information, so the model can distinguish words at different positions.

  • What does the encoder do?

    - The encoder maps the input sequence to a continuous representation carrying attention information, which helps the decoder focus on the relevant parts of the input during decoding.

  • What is the key property of the decoder?

    - The decoder is autoregressive: it takes its previous outputs as the current input, along with the encoder's output, and uses masked attention to prevent the model from seeing future words.

  • Why do transformers often outperform RNNs?

    - Thanks to the attention mechanism, a transformer in theory has unlimited working memory and can handle long sequences better, whereas RNNs are limited by short-term memory.

  • Where are transformers mainly applied?

    - Transformers are widely used in natural language processing tasks such as machine translation, open-domain question answering, and speech recognition, with remarkable results.

  • What is BPE?

    - BPE (Byte Pair Encoding) is a tokenization algorithm that iteratively merges frequently co-occurring character pairs to build tokens, improving the model's vocabulary coverage (a minimal merge step is sketched just after this list).

  • Why do transformers need large amounts of training data?

    - Transformers have many parameters and need large amounts of labeled data to train effectively and avoid overfitting; with too little data, their performance drops noticeably.

  • What are the transformer's main drawbacks?

    - Transformers are computationally expensive, requiring substantial compute resources as well as large labeled datasets for effective training, which limits where they can be applied.
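
To make the BPE answer above concrete, here is a minimal, illustrative sketch of a single merge step. The toy corpus, the function names, and the tie-breaking behaviour are assumptions made for the example, not something from the video.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a toy corpus of tokenized words."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the chosen pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word is a tuple of characters with its frequency.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}
best = most_frequent_pair(corpus)   # e.g. ('l', 'o')
corpus = merge_pair(corpus, best)   # ('lo', 'w'), ('lo', 'w', 'e', 'r'), ...
print(best, corpus)
```

A real BPE tokenizer simply repeats this merge step a fixed number of times and records the learned merges as its vocabulary.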

Outlines

00:00

🤖 Introduction to the transformer model and how it works

This section covers the importance of the transformer model in natural language processing (NLP) and its range of applications, including machine translation, conversational chatbots, and search engines. Transformers draw attention because they outperform recurrent neural networks (RNNs), long short-term memory networks (LSTMs), and gated recurrent units (GRUs) on sequence problems. The section also mentions well-known transformer models such as BERT and GPT-2, and highlights the seminal paper "Attention Is All You Need". By explaining how the attention mechanism works and why it helps, it shows how transformers can keep referencing early inputs while processing long sequences, overcoming the short-term memory limitations of RNN-style models.

05:03

🔍 The transformer encoder and multi-head attention

The second part digs into how the transformer's encoder and multi-head attention mechanism work. Through self-attention, the encoder lets the model recognize how the words in the input sequence relate to one another. Multi-head attention strengthens this process: the input is projected into query, key, and value vectors, and a series of operations (dot product, scaling, softmax, and so on) encodes the relationships between words. This lets the model learn those relationships from multiple perspectives, improving its ability to process and understand language. The section also explains how residual connections and layer normalization improve training efficiency and stability.
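
As a rough sketch of the scaled dot-product self-attention described above (random weights and a four-token sequence are stand-ins chosen for illustration; this is not the video's own code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention for one sequence; x is (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # project input into queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])      # dot-product scores, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)           # each row sums to 1: how much each word attends to the others
    return weights @ v                           # weighted sum of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                          # e.g. the four tokens of "hi how are you"
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)    # (4, 8): one context-aware vector per word
```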

10:04

📝 The transformer decoder and text generation

This part covers the transformer's decoder and its role in text generation. The decoder is structured like the encoder, but adds an extra multi-head attention layer that consumes the encoder's output, plus a masking technique that prevents the model from peeking at future words while generating. This lets the model take the preceding context into account and generate responses grounded in the key information from the input sequence. By stacking decoder layers, the model can produce complex, coherent text, demonstrating the transformer's strength in natural language generation. Finally, it explains how a linear layer and a softmax turn the decoder output into a probability distribution over the vocabulary to complete the text generation process.
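
The final linear-plus-softmax step and the autoregressive loop might be sketched as below. The `decoder_step` function, the tiny vocabulary, and the end-token id are placeholders invented for the example, not the video's implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

VOCAB = ["<start>", "i", "am", "fine", "<end>"]
END_ID = VOCAB.index("<end>")

def decoder_step(tokens, rng):
    """Placeholder for the real decoder: returns one logit per vocabulary word."""
    return rng.normal(size=len(VOCAB))

def greedy_decode(rng, max_len=10):
    tokens = [VOCAB.index("<start>")]
    while len(tokens) < max_len:
        probs = softmax(decoder_step(tokens, rng))   # classifier logits -> probabilities
        next_id = int(np.argmax(probs))              # pick the highest-probability word
        tokens.append(next_id)
        if next_id == END_ID:                        # stop once the end token is predicted
            break
    return [VOCAB[t] for t in tokens]

print(greedy_decode(np.random.default_rng(1)))
```

In practice the decoder would re-run its full stack of masked self-attention and encoder-decoder attention layers at every step; the placeholder above only mimics the classifier output at the very end.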

Keywords

💡transformer

The transformer is a neural network architecture, based on the attention mechanism, for processing sequence data. The video explains in detail how a transformer works: through its encoder-decoder structure, it uses multi-head attention to learn the context and semantics of the input sequence and generate the output. Compared with recurrent neural networks, transformers model long sequences far more effectively.
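
For readers who want to experiment, recent PyTorch versions ship an `nn.Transformer` module with this same encoder-decoder layout; a minimal usage sketch is below (the hyperparameters are arbitrary, and the embedding and positional-encoding layers that would normally feed it are omitted):

```python
import torch
import torch.nn as nn

# Encoder-decoder transformer roughly in the paper's default configuration.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       dim_feedforward=2048, dropout=0.1)

src = torch.rand(10, 32, 512)   # (source length, batch, d_model): already-embedded input
tgt = torch.rand(7, 32, 512)    # (target length, batch, d_model): already-embedded decoder input

# Look-ahead mask so each target position only attends to earlier positions.
tgt_mask = model.generate_square_subsequent_mask(tgt.size(0))

out = model(src, tgt, tgt_mask=tgt_mask)   # (7, 32, 512): one vector per target position
print(out.shape)
```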

💡attention mechanism

The attention mechanism is the core of the transformer model. It lets the model learn how each word in the input sequence relates to, and depends on, every other word. The video walks through how self-attention is computed, and how masking prevents the decoder from attending to later tokens. Attention is what gives the transformer its strong performance on long-sequence tasks.

💡encoder

In the transformer architecture, the encoder analyzes the input sequence: multi-head self-attention layers and position-wise feed-forward layers extract features and produce a continuous vector representation that aggregates the context of the entire input sequence.
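
A minimal sketch of the position-wise feed-forward sublayer mentioned here, assuming the usual two linear maps with a ReLU in between and illustrative dimensions:

```python
import numpy as np

def position_wise_ffn(x, w1, b1, w2, b2):
    """Apply the same two-layer MLP to every position of x, shape (seq_len, d_model)."""
    hidden = np.maximum(0.0, x @ w1 + b1)   # linear -> ReLU
    return hidden @ w2 + b2                 # linear back down to d_model

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32           # d_ff is the wider inner dimension
x = rng.normal(size=(seq_len, d_model))
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(position_wise_ffn(x, w1, b1, w2, b2).shape)   # (4, 8)
```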

💡decoder

The decoder generates the output sequence autoregressively, conditioned on the encoder's output. The key is masked multi-head self-attention, which prevents the decoder from attending to later tokens while decoding; the decoder can also focus on the relevant parts of the encoder output.
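
A minimal sketch of the look-ahead (causal) mask used by the decoder's masked self-attention; the random score matrix below stands in for the scaled query-key scores:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def look_ahead_mask(size):
    """0 on and below the diagonal, -inf above it (the 'future' positions)."""
    return np.triu(np.full((size, size), -np.inf), k=1)

seq_len = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))          # scaled attention scores (stand-in)
weights = softmax(scores + look_ahead_mask(seq_len))  # -inf entries become 0 after softmax
print(np.round(weights, 2))  # upper-right triangle is all zeros: no attention to future tokens
```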

💡multi-headed attention

Multi-head attention projects the attention computation into several parallel subspaces, with each head learning a different representation subspace of the input sequence. This multi-head structure diversifies the semantic information the model can learn. Both the encoder and the decoder use multi-head self-attention.
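
A minimal sketch of the split-into-heads and concatenate steps, under the assumption that `d_model` divides evenly by the number of heads (sizes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, num_heads, w_out):
    """q, k, v: (seq_len, d_model). Split into heads, attend per head, concat, project."""
    seq_len, d_model = q.shape
    d_head = d_model // num_heads
    # Reshape (seq_len, d_model) -> (num_heads, seq_len, d_head)
    split = lambda m: m.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    qh, kh, vh = split(q), split(k), split(v)
    scores = qh @ kh.transpose(0, 2, 1) / np.sqrt(d_head)        # per-head scaled scores
    heads = softmax(scores, axis=-1) @ vh                        # (num_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # put the heads back side by side
    return concat @ w_out                                        # final linear layer

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
q, k, v = (rng.normal(size=(seq_len, d_model)) for _ in range(3))
w_out = rng.normal(size=(d_model, d_model))
print(multi_head_attention(q, k, v, num_heads, w_out).shape)     # (4, 8)
```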

💡positional encoding

Because the transformer has no recurrence or convolution, positional encodings must be added to the word embeddings to indicate each token's position in the sequence. The video describes the trick of building positional encodings from sine and cosine functions.
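
A minimal sketch of the sine/cosine positional encoding as defined in the paper, where even embedding dimensions use sine and odd dimensions use cosine, and the result is simply added to the word embeddings (sizes are illustrative):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(max_len)[:, None]                 # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)  # one frequency per pair of dimensions
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                            # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)                            # cosine on odd dimensions
    return pe

embeddings = np.random.default_rng(0).normal(size=(4, 8))   # four word embeddings
with_positions = embeddings + positional_encoding(4, 8)     # inject order information
print(with_positions.shape)                                 # (4, 8)
```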

💡residual connection

A residual connection lets the output of an intermediate layer flow directly to later layers, preserving information flow and helping gradients propagate during training. In the transformer's encoder and decoder, every sub-layer is followed by a residual connection and layer normalization.
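
A minimal sketch of the "add and normalize" pattern: the sublayer's output is added back to its input and the sum is layer-normalized over the feature dimension (the sublayer here is a stand-in, and the learned scale and bias of a full layer norm are omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    """Residual connection around a sublayer, followed by layer normalization."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                   # (seq_len, d_model)
w = rng.normal(size=(8, 8))
out = add_and_norm(x, lambda h: h @ w)        # stand-in sublayer: a single linear map
print(out.mean(axis=-1).round(6), out.shape)  # per-position means ~0, shape (4, 8)
```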

💡sequence to sequence learning

Sequence-to-sequence learning uses a neural network to map an input sequence to an output sequence, as in machine translation, text summarization, and dialogue systems. The transformer's encoder-decoder structure makes it one of the strongest sequence-to-sequence models available today.

💡natural language processing

Natural language processing is the field of making computers analyze, understand, and process human language, covering applications such as machine translation, speech recognition, and sentiment analysis. The transformer's powerful language representations have made it one of the most important models in modern NLP.

💡abstract continuous representation

The video notes that the encoder maps the input sequence to an abstract, continuous vector representation that holds all the information learned from the input. This generalized intermediate representation preserves the semantics of the input and is the key to sequence-to-sequence models.

Highlights

Transformers have broken many NLP records and pushed the state of the art through the attention mechanism

The attention mechanism lets the model associate each word in the input sequence with every other word

Positional encoding adds position information to each word embedding using sine and cosine functions

Multi-head attention lets each head learn a different representation, giving the encoder more representational power

Residual connections help the network train by letting gradients flow directly through the network

The decoder is autoregressive and generates the text sequence step by step

Masking is used to prevent the decoder from attending to future tokens

The first multi-head attention layer computes attention scores for the decoder's input

The second multi-head attention layer relates the encoder output to the decoder input

Stacking multiple encoder and decoder layers boosts the model's predictive power

The decoder predicts word probability scores and picks the highest-probability word as its prediction

Transformers use the attention mechanism to produce unprecedented NLP results

Because recurrent neural networks are limited by short-term memory, transformers are often better

Transformers are well suited to encoding and decoding longer sequences

The attention mechanism has enabled the NLP field to achieve unprecedented results

Transcripts

00:00
Transformers are taking the natural language processing world by storm. These incredible models are breaking multiple NLP records and pushing the state of the art. They are used in many applications like machine language translation, conversational chatbots, and even to power better search engines. Transformers are all the rage in deep learning nowadays, but how do they work? Why have they outperformed the previous kings of sequence problems, like recurrent neural networks, GRUs, and LSTMs? You've probably heard of different famous transformer models like BERT, GPT, and GPT-2. In this video, we'll focus on the one paper that started it all, "Attention Is All You Need".

00:43
To understand transformers, we first must understand the attention mechanism. To get an intuitive understanding of the attention mechanism, let's start with a fun text generation model that's capable of writing its own sci-fi novel. We'll need to prime the model with an arbitrary input, and the model will generate the rest.

01:01
Okay, let's make the story interesting: "As aliens entered our planet and began to colonize earth, a certain group of extraterrestrials began to manipulate our society through their influence of a certain number of the elite of the country to keep an iron grip over the populace." By the way, I didn't just make this up; this was actually generated by OpenAI's GPT-2 transformer model. Shout out to Hugging Face for an awesome interface to play with; I'll provide a link in the description.

01:33
Okay, so the model is a little dark, but what's interesting is how it works. As the model generates text word by word, it has the ability to reference, or attend to, words that are relevant to the generated word. How the model knows which words to attend to is all learned while training with backpropagation. RNNs are also capable of looking at previous inputs too, but the power of the attention mechanism is that it doesn't suffer from short-term memory. RNNs have a shorter window to reference from, so when a story gets longer, RNNs can't access words generated earlier in the sequence. This is still true for GRUs and LSTMs, although they do have a bigger capacity to achieve longer-term memory, and therefore a longer window to reference from. The attention mechanism, in theory and given enough compute resources, has an infinite window to reference from, and is therefore capable of using the entire context of the story while generating the text.

02:33
This power was demonstrated in the paper "Attention Is All You Need", where the authors introduced a novel neural network called the transformer, which is an attention-based encoder-decoder type architecture. On a high level, the encoder maps an input sequence into an abstract continuous representation that holds all the learned information of that input. The decoder then takes that continuous representation and, step by step, generates a single output while also being fed the previous output. Let's walk through an example.

03:08
The "Attention Is All You Need" paper applied the transformer model to a neural machine translation problem. Our demonstration of the transformer model will be a conversational chatbot. The example will take an input text, "Hi, how are you", and generate the response, "I am fine". Let's break down the mechanics of the network step by step.

03:32
The first step is feeding our input into a word embedding layer. A word embedding layer can be thought of as a lookup table that grabs a learned vector representation of each word. Neural networks learn through numbers, so each word maps to a vector with continuous values to represent that word.

03:50
The next step is to inject positional information into the embeddings. Because the transformer encoder has no recurrence like recurrent neural networks, we must add information about the positions into the input embeddings. This is done using positional encoding. The authors came up with a clever trick using sine and cosine functions. We won't go into the mathematical details of the positional encodings in this video, but here are the basics: for every odd time step, create a vector using the cosine function; for every even time step, create a vector using the sine function; then add those vectors to their corresponding embedding vectors. This successfully gives the network information on the position of each vector. The sine and cosine functions were chosen in tandem because they have linear properties the model can easily learn to attend to.

04:47
Now we have the encoder layer. The encoder layer's job is to map the input sequence into an abstract continuous representation that holds the learned information for that entire sequence. It contains two submodules: multi-headed attention, followed by a fully connected network. There are also residual connections around each of the two submodules, followed by a layer normalization. To break this down, let's look at the multi-headed attention module.

05:15
Multi-headed attention in the encoder applies a specific attention mechanism called self-attention. Self-attention allows the model to associate each individual word in the input with the other words in the input. So in our example, it's possible that our model can learn to associate the word "you" with "how" and "are". It's also possible that the model learns that words structured in this pattern are typically a question, so it can respond appropriately.

05:47
To achieve self-attention, we feed the input into three distinct fully connected layers to create the query, key, and value vectors. What are these vectors exactly? I found a good explanation on Stack Exchange stating that the query, key, and value concept comes from retrieval systems. For example, when you type a query to search for some video on YouTube, the search engine will map your query against a set of keys (video title, description, etc.) associated with candidate videos in the database, then present you with the best-matched videos.

06:22
Let's see how this relates to self-attention. The queries and keys undergo a dot-product matrix multiplication to produce a score matrix. The score matrix determines how much focus a word should put on other words, so each word will have a score corresponding to every other word in the time step. The higher the score, the more the focus. This is how the queries are mapped to the keys. Then the scores get scaled down by dividing by the square root of the dimension of the queries and the keys. This allows for more stable gradients, as multiplying values can have exploding effects. Next, you take the softmax of the scaled scores to get the attention weights, which gives you probability values between 0 and 1. By doing the softmax, the higher scores get heightened and the lower scores are depressed. This allows the model to be more confident about which words to attend to. Then you take the attention weights and multiply them by your value vector to get an output vector. The higher softmax scores will keep the values of the words the model learns are more important, while the lower scores will drown out the irrelevant words. You feed the output vector into a linear layer to process.

07:35
To make this a multi-headed attention computation, you need to split the query, key, and value into N vectors before applying self-attention. The split vectors then go through the same self-attention process individually. Each self-attention process is called a head. Each head produces an output vector, and these get concatenated into a single vector before going through a final linear layer. In theory, each head would learn something different, therefore giving the encoder model more representation power.

08:07
Okay, so that's multi-headed attention. To sum it up, multi-headed attention is a module in the transformer network that computes the attention weights for the input and produces an output vector with encoded information on how each word should attend to all other words in the sequence.

08:27
Next step: the multi-headed attention output vector is added to the original input. This is called a residual connection. The output of the residual connection goes through a layer normalization. The normalized residual output gets fed into a pointwise feed-forward network for further processing. The pointwise feed-forward network is a couple of linear layers with a ReLU activation in between. The output of that is again added to the input of the pointwise feed-forward network and further normalized. The residual connections help the network train by allowing gradients to flow through the network directly. The layer normalizations are used to stabilize the network, which substantially reduces the training time necessary. The pointwise feed-forward layer is used to further process the attention output, potentially giving it a richer representation.

09:21
And that wraps up the encoder layer. All of these operations are for the purpose of encoding the input into a continuous representation with attention information. This will help the decoder focus on the appropriate words in the input during the decoding process. You can stack the encoder N times to further encode the information, where each layer has the opportunity to learn different attention representations, therefore potentially boosting the predictive power of the transformer network.

09:50
Now we move on to the decoder. The decoder's job is to generate text sequences. The decoder has similar sublayers to the encoder: it has two multi-headed attention layers and a pointwise feed-forward layer, with residual connections and layer normalization after each sublayer. These sublayers behave similarly to the layers in the encoder, but each multi-headed attention layer has a different job. The decoder is capped off with a linear layer that acts like a classifier, and a softmax to get the word probabilities. The decoder is autoregressive: it takes in the list of previous outputs as inputs, as well as the encoder outputs that contain the attention information from the input. The decoder stops decoding when it generates an end token as an output. Let's walk through the decoding steps.

10:42
The input goes through an embedding layer and a positional encoding layer to get positional embeddings. The positional embeddings get fed into the first multi-headed attention layer, which computes the attention scores for the decoder's input. This multi-headed attention layer operates slightly differently. Since the decoder is autoregressive and generates the sequence word by word, you need to prevent it from conditioning on future tokens. For example, when computing attention scores on the word "am", you should not have access to the word "fine", because that word is a future word that was generated afterwards. The word "am" should only have access to itself and the words before it. This is true for all other words, which can only attend to previous words. We need a method to prevent computing attention scores for future words. This method is called masking.

11:34
To prevent the decoder from looking at future tokens, you apply a look-ahead mask. The mask is added after scaling the scores and before calculating the softmax. Let's take a look at how this works. The mask is a matrix that's the same size as the attention scores, filled with values of zeros and negative infinities. When you add the mask to the scaled attention scores, you get a matrix of scores with the top-right triangle filled with negative infinities. The reason for this is that once you take the softmax of the masked scores, the negative infinities get zeroed out, leaving a zero attention score for future tokens. As you can see, the attention scores for "am" have values for itself and all the words before it, but zero for the word "fine". This essentially tells the model to put no focus on those words.

12:27
This masking is the only difference in how the attention scores are calculated in the first multi-headed attention layer. This layer still has multiple heads that the mask is applied to before they are concatenated and fed through a linear layer for further processing. The output of the first multi-headed attention is a masked output vector with information on how the model should attend to the decoder's input.

12:51
Now on to the second multi-headed attention layer. For this layer, the encoder's outputs are the queries and the keys, and the first multi-headed attention layer's outputs are the values. This process matches the encoder's input to the decoder's input, allowing the decoder to decide which encoder input is relevant to put focus on. The output of the second multi-headed attention goes through a pointwise feed-forward layer for further processing.

13:18
The output of the final pointwise feed-forward layer goes through a final linear layer that acts as a classifier. The classifier is as big as the number of classes you have. For example, if you have 10,000 classes for 10,000 words, the output of that classifier will be of size 10,000. The output of the classifier then gets fed into a softmax layer. The softmax layer produces probability scores between 0 and 1 for each class. We take the index of the highest probability score, and that equals our predicted word. The decoder then takes the output and adds it to the list of decoder inputs, and continues decoding again until an end token is predicted. For our case, the highest-probability prediction is the final class, which is assigned to the end token. This is how the decoder generates the output.

14:07
The decoder can be stacked N layers high, each layer taking in inputs from the encoder and the layers before it. By stacking layers, the model can learn to extract and focus on different combinations of attention from its attention heads, potentially boosting its predictive power.

14:24
And that's it; that's the mechanics of the transformer. Transformers leverage the power of the attention mechanism to make better predictions. Recurrent neural networks try to achieve similar things, but because they suffer from short-term memory, transformers are usually better, especially if you want to encode or generate longer sequences. Because of the transformer architecture, the natural language processing industry can now achieve unprecedented results. If you found this helpful, hit that like and subscribe button. Also, let me know in the comments what you'd like to see next, and until next time, thanks for watching.