CS480/680 Lecture 19: Attention and Transformer Networks

Pascal Poupart
16 Jul 2019 · 82:38

Summary

TLDR: This lecture introduces the attention mechanism and its applications in machine translation and natural language processing, in particular the rise of transformer networks. Since their introduction in 2017, transformers have replaced recurrent neural networks with attention, addressing long-range dependencies and speeding up training. The lecture also covers key techniques such as multi-head attention, positional encoding, and layer normalization, and compares the performance of different models on machine translation. Finally, it discusses transformer-based models such as GPT, BERT, and XLNet, showing their strong performance across many NLP tasks and the challenge they pose to the future of recurrent neural networks.

Takeaways

  • 📘 Attention has already been discussed in the context of machine translation and has led to a new class of neural networks: transformer networks.
  • 📙 "Attention is all you need", published in 2017, proposed the view that recurrent neural networks (RNNs) may no longer be necessary.
  • 🔍 Attention was first studied in computer vision to help recognize specific objects in images, mimicking how human visual attention focuses on regions of interest.
  • 🏢 In natural language processing (NLP), attention lets the decoder look back at the input sentence, so sentences of arbitrary length can be handled without memorizing the whole sentence.
  • 🚀 By using attention, transformer networks process the entire sequence at once, overcoming the long-range dependency and vanishing/exploding gradient problems of recurrent neural networks.
  • 🔄 A transformer consists of an encoder and a decoder: the encoder processes the input sequence and the decoder generates the output sequence.
  • 🌟 Multi-head attention is the core of the transformer; it lets the network attend to different parts of the sequence at the same time.
  • 📊 Layer normalization reduces the number of gradient-descent steps needed for training, speeding it up.
  • 📈 Positional encoding preserves word-order information, which is essential for understanding the meaning of a sentence.
  • 📉 Compared with recurrent neural networks, transformers have higher per-layer computational complexity, but parallel computation greatly reduces training time.
  • 🔑 Transformer networks and their variants (such as GPT, BERT, and XLNet) show superior performance on many NLP tasks, challenging the future of recurrent neural networks.

Q & A

  • What is attention?

    - Attention is a resource-allocation strategy for focusing processing on the most important parts of the input. In a neural network it lets the model selectively attend to some parts of a sequence while ignoring others, which is especially important in areas such as machine translation and image recognition.

  • How did transformer networks come about?

    - The transformer is a neural architecture proposed in 2017 that is built on attention and does not rely on recurrent neural network (RNN) structures. It can process an entire sequence in parallel, which greatly improves efficiency during training and inference.

  • Why can transformer networks do without recurrent building blocks?

    - The core of the transformer is attention, which captures long-range dependencies in sequential data and processes the whole sequence in parallel. This makes the recurrence of traditional recurrent neural networks unnecessary.

  • How does attention help recognize objects in computer vision?

    - In computer vision, attention focuses on the key regions of an image that matter for recognizing an object. A trained network can produce a heat map that highlights these regions, helping the model recognize and localize objects more accurately.

  • How do transformer networks handle long-range dependencies?

    - Transformers handle long-range dependencies through attention: at any given position the model can attend to every other position in the sequence, so information from the entire sequence is captured effectively.

  • What is multi-head attention?

    - Multi-head attention is a key component of the transformer. It performs several attention operations in parallel in different representation subspaces, which lets the model capture information about the sequence from different perspectives and increases its expressive power.

  • Why do transformers train faster than recurrent neural networks?

    - Transformers process the whole sequence in parallel, whereas recurrent neural networks must process it step by step. This parallelism lets transformers exploit modern hardware (such as GPUs) more effectively, speeding up training.

  • What does masked multi-head attention do in the transformer?

    - Masked multi-head attention ensures that, when generating a sequence, each word can depend only on the words before it and not on future words. This masking avoids improper dependencies during sequence generation.

  • What is positional encoding?

    - Positional encoding is the technique the transformer uses to give the model word-order information. Because attention itself carries no information about where words occur in the sequence, positional encoding adds vectors with a specific pattern to the input word embeddings.

  • What are the applications of transformer networks in natural language processing?

    - Transformers are used widely in natural language processing, including machine translation, text summarization, question answering, text classification, and language modeling. Their efficiency and strong representational power have made them the model of choice for many NLP tasks.

Outlines

00:00

📘 Introducing attention and transformer networks

This segment introduces the use of attention in machine translation and the idea of transformer networks. The transformer is a new class of neural network, proposed in the 2017 paper "Attention is all you need", which suggests that recurrent building blocks may no longer be needed. Using an analogy with human visual attention, it explains how attention focuses on particular regions in image recognition and how, in machine translation, it lets the decoder look back at the input sentence so that information is not lost.

05:01

🌟 Attention in computer vision and natural language processing

This segment discusses applications of attention in computer vision and natural language processing (NLP). In computer vision, a heat map shows how buildings are identified in an image; in NLP, attention helps handle sentences of arbitrary length, and language modeling predicts the next word in a text sequence. The segment also compares recurrent neural networks with transformer networks, pointing out the limitations of RNNs in terms of training steps and parallel computation.

10:05

🔗 Understanding how attention works

This segment digs into the principle behind attention by comparing it to a database query: a similarity function between the query and each key produces weights, which are then used to produce the output. It introduces different similarity measures, such as the dot product, scaled dot product, and additive similarity, and discusses how the weights are computed with a softmax and applied to the values to produce the attention value.
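As a concrete reference for this segment, here is a minimal NumPy sketch (illustrative, not code from the lecture) of the similarity measures mentioned; W, wq, and wk stand for learned parameters:

```python
import numpy as np

# Four similarity functions between a query q and a key k (both d-dimensional vectors).
def dot(q, k):
    return q @ k                                   # plain dot product

def scaled_dot(q, k):
    return q @ k / np.sqrt(k.shape[0])             # divide by sqrt of the key dimension

def general_dot(q, k, W):
    return q @ W @ k                               # project q with a learned matrix W first

def additive(q, k, wq, wk):
    return wq @ q + wk @ k                         # learned weight vectors wq and wk
```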

15:08

🛠️ Implementing attention in a neural architecture

This segment describes in detail how attention is implemented in a neural network, using queries (Q), keys (K), and values (V) to compute similarities and a softmax to determine the weights. It explains how linear transformations let the network mimic the database-retrieval process, how to choose among different similarity functions in practice, and how these functions can be learned by the network.

20:16

🔄 The transformer architecture and its advantages

This segment introduces the basic transformer architecture, including the encoder and decoder, and emphasizes the advantage of removing recurrence. It explains the concept of multi-head attention and how these attention blocks are repeated across layers to combine words, how the transformer resolves long-range dependencies, and how parallel computation speeds up training.

25:24

📚 Inside the transformer's multi-head attention

This segment looks more closely at multi-head attention: values, keys, and queries pass through linear layers, then scaled dot-product attention and a concatenation produce the output. Multi-head attention is analogous to using different filters in a convolutional neural network to create different feature maps, capturing different aspects of the text.

30:27

🎭 Masked multi-head attention and the autoregressive decoder

This segment discusses masked multi-head attention, explaining how the mask ensures that the decoder depends only on previously generated outputs, never on future ones. This preserves the autoregressive property while avoiding recurrence during training. It also covers teacher forcing, a technique that feeds the correct outputs as inputs during training.

35:28

🛡️ Why normalization and positional encoding matter

This segment emphasizes the role of the normalization layer in the transformer, which helps reduce the number of gradient-descent steps and improves the stability of the network. It introduces positional encoding and explains how adding position information to the word embeddings preserves word order, which is essential for capturing the meaning of a sentence.
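For illustration, the sinusoidal positional encoding used in the original paper can be sketched as follows (a minimal NumPy version; the function name and arguments are illustrative):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings: one d_model-dimensional vector per position."""
    pos = np.arange(max_len)[:, None]                     # positions 0..max_len-1
    i = np.arange(d_model)[None, :]                       # embedding dimensions
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                 # odd dimensions: cosine
    return pe                                             # added to the word embeddings

# e.g. embeddings = word_embeddings + positional_encoding(seq_len, d_model)
```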

40:28

📈 Comparing transformers with other networks

This segment compares transformer networks with recurrent and convolutional neural networks, discussing their computational complexity and path lengths when processing sequential data. It highlights the transformer's advantages in capturing long-range dependencies and in parallel processing, and its potential impact in practice.

45:29

🚀 Transformers for machine translation

This segment discusses the application of transformer networks to machine translation and compares their performance with other models. Transformers significantly reduce computation time; even where they do not dramatically surpass the existing state of the art, the efficiency gains are substantial.

50:30

🌐 GPT and BERT: transformer variants

This segment introduces two transformer variants: GPT, which focuses on next-word prediction for language modeling, and BERT, which uses bidirectional encoding to improve performance. It discusses how these models are applied to different natural language processing tasks, pre-trained without supervision on large amounts of data and then fine-tuned on specific tasks.

55:30

🏆 BERT and XLNet: transformer models that set a new state of the art

This segment discusses how BERT achieved significant performance gains on many tasks and introduces XLNet, an improved version of BERT that lets the model take broader context into account when making predictions. XLNet surpasses BERT on several tasks, demonstrating the power and potential of transformer networks in natural language processing.

Keywords

💡Attention

Attention is an algorithmic technique for selectively focusing on certain parts of sequential data. In the lecture it is used to improve machine translation and natural language processing; for example, in machine translation attention lets the decoder focus on the relevant parts of the input sentence to improve translation accuracy.

💡Transformer network

The transformer is a neural architecture based on attention, proposed in the 2017 paper "Attention is all you need". It abandons the traditional recurrent structure and processes sequential data more efficiently because it can handle all elements of the sequence in parallel.

💡Multi-head attention

Multi-head attention is a key component of the transformer that lets the model learn representations of the input from several perspectives at once. In the lecture, multi-head attention projects the input into different spaces, computes several attention outputs, and then merges them to capture richer information.

💡Encoder-decoder architecture

The encoder-decoder architecture is a basic component of the transformer: the encoder processes the input and the decoder generates the output. In machine translation, the encoder processes the source-language sentence and the decoder generates the target-language translation.

💡Masking

Masking is used in the decoder's self-attention to ensure that the next element of a sequence is generated only from elements that have already been produced, not from future ones. In the lecture it is described as a deterministic form of dropout that preserves the logical order of sequence generation.

💡Positional encoding

Positional encoding is the technique the transformer uses to give the model word-order information. Because attention itself carries no information about the positions of words in the sequence, positional encoding adds vectors with a specific pattern to the input word embeddings.

💡Layer normalization

Layer normalization is a technique for stabilizing and speeding up the training of deep neural networks. In the transformer it is applied to the output of each sublayer (such as multi-head attention and the feed-forward network) so that the outputs of different layers have the same scale, reducing instability during training.

💡Long-range dependencies

A long-range dependency arises when elements far apart in a sequence influence the current element. Through attention, the transformer effectively overcomes the difficulty traditional recurrent neural networks have in handling long-range dependencies.

💡BERT

BERT (Bidirectional Encoder Representations from Transformers) is a method for pre-training language representations that uses the transformer's bidirectional encoder to understand a word's context. BERT achieved breakthrough performance on many natural language processing tasks, including question answering and text classification.

💡GPT

GPT (Generative Pre-trained Transformer) is a transformer-based pre-trained language model used mainly for text generation. GPT uses a unidirectional decoder to predict the next word in a sequence and has shown strong capabilities on many natural language processing tasks.

💡XLNet

XLNet is a more recent transformer-based pre-training method that improves on BERT and GPT by letting the model consider context on both the left and the right when predicting a word, improving generalization and performance.

Highlights

Introduces the attention mechanism for tasks such as machine translation and presents a new class of neural networks: transformer networks.

The core of the transformer is attention, which can replace some of the building blocks of recurrent neural networks (RNNs).

The 2017 paper "Attention is all you need" marked a breakthrough for attention in natural language processing.

The concept of attention was first studied in computer vision for object recognition.

Heat maps can show which regions a neural network attends to during image recognition, helping validate its decision process.

Transformer networks handle long-range dependencies through attention, an advantage over recurrent neural networks.

Transformer networks train faster than recurrent neural networks and can fully exploit the parallelism of GPUs.

Explains how attention works in terms of queries, keys, and values.

Multi-head attention processes information from several perspectives at once, increasing the model's expressive power.

The transformer's encoder-decoder architecture processes an entire sentence in parallel, unlike traditional sequential processing.

Positional encoding lets the model capture word order, which is essential for language models.

Layer normalization helps reduce the number of gradient-descent steps and improves training efficiency.

On machine translation, transformers matched the state of the art of the time while significantly reducing computation.

GPT (Generative Pre-trained Transformer) performs unsupervised language modeling and can handle many natural language processing tasks.

BERT (Bidirectional Encoder Representations from Transformers) improves performance by considering context on both sides, surpassing GPT.

XLNet, an improvement over BERT, allows missing inputs and generalizes better, further improving performance on many tasks.

The development of transformer networks challenges the future of recurrent neural networks, showing advantages in both accuracy and speed.

Transcripts

play00:04

okay so for the next set of slides I'm

play00:09

gonna talk about attention in more

play00:11

detail so we discuss already attention

play00:15

in the context of machine translation

play00:17

but attention has led to a new class of

play00:21

neural networks that are known as

play00:23

transformer networks okay and this this

play00:29

material that I'm going to present today

play00:31

is actually quite recent so it's not

play00:35

described in any of the textbooks that

play00:37

are associated with this course so for

play00:40

this there is a paper that was published

play00:43

in 2017 called attention is all you need

play00:47

okay so it's a very interesting title

play00:50

and it actually does suggest that

play00:53

perhaps we don't need recurrent neural

play00:56

networks anymore and and then a lot of

play00:59

the building blocks that people would

play01:01

design, like LSTM units,

play01:04

GRUs, and other things like this

play01:06

this essentially is suggesting that we

play01:08

don't need them all we need is really

play01:11

just attention ok so let's see how this

play01:14

works ok so it turns out that the

play01:19

concept of attention was was first

play01:23

studied in the context of computer

play01:26

vision and here let's say that we have

play01:29

some images we're going to recognize

play01:31

some objects so as humans let's say that

play01:35

we have a large image and then we're

play01:38

looking for a tiny object then what we

play01:40

will do naturally with our eyes is

play01:43

roughly scan the image and eventually

play01:46

focus on some regions of interest and

play01:50

and then eventually identify the object

play01:53

and then when we identify an object we

play01:56

eventually I guess focus precisely on

play01:59

the location where the object is and

play02:02

then some studies have shown that our

play02:04

eyes are indeed focusing on the right

play02:09

regions when when we do this type of

play02:11

identification so I guess here an

play02:15

interesting question is is there a

play02:17

mechanism

play02:18

computer vision that could be similar

play02:20

and would that be beneficial and the

play02:22

answer is yes so here let's say that we

play02:26

have an image

play02:27

okay this image doesn't appear very well

play02:30

on the screen on my laptop

play02:34

it looks better so essentially you're

play02:36

supposed to see some scene where there's

play02:40

a house and a tower here so it says

play02:42

essentially two buildings and then

play02:45

there's there's the rest of the scene

play02:47

around it and let's say that we're doing

play02:49

object detection where we want to

play02:51

recognize buildings so what this shows

play02:54

here is a heat map that's overlaid on

play02:57

top of the image so that's what you

play02:58

don't actually see while the scene but

play03:01

in any case this heat map shows that the

play03:04

pixels or the region where buildings are

play03:08

more likely to be are right here in the

play03:12

red part and if you look at it more

play03:14

carefully so it turns out that there is

play03:18

indeed a house here and then there's a

play03:20

tower here okay so now if you train some

play03:25

network to do some classification an

play03:30

interesting question is when the network

play03:33

outputs a class how can we trust that it

play03:36

comes up with the right class and you

play03:39

know we could ask it if you tell me that

play03:41

indeed there's a building in this image

play03:43

can you show me where that building is

play03:46

right and this could be a nice way of at

play03:49

least validating understanding what it

play03:53

thinks is a is a building in the image

play03:56

and whether it's correct or not so so

play04:00

here we can use attention to essentially

play04:03

see which pixels are aligned with the

play04:08

concept of a building and then so the

play04:12

heat map corresponds to the weights here

play04:16

for the attention mechanism so in this

play04:19

case you could imagine having an

play04:21

attention mechanism over the entire

play04:24

image so you've got weights with respect

play04:26

to all of the pixels right and then you

play04:31

try to see I guess which pixels would

play04:35

have some semantic meaning or some

play04:39

embedding that would be essentially

play04:42

aligned with the notion of of the object

play04:45

that we're trying to recognize and then

play04:48

so some researchers demonstrated that

play04:50

you can do this and in fact here

play04:53

attention can be used to highlight the

play04:55

important parts of an image that

play04:57

contribute to the desired output so it's

play05:00

very nice in terms of explaining what is

play05:03

the decision process going on

play05:06

and it can also be used to as a building

play05:11

block as part of the recognition process

play05:14

okay so this was in computer vision then

play05:18

in natural language processing, in

play05:20

2015 we saw the work about machine

play05:23

translation where you can get your

play05:27

decoder to essentially peek or look back

play05:30

at what was the input sentence so that

play05:33

it doesn't lose track of what it's

play05:36

translating and it doesn't have to

play05:38

remember completely the sentence so so

play05:41

this was an important breakthrough that

play05:44

allowed us to deal with sentences of

play05:46

arbitrary length inside very long length

play05:49

but then further than that in 2017 some

play05:55

researchers show that we can use

play05:57

attention to develop general language

play06:00

modeling techniques so here language

play06:03

modeling simply refers to the idea of

play06:06

developing a model that would simply

play06:08

predict the next word in a sequence so

play06:11

it could generate words and then also if

play06:15

let's say we've got a word missing

play06:18

somewhere in a sequence and it could

play06:19

also recover that missing word so

play06:22

language model is essentially just a

play06:23

model that can predict or recover words

play06:26

in a sequence a lot of tasks can be

play06:31

formulated as language modeling problems

play06:35

so whenever we do translation you could

play06:38

imagine that it's just a sequence where

play06:41

we have a first part in one language

play06:43

a second part in another language and

play06:45

what we're really doing it's just

play06:47

continuing the sequence where we're

play06:49

predicting words of the second part that

play06:52

happened to be in the next language and

play06:56

for doing sentiment analysis we can

play06:59

think of it this way too so most of the

play07:02

the tasks in natural language processing

play07:04

can be cast as some form of language

play07:07

modeling and then what they showed is

play07:10

that we can also design some

play07:12

architecture that uses pretty much

play07:15

exclusively some attention blocks and

play07:19

then they call that architecture a

play07:22

transformer okay so so we'll see in more

play07:25

details now those transformer networks

play07:28

and they've become now the state of the

play07:30

art for natural language processing so

play07:32

these surpass now recurrent neural

play07:35

networks okay so if we do a comparison

play07:39

between a recurrent neural network and a

play07:43

transformer Network turns out that the

play07:45

recurrent neural network has several

play07:47

challenges the first problem that we've

play07:50

discussed a lot is how can we deal with

play07:52

long-range dependencies and here the

play07:55

solution was in fact to combine the

play07:57

recurrent neural network with some

play07:59

attention mechanism it also suffers from

play08:04

gradient vanishing and gradient explosion the

play08:08

number of steps for training recurrent

play08:10

neural networks can be quite large and

play08:13

this has to do with the fact that a

play08:15

recurrent neural network we can think of

play08:17

it as essentially an unrolled network

play08:20

that is very deep because when we unroll

play08:23

it we have to unroll it for as many

play08:25

steps as needed for a corresponding

play08:28

sequence so recurrent neural networks are

play08:31

effectively arbitrarily deep networks

play08:34

and then so they have lots of parameters

play08:37

those parameters are essentially correlated

play08:41

between each other because we in fact

play08:43

tied the weights from one step to

play08:46

another and then so the optimization

play08:49

tends to be quite difficult as a result

play08:51

and it tends to require a lot of steps

play08:54

okay so so training a recurrent

play08:57

neural network usually takes a lot longer

play08:59

than a convolutional net or a regular

play09:01

feed-forward neural net beyond the

play09:05

number of steps there's also the

play09:08

question of parallelizing computation so

play09:11

today GPUs have become key for working

play09:17

with large neural networks and then what

play09:19

they really do is that they enable us to

play09:22

do some computation in parallel but then

play09:25

if you have a recurrent neural network

play09:27

what happens is that the sequence of

play09:30

steps write the computation has to be

play09:33

done sequentially we can't process all

play09:36

those steps in parallel because of the

play09:40

recurrence ok so so there's an inherent

play09:42

problem here we can't quite leverage

play09:46

GPUs as well in the context of recurrent

play09:50

neural networks now if we consider

play09:52

transformer networks we're going to see

play09:55

in a moment that because they

play09:57

essentially use pretty much exclusively

play10:00

some attention blocks then there won't

play10:04

be any recurrence and attention will

play10:08

also help us to draw some connections

play10:11

between any part of the sequence so

play10:15

long-range dependencies won't be a

play10:17

problem

play10:18

so in fact long-range dependencies will

play10:22

have like the same likelihood of being

play10:25

taken into account as short-range

play10:27

dependencies so so that's not an issue

play10:30

anymore now gradient vanishing and

play10:34

explosion will also not be an issue

play10:36

because instead of having computation

play10:39

that goes linearly with the length of

play10:43

the sequence into a deep network for a

play10:46

recurrent neural network and a

play10:48

transformer we're going to do the

play10:50

computation for the entire sequence

play10:52

simultaneously and then just build

play10:54

several layers for that but there won't

play10:57

be so many layers in practice so so

play11:00

gradient vanishing and explosion

play11:02

won't be as much of an issue in terms of

play11:06

steps, it will also take far fewer

play11:09

steps to train

play11:10

and then because there's no recurrence

play11:12

and those networks we're gonna be able

play11:15

to do the computation in parallel for

play11:17

every step so so so this is really nice

play11:21

so I guess we've got lots of great

play11:23

advantages for those transformer

play11:25

networks okay so to introduce

play11:29

transformer networks

play11:31

let's review briefly attention so we're

play11:36

gonna see again attention but more

play11:38

generally so we've seen it already in

play11:40

the context of machine translation but

play11:43

now let's let's think about attention

play11:46

essentially being some form of

play11:49

approximation of a select that you would

play11:52

do in a database so in a database if you

play11:55

want to retrieve some value based on

play11:59

some query and and then also based on

play12:02

some key right then there's some

play12:03

operations that you can do where you can

play12:07

use a query to identify a key that

play12:11

aligns well and then simply output the

play12:14

corresponding value so we can think of

play12:17

attention and essentially mimicking this

play12:20

type of retrieval process but in a more

play12:23

fuzzy or probabilistic way okay so

play12:27

let me draw a picture just to illustrate

play12:29

how things would normally work in in a

play12:32

database and then we'll see that in our

play12:35

case we're going to enable the same type

play12:37

of computation that would normally

play12:38

happen in in database but using this

play12:41

equation here

play12:54

okay so let's say I've got a query and

play12:59

then I've got a database so in my

play13:09

database that's it I have stored some

play13:12

keys with associated values

play13:33

okay so I've got my database now when I

play13:37

issue a query all right I will I will

play13:40

see how well this query aligns with

play13:42

different keys perhaps the key that is

play13:46

the right one is let's see key number

play13:48

three then what I would do is

play13:50

essentially produce an output that

play13:52

corresponds to the value okay so so

play13:56

retrieval in the database would more or

play14:00

less correspond to this and now an

play14:02

attention mechanism is essentially a

play14:04

neural architecture that mimics this

play14:08

type of process and the way we're going

play14:10

to mimic this is by using the following

play14:14

equation so here we're going to measure

play14:19

what is the similarity between our query

play14:22

queue and each key value ki and then

play14:26

this similarity is going to return a

play14:29

weight and then we'll simply produce an

play14:32

output that is a weighted combination of

play14:34

all the values in our database now

play14:38

normally with the database when we do

play14:40

retrieval we simply return one value and

play14:44

this would correspond here to finding a

play14:47

similarity between the query and some

play14:50

key where the similarity would have

play14:53

value one and then all the other keys

play14:56

the similarity would be value zero so if

play14:59

we have this if if we've got a

play15:00

similarity function essentially produces

play15:03

a one-hot encoding right then we would

play15:08

effectively just return one value now in

play15:11

practice because we'll we'll want to

play15:15

embed this as part of a neural network

play15:17

and be able to do back propagation to

play15:21

differentiate through that type of

play15:23

operation they'll be useful for us to

play15:26

think of the similarity as instead

play15:29

computing a distribution

play15:32

so I guess weights that are between 0 &

play15:34

1 and then even if we have multiple keys

play15:37

that have some similarity the ideas that

play15:40

we're going to produce a value that's

play15:42

going to be a weighted combination based

play15:46

on

play15:46

weights okay so I guess we can think of

play15:49

this as like a generalization of the

play15:51

mechanism for a database where we make

play15:54

the retrieval process become a convex

play15:57

combination or weighted combination of

play16:00

the values for which the keys have a

play16:03

high similarity in in the database okay

play16:09

so let's try now a neural architecture

play16:15

that will correspond to this attention

play16:18

mechanism so so it's going to be

play16:20

essentially the same as what we've seen

play16:22

already in the context of machine

play16:24

translation but now we're just gonna

play16:25

make it

play16:26

domain agnostic so you're gonna see

play16:28

more generally what attention really

play16:31

corresponds to

play17:02

okay so let's say that I've got t1 t2

play17:14

actually shift this a little bit t1 t2

play17:20

t3 and t4 we're going to have our query

play17:30

and let's have s1 s2 s3 and s4 query is

play17:46

going to influence the computation of

play17:49

each one of those things so here what

play17:54

I've drawn is a first layer we're

play17:57

starting with the keys I'm gonna compute

play17:59

a similarity measure so these s's

play18:02

correspond to similarity okay so here SI

play18:08

is going to be equal to some function of

play18:12

the query with respect to some function

play18:19

of the query Q and the key ki

play18:25

and now there are many functions that we

play18:29

could consider so let me suggest a few

play18:34

the first one could simply be a dot

play18:37

product so

play18:40

just like this okay so the similarity if

play18:55

we think of the query and the key as

play18:57

just embedding vectors right if I want

play19:01

to measure their similarity a simple

play19:03

thing is just to compute their dot

play19:04

product so that's something common a

play19:07

variant on this is going to be a scaled

play19:12

dot product okay and then here D is the

play19:40

dimensionality

play19:47

of each key okay something slightly

play19:56

more general we could also have q

play20:00

transpose W ki so I'll call that a

play20:05

general dot product and then finally

play20:15

let's have wq transpose times q plus

play20:25

wk transpose times ki so this will be an

play20:32

additive similarity okay so this first

play20:46

layer is computing the similarity

play20:48

between some query Q and each keys each

play20:53

key in in our database or in our memory

play20:57

and to do this the many choices some

play21:01

common choices in practice or just to do

play21:04

a dot product or scale the dot product

play21:06

by dividing by the square root of the

play21:08

dimensionality this has the benefit of

play21:11

simply keeping the dot products in in a

play21:15

certain scale now more generally we can

play21:18

also project the query into a new space

play21:21

by using a weight matrix W and then

play21:25

after that taking a dot product by a ki

play21:29

and then another one is just to take the

play21:32

sum of some combination of q

play21:39

and ki and this is known as as some form

play21:42

of additive similarity now you could

play21:45

think of other types of similarity for

play21:47

instance earlier in the course we talked

play21:49

about kernel methods that also measure

play21:51

similarity they do this by essentially

play21:55

mapping two vectors into a new space

play21:58

through some nonlinear function so we

play22:01

could have as well in here some form of

play22:04

kernel similarity to yeah

play22:24

okay so in this case we're not going to

play22:27

have convolutions but here are you

play22:32

suggesting that instead of comparing the

play22:35

query to every key we could just do it

play22:37

with respect to a subset of the keys

play22:49

also the okay this w think of it as more

play22:55

like we're simply transforming our query

play22:58

to be in the same space as the keys so

play23:03

okay to give you a concrete example

play23:05

let's say that we're doing question

play23:07

answering and then we would like to

play23:11

let's say we have a database of possible

play23:14

answers we've got a query and now they

play23:18

say we've got an embedding of every

play23:20

answer that corresponds to a key now the

play23:23

query is a question so we can embed it

play23:27

as well and now the problem is that

play23:29

depending on the type of embedding we

play23:31

compute if those embeddings are simply

play23:35

computing the semantic meaning of these

play23:39

sentences right there's no reason for us

play23:42

to expect that the question and the

play23:45

answer have the same meaning right so in

play23:48

fact they should have different meaning

play23:51

because the answer is supposed to

play23:52

provide something that the question

play23:53

doesn't have right but still then what

play23:57

you can do is map them into a new space

play24:01

where there we could interpret the

play24:03

question and the answer is really being

play24:06

things that we can compare directly and

play24:08

then this matrix W serves that purpose

play24:11

okay so it's just a high-level intuition

play24:15

but the idea is that if you're not

play24:17

confident that you can compute the

play24:21

similarity directly then you can allow

play24:24

your neural network to learn a mapping W

play24:28

so here W is a set of weights and some

play24:31

matrix that will essentially

play24:34

map our query into a new space and then

play24:38

it will just learn what that new space

play24:41

should be okay so that's our first step

play24:50

now the second step or the second layer

play24:53

after this will be to compute the

play24:55

weights so have a 1 a 2 a 3 and a 4

play25:07

those weights

play25:09

they depend on everything ok so here

play25:23

this will be done through a softmax

play25:30

so essentially AI is going to be equal

play25:36

to the exponential of Si divided by the

play25:42

sum over J of the exponential of SJ ok

play25:53

so yeah so here it's a fully connected

play25:56

network but not in the classical sense

play25:59

it's more I'm just showing what hidden

play26:04

nodes are using and computing these

play26:07

weights this is a softmax we don't have

play26:09

any weights all we're doing is just

play26:12

computing this this expression ok and

play26:18

then after this we're going to have our

play26:24

weighted combination

play26:52

so what I've shown here is that we

play26:55

multiply a 1 by v1

play26:58

alright so this means that we multiply

play26:59

them then we add them to the product of

play27:03

a 2 by V 2 we add this to the product of

play27:06

a 3 plus times v3 and so on and then

play27:09

this produces the attention value ok so

play27:13

the attention value is just a sum over I

play27:16

of AI di ok so I guess you see this is a

play27:22

general scheme where we have some query

play27:27

we have some keys and then we're going

play27:29

to produce an output where the output is

play27:32

really a linear combination of some

play27:34

values where the weights come from some

play27:38

notion of similarity between our query

play27:41

and and the keys okay any questions

play27:49

regarding this yeah
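To put the three layers just described (similarity, softmax weights, weighted combination) together, here is a minimal single-query attention sketch, assuming NumPy and a scaled dot-product similarity; the names are illustrative, not code from the lecture:

```python
import numpy as np

def attention(q, keys, values):
    """q: (d,), keys: (n, d), values: (n, d_v) -> attention value of shape (d_v,)."""
    s = keys @ q / np.sqrt(q.shape[0])     # similarities s_i = q . k_i / sqrt(d)
    a = np.exp(s - s.max())
    a = a / a.sum()                        # softmax weights a_i
    return a @ values                      # attention value = sum_i a_i * v_i

# Example: 5 keys/values of dimension 4
q = np.random.randn(4)
K = np.random.randn(5, 4)
V = np.random.randn(5, 4)
out = attention(q, K, V)                   # shape (4,)
```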

play28:00

right so here the W matrix should span

play28:03

the space that we care about now

play28:07

we don't specify W right these are

play28:09

variables that are going to get

play28:11

optimized by the neural network itself

play28:14

so that's the beauty of this right so

play28:16

here again W in general indicates

play28:21

weights that are powers or variables in

play28:24

the neural network and then whatever the

play28:27

task right we're going to do back

play28:28

propagation and then these weights are

play28:31

going to get adjusted or the neural

play28:32

network is going to learn on its own

play28:34

what might be a good space to project Q

play28:38

into so that then we can take a dot

play28:40

product with K yeah oh good question

play28:49

okay yeah the ai's are scalars so I guess

play28:52

okay so the ki's are vectors

play28:57

okay

play28:58

the si's are scalars the ai's are

play29:07

scalars and then the vi's are vectors

play29:30

yeah so here these scalars or our

play29:33

weights and then there's going to be a

play29:35

weight like this for every possible

play29:37

word like if we're doing machine

play29:39

translation and now we're about to

play29:40

produce an output and we want to compute

play29:42

the attention of respect to the input

play29:44

words then we're going to have

play29:47

essentially one one yeah one weight per

play29:53

output word and then the vi's are

play29:56

essentially going to be the hidden

play29:58

vectors associated with each input word

play30:02

okay so in fact this is what I've got

play30:04

here on this slide right so as a

play30:09

concrete example we've talked about

play30:11

machine translation in a previous set of

play30:14

slides and here if I simply use as a

play30:19

query si, the hidden vector for the i-th

play30:23

output word and then for the keys and

play30:26

the values and this particular setting

play30:29

I'm gonna have the same thing so both

play30:31

the keys and the values are going to be

play30:33

the hj's, these are the hidden vectors

play30:36

for the input word that allows me to

play30:38

essentially compare my my hidden vector

play30:44

for an output word to each one of the

play30:47

hidden vectors for an input word and

play30:49

then essentially combine them together

play30:52

to produce a context vector that

play30:55

reflects what are the words that I'm

play30:58

interested in in decoding next or

play31:01

translating next yeah

play31:16

okay great question yeah so we haven't

play31:18

talked about yet what is a transformer

play31:20

all I'm doing so far is just explaining

play31:24

in a general form what is the attention

play31:26

mechanism but I believe it's coming up

play31:29

in a few slides all right so we've

play31:34

discussed all kinds of networks and more

play31:38

recently it's been the focus has been on

play31:40

sequential data we've seen hidden Markov

play31:44

models then the recurrent neural network

play31:46

and now we're talking about transformers

play31:48

so I want to go back and discuss in more

play31:52

detail the transformer network that was

play31:54

presented in 2017 so this network is

play32:01

special because as we discuss it gets

play32:04

rid of recurrence so this was a major

play32:06

thing so recurrences mean that the

play32:11

optimization tends to take longer for

play32:15

two reasons the number of iteration so

play32:17

the number of steps in gradient descent

play32:20

will be higher and also recurrence means

play32:24

that we have several operations that are

play32:26

going to be sequential and then we

play32:29

cannot parallelize them as easily so the

play32:32

beauty of having a GPU is that in

play32:35

principle you can parallelize lots of

play32:37

operations but if these operations are

play32:39

going to be sequential then you cannot

play32:41

write so so we would like to reduce as

play32:44

much as possible recurrence okay so on

play32:47

this slide here we've got a picture of

play32:51

the transformer network that was

play32:53

proposed in 2017 and this network has

play32:57

now displaced pretty much the recurrent

play33:00

neural network for sequential data so

play33:02

this was a major shift in in the

play33:05

thinking that people have now with

play33:07

respect to sequential data okay so if we

play33:10

take the example of machine translation

play33:13

so this network even though it's not

play33:16

obvious has two parts

play33:18

the first part that corresponds to an

play33:20

encoder the second part that corresponds

play33:22

to a decoder and here you see in machine

play33:26

translation you would use the encoder to

play33:29

encode your initial sentence and then

play33:32

you would use the decoder to produce a

play33:34

translated sentence now this will

play33:39

process an entire sentence in parallel

play33:43

as opposed to a recurrent neural network

play33:46

that will essentially processed one word

play33:48

at a time in the sentence so if you look

play33:51

carefully you see the inputs here would

play33:53

actually be the entire sequence of words

play33:56

so we would feed them all in at once and

play33:59

then they would get embedded and then

play34:03

after this we would add a positional

play34:05

encoding I'll come this I'll come back

play34:08

to this later on in a few slides but

play34:10

this is important essentially to make

play34:12

sure that we can distinguish words that

play34:16

occur in different positions within a

play34:19

sentence the problem is that if we don't

play34:22

have a positional encoding then we would

play34:25

effectively have a model here that

play34:27

treats the word as if it was a bag of

play34:31

words as opposed to a sequence and we

play34:34

all know that in languages the sequence

play34:37

actually matters so the ordering matters

play34:39

so we need to still capture this

play34:41

information and then so the positional

play34:44

encoding achieves that ok so now the

play34:48

important part of a transformer network

play34:51

is essentially this block here it

play34:53

consists of two sub parts so does the

play34:57

multi-head attention and then here a

play34:59

feed-forward neural network

play35:01

now the multi-head attention is is

play35:04

really where all the good stuff happens

play35:07

so here the ideas that we feed in again

play35:10

a vector that would consist of sub

play35:14

vectors of all the the words that we

play35:16

have in our sentence and then the

play35:19

multi-head attention is going to compute

play35:22

the attention between every position and

play35:25

every other position so we have vectors

play35:28

that embed the words in

play35:31

one of those positions and now we simply

play35:34

carry out an attention computation that

play35:38

will essentially treat each word as a

play35:40

query and then find some keys that

play35:44

correspond to the other words in the

play35:46

sentence and then take a convex

play35:49

combination of the corresponding value

play35:52

so here the values are going to be the

play35:55

same as the keys and then take a dot

play35:57

product of that to produce a better

play36:00

embedding so the idea is that this

play36:02

multi-head attention will essentially

play36:05

take every word combine it with some of

play36:09

the other words through the attention

play36:11

mechanism to essentially produce a

play36:14

better embedding that that merges

play36:17

together information from pairs of words

play36:20

now when we do this in one block we

play36:23

essentially look at pairs of words

play36:25

together but now if we repeat this

play36:28

multiple times so this n times here

play36:32

means that we're going to have this

play36:34

block that's going to be repeated n

play36:36

times we're going to have n stacks of

play36:38

those blocks and now you see in the

play36:41

first block we look at pairs and the

play36:43

second block we're looking at pairs of

play36:45

pairs and then the third block we're

play36:47

going to look at pairs of pairs of tears

play36:49

so essentially we're combining now more

play36:52

than just two words but groups of words

play36:55

that that get larger and larger and

play36:58

larger okay so that's what the

play37:02

multi-head attention does then we have

play37:06

on top here another layer and that's

play37:09

called Add & Norm this is essentially

play37:12

adding a residual connection that takes

play37:15

the original input to what comes out of

play37:18

the multi-head attention and then it

play37:21

normalizes this so here nor is

play37:24

essentially a layer normalization we'll

play37:27

come back to this in a second but it

play37:29

essentially means that we we take all of

play37:32

our entries and then we normalize them

play37:35

to have zero mean as well as variance

play37:38

while then we feed this into a

play37:41

feed-forward Network there's again a

play37:42

residual connection

play37:44

and then a normalization so this block

play37:47

is repeated n times so that then we can

play37:50

combine not just pairs of words but

play37:53

pairs of pairs of words and and so on so

play37:56

that eventually you see you can combine

play37:57

together all the words in in the

play38:00

sentence

play38:02

okay so the output of this is going to

play38:05

be again a sequence of embeddings

play38:09

there's going to be one embedding per

play38:12

position and intuitively the embedding in

play38:14

that position captures the original word

play38:17

at that position but also information

play38:19

from the other words that it attended to

play38:23

throughout the network okay so you can

play38:25

think of this is essentially just a

play38:27

large embedding of all those words

play38:31

corresponding to each position so that's

play38:33

our encoding of the input sentence then

play38:38

after this we have the decoder which

play38:40

will do something similar but obviously

play38:42

the main purpose of the decoder is to

play38:45

produce some output not just to embed

play38:47

but produce some output so that's what

play38:49

we're gonna have some additional stuff

play38:51

on top here where we have a softmax that

play38:54

produces some probabilities for let's

play38:58

see outputting a label in each position

play39:02

okay now inside the block so this this

play39:06

block will also repeat n times what we

play39:09

do is we have first a multi-head

play39:13

attention that looks at simply combining

play39:16

output words with previous output words

play39:19

and then there's another block of

play39:21

multi-head attention that now combines

play39:23

output words with input words and then

play39:27

finally a feed-forward Network again

play39:30

okay so here you see we have two layers

play39:33

of attention the first layer is really

play39:37

just self attention between the output

play39:40

words so now the problem though with

play39:44

output words is that when you generate a

play39:47

sequence as output you can only generate

play39:50

the next word based on the previous

play39:53

words so when you do the attention you

play39:57

need to be careful

play39:57

to make sure that you only attend to

play40:00

previous words and that's why this one

play40:03

is called a masked multi-head attention

play40:06

because we can mask the future words so

play40:09

that each word is only attending to the

play40:12

previous words the second multi-head

play40:16

attention here is now combining or is is

play40:20

I guess that making sure that each

play40:23

position in the output is attending to

play40:25

positions in the input so this is where

play40:28

a bit like in machine translation

play40:30

whenever you want to produce an output

play40:32

and it's good if you can kind of peek

play40:35

and look back at what was your input

play40:37

sentence and here we're gonna look at

play40:39

the embeddings of each position in the

play40:42

input so that's why you see we've got

play40:45

these arrows that go in okay and this

play40:50

will be repeated n times again so that

play40:52

the ideas that we gradually build up

play40:56

combinations and then get better and

play40:59

better embeddings

play41:00

until we produce an output and here the

play41:03

output can be a distribution over the

play41:06

words in the dictionary alright so for

play41:08

every position there is a word that

play41:10

we're trying to generate and then we're

play41:13

gonna compute some distribution over the

play41:15

words and in the dictionary any

play41:18

questions regarding this slide okay

play41:25

good alright so in the transformer

play41:30

Network perhaps the most important part

play41:32

is this multi-head attention so I'm

play41:36

gonna draw on the board what this

play41:39

corresponds to now mathematically the

play41:42

multi-head attention is essentially this

play41:45

expression that decomposes according to

play41:49

these operations okay so for the

play42:13

multi-head attention as we talked about

play42:15

last class whenever we want to design an

play42:21

attention mechanism the general way of

play42:23

thinking about it is that we have some

play42:26

key value pairs just like in a database

play42:28

and then there's a query that we're

play42:33

going to compare to each keys and and

play42:37

then the keys that have the greatest

play42:39

similarity are going to have the highest

play42:41

weights and now we can take a weighted

play42:44

combination of the corresponding values

play42:46

to produce the output so we're going to

play42:49

feed V key and Q into some linear layer

play43:11

then after this we're going to compute a

play43:16

scaled dot-product attention

play43:36

then we're going to concatenate these

play43:43

outputs then we'll have another linear

play43:48

layer and then the output of this is

play43:53

going to be our multi-head attention

play44:01

okay now this is called multi-head

play44:04

because in a reality and I haven't drawn

play44:07

this yet on the board we're going to

play44:09

compute multiple attentions so here when

play44:14

we take a linear combination perhaps

play44:16

there's different we can think of this

play44:18

linear combination as really being a

play44:20

projection of the values V same thing

play44:23

for K same thing for Q and we could

play44:25

consider several projections so here I'm

play44:29

gonna use this to indicate that I could

play44:34

compute three different types of

play44:36

projections by kinda looking at three

play44:39

different linear combinations of the

play44:40

values same thing for K

play44:45

same thing for Q so that I get

play44:52

essentially three different projection

play44:55

now each one of them I can now compute a

play44:59

scaled dot-product attention so I will

play45:01

get three scaled dot-product attentions

play45:08

and the way to think about these

play45:11

different linear combination these

play45:13

different scaled dot-product attentions it's

play45:16

a bit like feature maps and

play45:18

convolutional neural network so we saw

play45:20

the in convolutional neural network you

play45:22

can compute multiple feature Maps simply

play45:26

by having different filters so here

play45:28

these linear combination are a bit like

play45:31

different filters although in this case

play45:33

you can think of them as more like

play45:35

projecting or simply changing the space

play45:39

in which the values reside okay so so

play45:43

here this will give us different

play45:45

projection on different spaces a bit

play45:47

like multiple filters into a

play45:50

convolutional neural net and then when

play45:52

we compute the scaled dot-product

play45:54

attention then for each projection is

play45:57

going to be a different one different

play45:59

scaled dot-product attention so we're

play46:02

going to get multiple of them and that

play46:04

corresponds more or less to having

play46:05

multiple feature Maps that's the same

play46:08

intuition so then this concat layer is

play46:12

going to concatenate these different

play46:15

scaled dot-product attentions and

play46:18

finally we take a linear combination of

play46:21

them and this gives us a multi-head

play46:23

attention because we essentially

play46:26

computed multiple attentions so here

play46:29

there's three of them so we can think of

play46:32

these as H so this would be the number

play46:36

of heads and multi-head attention okay

play46:42

so the idea is that there's one head per

play46:47

linear combination so here there's three

play46:51

of them and in general there's going to

play46:54

be h of them and that's where the idea

play46:56

of the name multi-head comes from

play47:02

okay any questions regarding this good

play47:08

ok
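A rough sketch of the multi-head block just described (per-head linear projections of V, K, and Q, h scaled dot-product attentions, a concat, and a final linear layer), assuming NumPy with randomly initialized projection matrices for illustration only:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k), V: (seq_len, d_v); weights = softmax(Q K^T / sqrt(d_k))
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def multi_head_attention(Q, K, V, n_heads, rng):
    d_model = Q.shape[-1]
    d_k = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        # one linear projection per head, playing the role of one "filter"
        Wq = rng.normal(size=(d_model, d_k))
        Wk = rng.normal(size=(d_model, d_k))
        Wv = rng.normal(size=(d_model, d_k))
        heads.append(scaled_dot_product_attention(Q @ Wq, K @ Wk, V @ Wv))
    concat = np.concatenate(heads, axis=-1)           # concatenate the h heads
    Wo = rng.normal(size=(n_heads * d_k, d_model))    # final linear layer
    return concat @ Wo

# Self-attention on a toy "sentence" of 6 positions with d_model = 8 and 2 heads
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
out = multi_head_attention(X, X, X, n_heads=2, rng=rng)   # shape (6, 8)
```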

play47:16

all right so besides just a regular

play47:19

multi-head attention we also have in the

play47:21

decoder a mask multi-head attention and

play47:25

here the idea behind the mask multi-head

play47:28

attention is that some of the values

play47:30

should be masked meaning that the

play47:32

probabilities should be nullified so

play47:35

that we don't create any combinations so

play47:39

for instance in the decoder when we

play47:42

produce the output let's say we're doing

play47:44

machine translation so we have a

play47:46

sentence let's say in English then we're

play47:48

translating that into French we start

play47:51

producing the words in French when we

play47:53

produce a word right then it's okay for

play47:56

that word to depend on the previous

play47:59

words in the translation because we're

play48:00

generating them sequentially but it

play48:03

doesn't make sense for that word to

play48:05

depend on the future words because we

play48:08

haven't produced them yet right so so

play48:11

here what we need to do is to

play48:13

essentially change our attention

play48:16

mechanism so that we we would nullify or

play48:19

effectively remove links that would

play48:22

create dependencies on words that we

play48:24

haven't generated yet and so this is

play48:27

what we call a mask multi-head attention

play48:30

okay so here the main difference is that

play48:35

in the attention mechanism normally we

play48:38

just compute a softmax according to this

play48:40

expression but now instead with a mask

play48:44

attention we're going to add a mask here

play48:48

that will effectively produce some

play48:50

probabilities that are zero for the the

play48:54

terms that we don't want to attend to

play48:56

because it would be future terms so here

play48:59

you see in a soft max what I normally do

play49:03

is take the exponential divided by the

play49:06

sum of exponential so now if I add a

play49:09

mask which is a matrix of zeros and

play49:13

minus infinity so adding minus infinity

play49:16

when I take the exponential of minus

play49:19

infinity it gives me zero so this has

play49:22

the effect of ensuring that the

play49:24

probabilities of certain items here are

play49:28

going to be zero because I'll take the

play49:30

exponential of minus infinity and this

play49:34

will have the same effect as removing

play49:36

connections so at some level you can

play49:39

think of this as a form of dropout but

play49:42

here it's not a dropout in the same

play49:45

sense that we saw at the beginning of

play49:47

the course

play49:48

right so dropout for regularization you

play49:51

would drop out some connections or

play49:53

remove some connections at random

play49:55

according to some distribution here we

play49:59

remove connections that are essentially

play50:02

pointing at words that we haven't

play50:05

produced yet so this is more like a

play50:07

deterministic type of dropout if you

play50:10

wish because we we would never have

play50:12

those connections so here as opposed to

play50:16

sampling those connections with a

play50:18

distribution you see we use a mask and

play50:21

then inside the softmax the mask will

play50:26

have the effect of nullifying some of

play50:29

the connections because the exponential

play50:31

of minus infinity will be zero any

play50:37

questions regarding this yeah

play50:52

yeah okay good question so yeah here in

play50:56

the paper they add a mask with values

play51:00

that are minus infinity now perhaps the

play51:02

more intuitive approach would simply be

play51:05

to multiply the softmax by some sort of

play51:09

Hadamard product with values that are 0

play51:13

and 1 now when we do it outside with

play51:16

values that are zero in one what happens

play51:18

is that the softmax produces a

play51:20

distribution that adds up to 1 now we're

play51:23

going to nullify some of those

play51:24

probabilities and then the the some of

play51:27

the properties that are left is not

play51:29

gonna add up to 1 whereas if we do it

play51:32

inside right by adding a matrix that

play51:36

might have values that are minus

play51:38

infinity it means that when we take the

play51:40

softmax these are gonna have zero

play51:43

probability but then all the other

play51:45

values are gonna have probabilities that

play51:48

still sum up to one so this ensures that

play51:50

we still have a proper distribution
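A minimal sketch of this masking trick, assuming NumPy: adding minus infinity above the diagonal before the softmax zeroes out attention to future positions, while the remaining probabilities still sum to one.

```python
import numpy as np

def masked_softmax(scores):
    """scores: (T, T) attention scores for T output positions."""
    T = scores.shape[0]
    mask = np.triu(np.full((T, T), -np.inf), k=1)   # -inf above the diagonal, 0 elsewhere
    masked = scores + mask                          # future positions get score -inf
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)        # exp(-inf) = 0, each row still sums to 1

probs = masked_softmax(np.random.randn(4, 4))
print(np.round(probs, 2))   # row i has zero probability on columns j > i
```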

play51:59

okay and if we just go back to this

play52:03

slide you see what happens is that when

play52:08

we produce an output let's say I produce

play52:11

my first word here then that first word

play52:16

could be fed as input for the next

play52:19

position right so when I want to produce

play52:22

a word for a certain position it's okay

play52:27

to look at the previous words and this

play52:31

is where the masked multi-head attention

play52:34

will will apply so we're gonna have a

play52:36

mask it's a matrix that will essentially

play52:39

be lower triangular that has essentially

play52:43

values that are zero in the lower

play52:45

triangular part and minus infinity in

play52:48

the upper triangular part to essentially

play52:51

nullify everything that happens in the

play52:54

future and the other thing that might

play52:57

not be obvious is that it looks like

play53:00

here when we've got some output it gets

play53:03

fed back in as as input here and that

play53:07

looks like it's creating a recurrence so

play53:10

here this is not creating a recurrence

play53:14

per se simply because there is a method

play53:18

for training known as teacher forcing

play53:23

where the idea is that when you train

play53:27

the network you have both what is the

play53:30

input sentence and the output sentence

play53:32

and then you can simply say well let me

play53:35

assume that my outputs are correct

play53:37

everywhere so I'm going to feed that as

play53:39

input here so I'm going to feed in what

play53:42

are the correct output words for the

play53:44

previous positions and then I'll simply

play53:48

try to predict what is the next word

play53:50

based on that okay so with this scheme

play53:53

that is known as teacher forcing then

play53:56

you can essentially decouple the output

play54:00

here from the input here and we don't

play54:03

have any recurrence relation in training

play54:07

now at test time then what happens that

play54:11

you really have to execute this network

play54:13

with the recurrence relation but that's

play54:16

okay so where there's really again is a

play54:19

training time so training is what takes

play54:22

a long time and if we can remove all

play54:25

recurrence relations so that we can do

play54:28

all the computation in parallel this

play54:30

will be a lot faster and then through

play54:33

this teacher forcing trick then what we

play54:36

do is we simply assume that we have the

play54:38

correct output words for the previous

play54:42

words and then we feed them in here as

play54:44

if they were given to us and then we

play54:47

simply try to predict what is the next

play54:49

output any questions regarding this okay
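As a tiny illustration of teacher forcing (illustrative token ids, not the lecture's code), the decoder is trained on the ground-truth output sequence shifted by one position:

```python
# Ground-truth target sentence as token ids, with <bos>=1 and <eos>=2 markers
target = [1, 57, 13, 8, 42, 2]          # <bos> w1 w2 w3 w4 <eos>

decoder_input  = target[:-1]            # [<bos>, w1, w2, w3, w4]   fed into the decoder
decoder_labels = target[1:]             # [w1, w2, w3, w4, <eos>]   what it must predict

# At training time every position is predicted in parallel from decoder_input
# (with the causal mask); at test time the model instead feeds back its own
# previous predictions one step at a time.
```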

play54:57

let's continue okay so the other

play55:04

important layers are the normalization

play55:09

layer and also the positional embedding

play55:12

so the normalization layer is actually

play55:14

quite important and quite interesting

play55:17

so it's this layer that we saw right

play55:21

here so on top of every multi-head

play55:24

attention and feed-forward network

play55:26

there's here a normalization layer so

play55:30

what this does is that it helps to

play55:32

essentially reduce the number of steps

play55:35

needed by gradient descent to optimize

play55:38

the network here whenever we've got a

play55:41

network with multiple layers we've got

play55:43

weights in each layer and then those

play55:46

weights are going to be trained by

play55:48

gradient descent but when you look at

play55:50

the formula for the gradient it's it's

play55:54

often that the case that to compute the

play55:57

gradient of one set of weights then it

play56:00

depends on the output of the layers

play56:03

below and also what is being computed in

play56:06

the layers above right so depends on on

play56:08

what is being computed below and above

play56:11

now the problem is that if we're still

play56:13

adjusting the weights but below and

play56:16

above now when we compute the gradient

play56:19

then things are not going to be stable

play56:22

at some level it's like we'd rather

play56:24

wait till all these

play56:26

layers have stabilized and then we can

play56:28

optimize the gradient in the middle

play56:31

properly the problems we can't do this

play56:34

because we have to optimize essentially

play56:36

all of those layers simultaneously and

play56:39

then there's this effect where you see

play56:41

you change some weights that affects the

play56:44

other layers then you change those

play56:46

weights that affects the layer that you

play56:48

just changed and and so on and then so

play56:51

it it makes the the convergence quite

play56:54

slow because we've got all these inter

play56:56

dependencies now there's no way of

play56:59

completely getting rid of the inter

play57:00

dependencies because if we did that

play57:03

would mean that we're essentially

play57:04

breaking our network into parts that are

play57:07

not connected anymore but one thing we

play57:09

can do is to do some normalization when

play57:13

we normalize what this does is that it

play57:16

ensures that the output of that layer

play57:19

regardless of how we set the weights are

play57:22

going to be normalized they're going to

play57:24

have a mean of 0 and a variance of 1 so

play57:27

the scale of these outputs is going to

play57:31

be the same ok so so now to obtain the

play57:38

same scale what we can do is you see for

play57:41

each hidden unit we can subtract from it

play57:45

the mean so the mean would be just the

play57:48

average the empirical average like this

play57:50

and then we can divide by the standard

play57:52

deviation which is here the square root

play57:56

of the empirical variance and then

play57:59

there's also a variable G which is known

play58:02

as the gain that's added to essentially

play58:04

compensate for the fact that we've just

play58:06

normalized but the idea is that with

play58:10

this approach then we can ensure that

play58:12

you see if G is is set to 1 then this

play58:17

would ensure that H is always normalized

play58:21

with zero mean and variance 1 and

play58:24

therefore you see if there's some

play58:27

gradient competition that depends on the

play58:29

output of that layer the outputs of that

play58:32

layer are always going to be the same

play58:34

scale they're going to vary but these

play58:36

are going to remain on the same scale

play58:38

and then

play58:39

as a result the other gradients when we

play58:41

compute them they don't have to adjust

play58:44

simply because we were changing the

play58:47

scale of those outputs so that reduces

play58:51

the dependencies between the layers and

play58:55

it tends to make the the convergence

play58:58

faster okay any questions regarding

play59:04

normalization okay

play59:10
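A minimal sketch of the layer-normalization computation described above (subtract the empirical mean, divide by the empirical standard deviation, rescale by a learned gain g); the small `eps` term is a standard numerical-stability addition that is not mentioned in the lecture.

```python
import torch

def layer_norm(h, gain, eps=1e-5):
    """Normalize the hidden vector of one layer for a single example,
    then rescale by the learned gain g."""
    mu = h.mean(dim=-1, keepdim=True)                            # empirical mean
    sigma = h.var(dim=-1, unbiased=False, keepdim=True).sqrt()   # empirical std
    return gain * (h - mu) / (sigma + eps)
```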

Perhaps one thing I should say as well: some of you might have heard about batch normalization. This is closely related to batch normalization, but the main difference is that we're doing the normalization at the level of a layer, whereas batch normalization would do it for one hidden unit, by normalizing across a batch of inputs. The advantage of layer normalization is that we don't need to worry about how large our batch is. Batch normalization only works well if you have fairly large batches, whereas here we can feed in one data point at a time. We can have mini-batches that are very small; in fact we can be in an online or streaming setting where we just feed in one data point at a time, and we can still do the normalization. It still has the same effect as batch normalization in terms of decoupling how the gradients evolve in different layers.
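To make the contrast concrete, here is a small sketch (the sizes are arbitrary) showing which axis the statistics are computed over in each case.

```python
import torch

x = torch.randn(8, 16)                  # (batch of 8 examples, 16 hidden units)

# Layer normalization: one mean/variance per example, computed across the
# hidden units, so it works even with batch size 1 or in a streaming setting.
layer_mean = x.mean(dim=1, keepdim=True)     # shape (8, 1)

# Batch normalization: one mean/variance per hidden unit, computed across the
# batch, so it needs a reasonably large batch to estimate the statistics well.
batch_mean = x.mean(dim=0, keepdim=True)     # shape (1, 16)
```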

Okay, so the other part that is important is the positional embedding. If I just go back, we introduced a positional embedding right after the input embedding. The idea is that the attention mechanism doesn't care about the position of the words: the words could be all shuffled, we could consider them as a bag of words, and if it wasn't for the positional embedding we would get the same answer. At some level that's not good, because in sentences the ordering of the words is important in order to tell us the meaning; the ordering does carry some meaning, so we still need to capture some of that information.

Here, this is really an engineering hack. It's not clear that this is really the best way to capture the ordering, but the idea is that we already have an embedding that is supposed to capture information about each word, so let's just make that embedding capture information about the word and also its position. We're simply going to add a vector, known as the positional encoding, and that vector is going to be different depending on the position. It's just a vector that embeds the position, which is an integer, and then we add it to the embedding of the word.

What's the precise formula for the positional embedding? It's given here. The idea is that we have a position, which is an integer, and we embed it into a vector with multiple entries, where each entry is computed as either the sine of the position divided by 10,000 to the power 2i/d, or the cosine of the position divided by 10,000 to the power 2i/d. Just to illustrate, let me draw something on the board.

We have the position, and we're going to compute from it this positional embedding: the position is a scalar and the embedding is a vector. The idea is that we already have an embedding for the word, and that embedding is a vector; now we want to encode the position as well. Obviously there are multiple ways in which we could do this. The simplest could be to just take an integer for the position and append or concatenate it with the embedding of the word, but in this work the authors simply chose to add a vector, and it's going to be a vector of the same dimensionality as the embedding of the word (often these vectors are hundreds of entries long). Now, how do we go from a scalar to a vector? This is where the formula gives us a way to obtain a different value for each entry of the vector: for the even entries we compute the sine of this expression, and for the odd entries we compute the cosine. This is something that's debatable; we could consider different ways of coming up with a positional embedding, but the key is that it carries information about the position, which allows us to distinguish each word, so that our sentence still retains ordering information. Yeah?
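A minimal sketch of the sinusoidal positional encoding just described, following the formula from the 2017 paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); it assumes an even embedding dimension d_model.

```python
import torch

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding (d_model assumed even):
    even entries get sin(pos / 10000^(2i/d_model)),
    odd entries get cos(pos / 10000^(2i/d_model))."""
    pe = torch.zeros(max_len, d_model)
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # (max_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float)        # 0, 2, 4, ...
    angle = pos / (10000.0 ** (two_i / d_model))                  # (max_len, d_model/2)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

# The encoding is simply added to the word embeddings of the same dimensionality:
# x = word_embedding + positional_encoding(seq_len, d_model)
```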

[Student question, partly inaudible.] Yeah, very good point. We are just adding the positional embedding to the word embedding, which could maybe affect the information that is included in the word embedding. My gut feeling as well is that it might just be better to concatenate, so that we don't lose the information, but in any case that's what the authors chose to do, and it seems to work relatively fine. Okay.

All right, so if we now compare a transformer network with a recurrent network or a convolutional neural network, we get the following complexity estimates. Here the transformer network is the one called self-attention. In a self-attention network, each layer consists of n positions: if we've got a sentence of size n, we're going to have n positions, and when we compute an embedding for each position it has dimensionality d. The complexity of the computation in one layer is on the order of n²·d, simply because every position attends to every other position, so all the pairs gives n², and for each pair the computation involves embeddings of dimensionality d.

Now, the benefit is that if we want to capture long-range dependencies, the maximal path length between any two positions is just one, simply because the attention mechanism combines, in one operation, essentially every pair of words. Information can flow between pairs of words in one step, so we don't have to worry about information being lost like in a recurrent neural network, where the first word we process gets embedded, but then, as we process additional words, this embedding changes and eventually loses information about the first word. Here we can combine information about pairs of words immediately, so a path length of one is great.

The other important aspect is that there are no sequential operations: we have a sentence of size n, but we essentially process all of the words simultaneously, looking at every pair of words at once. That's where the n² creeps in, but on the other hand all of this can be parallelized. Today, with GPUs, we want to exploit parallelization, and it's better not to process the words sequentially but to do everything in parallel. So even though we have a factor of n², in practice this n² might not be so bad, simply because we're going to do a lot of those operations in parallel.

In contrast, a recurrent neural network has the following complexity. The way to think about a recurrent neural network having multiple layers is that normally we think of a step for every word we process, but another way of thinking about it is that you could have stacks of recurrent neural networks. We haven't talked about this, but it's actually something common that people do in practice, and it makes the network even more complex and heavier to train. If you have stacks of recurrent neural networks, then the computation in just one layer of one stack is n·d². The d² comes from the fact that we have an embedding of size d, and whether you consider a GRU, an LSTM, or even just a linear unit that produces the next embedding of size d, you typically have a d-by-d matrix of weights that multiplies the hidden vector to get the next hidden vector; that's how the d² shows up. We have n sequential operations because we have to go through the entire sequence, and the path length can be up to size n, because if you want to combine information from the first and the last word, you've got a long path to go.

So in general this will be quite advantageous: it helps to reduce computation and improves scalability quite a bit. Any questions regarding this? Okay.
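To see why the n² pairwise work is still fast in practice, here is a tiny sketch (with arbitrary sizes) showing that all n² attention scores come out of a single matrix product, so every pair is handled in parallel and any two positions are connected in one step.

```python
import torch

n, d = 10, 64                        # sequence length and embedding size (arbitrary)
Q, K, V = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)

scores = Q @ K.T / d ** 0.5          # (n, n): all n^2 pair scores in one matmul
weights = torch.softmax(scores, dim=-1)
output = weights @ V                 # (n, d): each position attends to every other in one step
```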

Now, in the 2017 paper on transformers, there was a comparison done for machine translation: they did translation between English and German as well as English and French, and they compared a bunch of models. Their models are down here: a base model and a bigger transformer. If you look at the results, they're not really outstanding; at least in this case, for English-German, they improve the accuracy a little bit. Here BLEU, if you recall, is a measure of precision where, roughly speaking, you look at the percentage of words in the output translation that are part of some human translation, so the higher the score, the better. So they outperform a little bit here and came close to the state of the art.

But what's beautiful is that they reduce the computation significantly. Those numbers look horrible, because when you see 10^18 that's scary, and 10^19 is worse, then 10^20, 10^21; but the difference between 10^18 and 10^19 is a factor of 10, so something that would take 10 days might take one day. With respect to 10^20 that's a reduction by a factor of a hundred, and for 10^21 that's a reduction by a factor of a thousand. So this is a major reduction in terms of training time. Here I'm not sure, I don't recall from the paper whether that takes into account parallelism or not, but in any case this gives you a sense that a really big advantage is that it reduces computation while still achieving the state of the art. Okay, any questions regarding this? Good, yeah?

Okay, so that's a good question: yes, the training cost is different for different languages. It might have to do with how much data they use for training, but then presumably that would have an effect here too. I'm not sure, because presumably there should be a difference here too; we'll have to look it up in the paper.

Okay, so the transformer was essentially the starting point of a new class of neural networks that do not rely on recurrence. Another important type of transformer is known as GPT, with an improved version known as GPT-2; these were proposed in 2018. The idea when they were proposed is that they did unsupervised language modeling. Language modeling is a general task where you say: I've got a sequence of tokens, a sequence of words, and I'm simply going to predict what the next word is. It turns out that a lot of tasks in natural language processing can be formulated as some form of language modeling. If you take machine translation and concatenate the input sentence and the output sentence into one long sequence, and now let's say you've got a language model that simply predicts the next word in the sequence, and furthermore this language model doesn't care whether it's English or French or any other language, it just predicts the next word, then you can train it to essentially do translation: you feed it the input and it will predict what the next words are in the other language. A lot of tasks can be formulated this way, where you just create a sequence, and then, by virtue of the model predicting the next thing, if the next thing is what you care about, maybe a classification, maybe another sequence of words, then you can do all of those tasks with a language model.

So here they did something really interesting: they trained a decoder-only transformer. Because they were only going to predict the next word given the previous words, there was no need to separate an input sequence from an output sequence, so they did not need the encoder part. In the transformer architecture they actually got rid of the encoder and worked only with the decoder: the decoder attends to the previous outputs, which can be considered the input, and it never attends to future outputs, so it can just generate sequentially. That's the main thing compared to the transformer network: they essentially just got rid of the encoder.

The other thing they did is what they call zero-shot learning. It means they simply took a very large corpus, trained on it to predict the next word in the sequence irrespective of the task, and then applied this to different tasks for which the network was not tailored or fine-tuned; it was just trained, generally speaking, to predict the next word in the sequence. They did this for tasks corresponding to reading comprehension, translation, summarization, and question answering, and we can see their performance in blue. The performance improves as we increase the number of parameters of the language model, and they compare against state-of-the-art techniques. Their approach was general, not trained specifically for a particular task, whereas in this case PGNet, DrQA, DrQA + PGNet and so on are trained specifically for that task. What's beautiful is that in a completely unsupervised fashion, without really being tailored to the task, they managed to come close to the state of the art, and the model is fairly general: it can be used for many tasks. You can see from the results that it doesn't beat the state of the art on most of those tasks, but it does beat at least some techniques that were tailored, and those curves suggest that further improvements to the model would lead to further gains. Okay, any questions regarding GPT? Okay, so let's continue.
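A minimal sketch of the decoder-only setup described above: a causal (lower-triangular) mask keeps each position from attending to future positions, so the model can only predict the next token from the previous ones. The `decoder_only_model` here is a hypothetical placeholder, not the actual GPT code.

```python
import torch

def causal_mask(n):
    """Lower-triangular mask: position t may attend only to positions <= t."""
    return torch.tril(torch.ones(n, n)).bool()

def next_token_logits(decoder_only_model, token_ids):
    """Score the next-token prediction at every position in one parallel pass."""
    n = token_ids.size(1)
    return decoder_only_model(token_ids, attn_mask=causal_mask(n))  # (batch, n, vocab)
```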

Now, GPT was not the last one. There's another one called BERT that has become quite popular, and it was proposed this year. BERT stands for Bidirectional Encoder Representations from Transformers, so it's another variant of the transformer network. The main advance being proposed here is that instead of just predicting the next word in the sequence, why don't we predict a word based on the previous words and also the future words? There are lots of tasks, including machine translation, where, if you think about it, when you're given a sentence as input there's no reason why you have to produce the output sentence sequentially, one word at a time. You could work on your translation by coming up with some sections of it and gradually building it up; you don't have to do it perfectly sequentially. A lot of tasks are actually like that, and what it means is that you can take advantage of what comes before and what comes after. So it's a bit like bidirectional recurrent neural networks, which improve on unidirectional recurrent neural networks; here it's a bidirectional transformer, and naturally it does better than GPT.

They tried it on a bunch of tasks, in fact eleven tasks. What they did is unsupervised pre-training, just like GPT, but then, to really compete with the state of the art, they did some further fine-tuning with data specific to each task. So the proposal is: first train a general network, unsupervised, with lots of data, and then fine-tune the parameters by doing some further training with data specific to the task. When they did that, they obtained those results, and this is quite impressive because they improved the state of the art on eleven tasks; if you look at some of those tasks, like this one here, they improved the state of the art from 45 to 60, which is a major improvement. Okay, any questions regarding BERT?
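The lecture doesn't go into BERT's exact training objective; in the paper the bidirectional prediction is implemented as a masked language model, and a minimal sketch of that idea (hide a random subset of tokens and predict them from both the left and right context) could look like this, with `mask_id` and the masking rate as assumed placeholders.

```python
import torch

def mask_tokens(token_ids, mask_id, mask_prob=0.15):
    """Replace a random subset of tokens with a special mask token; the model
    is then trained to recover them using context on both sides."""
    mask = torch.rand(token_ids.shape) < mask_prob
    corrupted = token_ids.clone()
    corrupted[mask] = mask_id
    return corrupted, mask        # predict token_ids[mask] from corrupted
```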

All right, so BERT is again not the last network. There is another network that was just made public about a month ago, called XLNet, and XLNet beats BERT as well. I don't have a slide for it, but I can tell you, roughly speaking, that the main difference is that BERT essentially assumes we've got everything in the window before and after, whereas XLNet allows missing inputs, in a sense, and tends to generalize better by looking at different subsets of words before and after. As a result it tends to generalize better, and it improves again on a lot of tasks. I don't remember the exact number of tasks, but in general it beats BERT across the board for most tasks.

Okay, so this has been a fruitful direction, and it has become quite clear now that those transformer networks can perform quite well, both in terms of accuracy and in terms of speed, and it becomes questionable what the future of recurrent neural networks will be. Any questions regarding this? All right, okay, so this concludes this set of slides.
