CS480/680 Lecture 19: Attention and Transformer Networks
Summary
TLDR: This lecture introduces the attention mechanism and its applications in machine translation and natural language processing, in particular the rise of Transformer networks. Since their introduction in 2017, Transformer networks have used attention in place of traditional recurrent neural networks, addressing the long-range dependency problem and speeding up training. The lecture also covers key techniques such as multi-head attention, positional encoding, and layer normalization, and compares the performance of different models on machine translation. It further discusses Transformer-based derivatives such as GPT, BERT, and XLNet, showing their strong performance on many NLP tasks and the challenge they pose to the future of recurrent neural networks.
Takeaways
- 📘 The attention mechanism has already been discussed in the context of machine translation, and it has led to a new class of neural networks: Transformer networks.
- 📙 'Attention is all you need', a paper published in 2017, put forward the view that recurrent neural networks (RNNs) may no longer be necessary.
- 🔍 Attention was first studied in computer vision, where it helps identify specific objects in an image by mimicking how human visual attention focuses.
- 🏢 In natural language processing (NLP), attention lets the decoder look back at the input sentence, allowing it to handle sentences of arbitrary length without memorizing the whole sentence.
- 🚀 By using attention, Transformer networks can process an entire sequence at once, addressing the challenges recurrent networks face with long-range dependencies and vanishing or exploding gradients.
- 🔄 The Transformer architecture consists of an encoder and a decoder: the encoder processes the input sequence and the decoder generates the output sequence.
- 🌟 Multi-head attention is the core of the Transformer, allowing the network to attend to different parts of the sequence simultaneously.
- 📊 Layer normalization reduces the number of gradient-descent steps needed for training, speeding up optimization.
- 📈 Positional encoding preserves word-order information, which is essential for understanding the meaning of a sentence.
- 📉 Compared with recurrent networks, Transformers have higher per-layer computational complexity, but parallel computation greatly reduces training time.
- 🔑 Transformer networks and their variants (GPT, BERT, XLNet) show superior performance on many NLP tasks, challenging the future of recurrent neural networks.
Q & A
What is the attention mechanism?
-The attention mechanism is a resource-allocation strategy that focuses processing on the most important parts of the information. In a neural network it lets the model selectively attend to some parts of sequential data while ignoring others, which is particularly important in areas such as machine translation and image recognition.
How did Transformer networks come about?
-The Transformer is a neural network architecture proposed in 2017. It is built on the attention mechanism and does not rely on recurrent neural network (RNN) structures. It can process an entire sequence in parallel, greatly improving efficiency during training and inference.
Why can Transformer networks do without recurrent building blocks?
-The core of the Transformer is the attention mechanism, which can capture long-range dependencies in sequential data and process the whole sequence in parallel. This makes the recurrence used in traditional recurrent networks unnecessary.
How does attention help with object recognition in computer vision?
-In computer vision, attention can focus on the key regions of an image that matter for recognizing an object. By training the network, a heat map can be produced that highlights these regions, helping the model recognize and localize objects more accurately.
How do Transformer networks handle long-range dependencies?
-Transformers handle long-range dependencies through attention. At any given position the model can simultaneously consider all other positions in the sequence, so information from the entire sequence is captured effectively.
What is multi-head attention?
-Multi-head attention is a key component of the Transformer that performs several attention operations in parallel in different representation subspaces. This helps the model capture information about the sequence from different perspectives and increases its expressive power.
Why do Transformers train faster than recurrent neural networks?
-Transformers process the whole sequence in parallel, whereas recurrent networks must process it step by step in order. This parallelism lets Transformers make better use of modern hardware such as GPUs, speeding up training.
What is the purpose of masked multi-head attention in the Transformer?
-Masked multi-head attention ensures that when generating a sequence, each word can only depend on the words before it, not on future words. The masking prevents improper dependencies from forming during sequence generation.
What is positional encoding?
-Positional encoding is a technique used in Transformers to give the model information about word order. Since the attention mechanism itself contains no information about a word's position in the sequence, positional encoding addresses this by adding vectors with a specific pattern to the input word embeddings.
What are the applications of Transformer networks in natural language processing?
-Transformers are used widely in NLP, including but not limited to machine translation, text summarization, question answering, text classification, and language modeling. Their efficiency and strong representational power make them the model of choice for many NLP tasks.
Outlines
📘 Introducing attention and Transformer networks
This section introduces the use of attention in machine translation and presents the idea of the Transformer network, a new type of neural network proposed in the 2017 paper "Attention is all you need" that may no longer need recurrent building blocks. Using an analogy with human visual attention, it explains how attention focuses on specific regions in image recognition and how, in machine translation, it lets the decoder look back at the input sentence to avoid losing information.
🌟 Attention in computer vision and natural language processing
This section discusses attention in computer vision and NLP. In computer vision, a heat map shows which regions of an image correspond to buildings; in NLP, attention helps handle sentences of arbitrary length, and language modeling is used to predict the next word in a text sequence. It also compares the strengths and weaknesses of recurrent networks and Transformers, noting the limitations of recurrent networks in the number of training steps and in parallel computation.
🔗 A deeper look at how attention works
This section digs into the principle of attention, drawing an analogy with database queries: a similarity function computes weights between the query and the keys, which are then used to produce the output. It introduces different similarity measures such as dot product, scaled dot product, and additive similarity, and discusses how the weights are computed with a softmax and applied to the values to produce the attention value.
🛠️ Implementing attention as a neural architecture
This section describes how attention is implemented in a neural network: the query (Q), keys (K), and values (V) are used to compute similarities, and a softmax determines the weights. It explains how linear transformations mimic the database retrieval process, how to choose among similarity functions in practice, and how the network can learn these functions.
🔄 Architecture and advantages of the Transformer network
This section presents the basic architecture of the Transformer, including the encoder and decoder, and highlights the advantage of removing recurrence. It explains multi-head attention and how these attention blocks are repeated across layers to combine words. It discusses how the Transformer handles long-range dependencies and speeds up training through parallel computation.
📚 Inside the Transformer's multi-head attention
This section looks more closely at multi-head attention in the Transformer: values, keys, and queries pass through linear layers, then scaled dot-product attention and a concatenation step produce the output. Multi-head attention is compared to using different filters in a convolutional neural network to create different feature maps, capturing different aspects of the text.
🎭 Masked multi-head attention and the autoregressive decoder
This section discusses masked multi-head attention in the Transformer, explaining how the mask ensures that the decoder depends only on previous outputs, not future ones, when generating a sequence. This preserves the autoregressive property while avoiding recurrence during training. It also covers teacher forcing, a technique that feeds the correct outputs as inputs during training.
🛡️ Why normalization and positional encoding matter
This section emphasizes the role of the normalization layer in the Transformer, which helps reduce the number of gradient-descent steps and improves stability. It introduces positional encoding, explaining how adding position information to word embeddings preserves word order, which is essential to the meaning of a sentence.
📈 Comparing the Transformer with other networks
This section compares the Transformer with recurrent and convolutional neural networks, discussing computational complexity and path length when processing sequences. It points out the Transformer's advantages in capturing long-range dependencies and in parallel processing, as well as its potential impact in practice.
🚀 Transformers in machine translation
This section discusses the use of Transformers in machine translation and compares their performance with other models. While they did not always dramatically surpass the existing state of the art, they significantly reduced computation time, and their potential for improving efficiency is substantial.
🌐 GPT and BERT: Transformer variants
This section introduces two Transformer variants: GPT, which focuses on next-word prediction for language modeling, and BERT, which uses bidirectional encoding to improve performance. It discusses their applications across NLP tasks and how they are pre-trained without supervision on large amounts of data and then fine-tuned on specific tasks.
🏆 BERT and XLNet: Transformer models that set new state of the art
This section discusses how BERT achieved significant performance gains on many tasks and introduces XLNet, an improved version of BERT that lets the model consider broader context when predicting. XLNet surpasses BERT on several tasks, demonstrating the power and potential of Transformer networks in NLP.
Keywords
💡Attention mechanism
💡Transformer network
💡Multi-head attention
💡Encoder-decoder architecture
💡Masking
💡Positional encoding
💡Layer normalization
💡Long-range dependencies
💡BERT
💡GPT
💡XLNet
Highlights
Introduced the application of attention in tasks such as machine translation and presented a new class of neural networks: Transformer networks.
The core of the Transformer is the attention mechanism, which can replace some of the building blocks of recurrent neural networks (RNNs).
The 2017 paper "Attention is all you need" marked a breakthrough for attention in natural language processing.
The concept of attention was first studied in computer vision for object recognition.
Heat maps can show which regions a neural network attends to during image recognition, helping validate its decision process.
Transformers handle long-range dependencies through attention, giving them an advantage over recurrent networks.
Transformers train faster than recurrent networks and can fully exploit the parallel computation of GPUs.
Explained how attention works, including the concepts of query, key, and value.
Multi-head attention processes information from different perspectives simultaneously, increasing the model's expressive power.
The Transformer's encoder-decoder architecture processes a whole sentence in parallel, unlike traditional sequential processing.
Positional encoding lets the model capture word order, which is essential for language modeling.
Layer normalization helps reduce the number of gradient-descent steps and improves training efficiency.
In machine translation, the Transformer performed on par with the state of the art of the time while significantly reducing computation.
GPT (Generative Pre-trained Transformer) does language modeling via unsupervised learning and can handle a variety of natural language processing tasks.
BERT (Bidirectional Encoder Representations from Transformers) improves performance by considering context on both sides, surpassing GPT.
XLNet, an improvement on BERT, allows missing inputs and generalizes better, further improving results on many tasks.
The development of Transformer networks challenges the future of recurrent neural networks, showing advantages in both accuracy and speed.
Transcripts
okay so for the next set of slides I'm
gonna talk about attention in more
detail so we discuss already attention
in the context of machine translation
but attention has led to a new class of
neural networks that are known as
transformer networks okay and this this
material that I'm going to present today
is actually quite recent so it's not
described in any of the textbooks that
are associated with this course so for
this there is a paper that was published
in 2017 called attention is all you need
okay so it's a very interesting title
and it actually does suggest that
perhaps we don't need recurrent neural
networks anymore and then a lot of
the building blocks that people would
design, let's say LSTM units,
GRUs and other things like this
this essentially is suggesting that we
don't need them, all we need is really
just attention ok so let's see how this
works ok so it turns out that the
concept of attention was was first
studied in the context of computer
vision and here let's say that we have
some images we're going to recognize
some objects so as humans let's say that
we have a large image and then we're
looking for a tiny object then what we
will do naturally with our eyes is
roughly scan the image and eventually
focus on some regions of interest and
and then eventually identify the object
and then when we identify an object we
eventually I guess focus precisely on
the location where the object is and
then some studies have shown that our
eyes are indeed focusing on the right
regions when when we do this type of
identification so I guess here an
interesting question is is there a
mechanism
computer vision that could be similar
and would that be beneficial and the
answer is yes so here let's say that we
have an image
okay this image doesn't appear very well
on the screen on my laptop
it looks better so essentially you're
supposed to see some scene where there's
a house and a tower here so it says
essentially two buildings and then
there's there's the rest of the scene
around it and let's say that we're doing
object detection where we want to
recognize buildings so what this shows
here is a heat map that's overlaid on
top of the image so that's what you
don't actually see while the scene but
in any case this heat map shows that the
pixels or the region where buildings are
more likely to be are right here in the
red part and if you look at it more
carefully so it turns out that there is
indeed a house here and then there's a
tower here okay so now if you train some
network to do some classification an
interesting question is when the network
outputs a class how can we trust that it
comes up with the right class and you
know we could ask it if you tell me that
indeed there's a building in this image
can you show me where that building is
right and this could be a nice way of at
least validating understanding what it
thinks is a is a building in the image
and whether it's correct or not so so
here we can use attention to essentially
see which pixels are aligned with the
concept of a building and then so the
heat map corresponds to the weights here
for the attention mechanism so in this
case you could imagine having an
attention mechanism over the entire
image so you've got weights with respect
to all of the pixels right and then you
try to see I guess which pixels would
have some semantic meaning or some
embedding that would be essentially
aligned with the notion of of the object
that we're trying to recognize and then
so some researchers demonstrated that
you can do this and in fact here
attention can be used to highlight the
important parts of an image that
contribute to the desired output so it's
very nice in terms of explaining what is
the decision process going on
and it can also be used to as a building
block as part of the recognition process
okay so this was in computer vision then
in natural language processing in
2015 we saw the work about machine
translation where you can get your
decoder to essentially peek or look back
at what was the input sentence so that
it doesn't lose track of what it's
translating and it doesn't have to
remember completely the sentence so so
this was an important breakthrough that
allowed us to deal with sentences of
arbitrary length inside very long length
but then further than that in 2017 some
researchers show that we can use
attention to develop general language
modeling techniques so here language
modeling simply refers to the idea of
developing a model that would simply
predict the next word in a sequence so
it could generate words and then also if
let's say we've got a word missing
somewhere in a sequence and it could
also recover that missing word so
language model is essentially just a
model that can predict or recover words
in a sequence a lot of tasks can be
formulated as language modeling problems
so whenever we do translation you could
imagine that it's just a sequence where
we have a first part in one language
a second part in another language and
what we're really doing it's just
continuing the sequence where we're
predicting words of the second part that
happened to be in the next language and
for doing sentiment analysis we can
think of it this way too so most of the
the tasks in natural language processing
can be cast as some form of language
modeling and then what they showed is
that we can also design some
architecture that uses pretty much
exclusively some attention blocks and
then they call that architecture a
transformer okay so so we'll see in more
details now those transformer networks
and they've become now the state of the
art for natural language processing so
these surpass now recurrent neural
networks okay so if we do a comparison
between a recurrent neural network and a
transformer Network turns out that the
recurrent neural network has several
challenges the first problem that we've
discussed a lot is how can we deal with
long-range dependencies and here the
solution was in fact to combine the
recurrent neural network with some
attention mechanism it also suffers from
gradient vanishing and gradient explosion the
number of steps for training recurrent
neural networks can be quite large and
this has to do with the fact that a
recurrent neural network we can think of
it as essentially an unrolled network
that is very deep because when we unroll
it we have to unroll it for as many
steps as needed for a corresponding
sequence so recurrent neural networks are
effectively arbitrarily deep networks
and then so they have lots of parameters
those parameters are essentially correlated
between each other because we in fact
tied the weights from one step to
another and then so the optimization
tends to be quite difficult as a result
and it tends to require a lot of steps
okay so so training a recurrent
neural network usually takes a lot longer
than a convolutional net or regular
feed-forward neural net beyond the
number of steps there's also the
question of parallelizing computation so
today GPUs have become key for working
with large neural networks and then what
they really do is that they enable us to
do some computation in parallel but then
if you have a recurrent neural network
what happens is that the sequence of
steps write the computation has to be
done sequentially we can't process all
those steps in parallel because of the
recurrence ok so so there's an inherent
problem here we can't quite leverage
GPUs as well in the context of recurrent
neural networks now if we consider
transformer networks we're going to see
in a moment that because they
essentially use pretty much exclusively
some attention blocks then there won't
be any recurrence and attention will
also help us to draw some connections
between any part of the sequence so
long-range dependencies won't be a
problem
so in fact long-range dependencies will
have like the same likelihood of being
taken into account as short-range
dependencies so so that's not an issue
anymore now gradient vanishing and
explosion will also not be an issue
because instead of having computation
that goes linearly with the length of
the sequence into a deep network for a
recurrent neural network and a
transformer we're going to do the
computation for the entire sequence
simultaneously and then just build
several layers for that but there won't
be so many layers in practice so so
gradient vanishing and explosion
won't be as much of an issue in terms of
steps as well it will take far fewer
steps to train
and then because there's no recurrence
in those networks we're gonna be able
to do the computation in parallel for
every step so so so this is really nice
so I guess we've got lots of great
advantages for those transformer
networks okay so to introduce
transformer networks
let's review briefly attention so we're
gonna see again attention but more
generally so we've seen it already in
the context of machine translation but
now let's let's think about attention
essentially being some form of
approximation of a select that you would
do in a database so in a database if you
want to retrieve some value based on
some query and and then also based on
some key right then there's some
operations that you can do where you can
use a query to identify a key that
aligns well and then simply output the
corresponding value so we can think of
attention as essentially mimicking this
type of retrieval process but in a more
fuzzy or probabilistic way okay so
let me draw a picture just to illustrate
how things would normally work in a
database and then we'll see that in our
case we're going to enable the same type
of computation that would normally
happen in a database but using this
equation here
okay so let's say I've got a query and
then I've got a database so in my
database that's it I have stored some
keys with associated values
okay so I've got my database now when I
issue a query all right I will I will
see how well this query aligns with
different keys perhaps the key that is
the right one is let's see key number
three then what I would do is
essentially produce an output that
corresponds to the value okay so so
retrieval in the database would more or
less correspond to this and now an
attention mechanism is essentially a
neural architecture that mimics this
type of process and the way we're going
to mimic this is by using the following
equation so here we're going to measure
what is the similarity between our query
queue and each key value ki and then
this similarity is going to return a
weight and then we'll simply produce an
output that is a weighted combination of
all the values in our database now
normally with the database when we do
retrieval we simply return one value and
this would correspond here to finding a
similarity between the query and some
key where the similarity would have
value one and then all the other keys
the similarity would be value zero so if
we have this, if we've got a
similarity function that essentially produces
a one-hot encoding right then we would
effectively just return one value now in
practice because we'll we'll want to
embed this as part of a neural network
and be able to do back propagation to
differentiate through that type of
operation they'll be useful for us to
think of the similarity as instead
computing a distribution
so I guess weights that are between 0 &
1 and then even if we have multiple keys
that have some similarity the ideas that
we're going to produce a value that's
going to be a weighted combination based
on
weights okay so I guess we can think of
this as like a generalization of the
mechanism for a database where we make
the retrieval process become a convex
combination or weighted combination of
the values for which the keys have a
high similarity in in the database okay
so let's try now a neural architecture
that will correspond to this attention
mechanism so so it's going to be
essentially the same as what we've seen
already in the context of machine
translation but now we're just gonna
make it
domain agnostic so you're gonna see
more generally what attention really
corresponds to
okay so let's say that I've got t1 t2
actually shift this a little bit t1 t2
t3 and t4 we're going to have our query
and let's have s1 s2 s3 and s4 query is
going to influence the computation of
each one of those things so here what
I've drawn is a first layer we're
starting with the keys I'm gonna compute
a similarity measure so these s's
correspond to similarity okay so here SI
is going to be equal to some similarity
function of
the query Q and the key K_i
and now there are many functions that we
could consider so let me suggest a few
the first one could simply be a dot
product so
just like this okay so the similarity if
we think of the query and the key as
just embedding vectors right if I want
to measure their similarity a simple
thing is just to compute their dot
product so that's something common a
variant on this is going to be a scaled
dot product okay and then here D is the
dimensionality
of each key okay something slightly
more general we could also have Q
transpose W K_i so I'll call that a
general dot product and then finally
let's have W_q transpose Q plus
W_k transpose K_i so this will be an
additive similarity okay so this first
layer is computing the similarity
between some query Q and each keys each
key in in our database or in our memory
and to do this the many choices some
common choices in practice or just to do
a dot product or scale the dot product
by dividing by the square root of the
dimensionality this has the benefit of
simply keeping the dot products in in a
certain scale now more generally we can
also project the query into a new space
by using a weight matrix W and then
after that taking a dot product by a ki
and then another one is just to take a
sum, some combination of Q
and K_i and this is known as some form
of additive similarity now you could
think of other types of similarity for
instance earlier in the course we talked
about kernel methods that also measure
similarity they do this by essentially
mapping two vectors into a new space
through some nonlinear function so we
could have as well in here some form of
kernel similarity to yeah
okay so in this case we're not going to
have convolutions but here are you
suggesting that instead of comparing the
query to every key we could just do it
with respect to a subset of the keys
also the okay this w think of it as more
like we're simply transforming our query
to be in the same space as the keys so
okay to give you a concrete example
let's say that we're doing question
answering and then we would like to
let's say we have a database of possible
answers we've got a query and now they
say we've got an embedding of every
answer that corresponds to a key now the
query is a question so we can embed it
as well and now the problem is that
depending on the type of embedding we
compute if those embeddings are simply
computing the semantic meaning of these
sentences right there's no reason for us
to expect that the question and the
answer have the same meaning right so in
fact they should have different meaning
because the answer is supposed to
provide something that the question
doesn't have right but still then what
you can do is map them into a new space
where there we could interpret the
question and the answer is really being
things that we can compare directly and
then this matrix W serves that purpose
okay so it's just a high-level intuition
but the idea is that if you're not
confident that you can compute the
similarity directly then you can allow
your neural network to learn a mapping W
so here W is a set of weights and some
matrix that will essentially
map our query into a new space and then
it will just learn what that new space
should be okay so that's our first step
now the second step or the second layer
after this will be to compute the
weights so have a 1 a 2 a 3 and a 4
those weights
they depend on everything ok so here
this will be done through a softmax
so essentially AI is going to be equal
to the exponential of Si divided by the
sum over J of the exponential of SJ ok
so yeah so here it's a fully connected
network but not in the classical sense
it's more I'm just showing what hidden
nodes are using and computing these
weights this is a softmax we don't have
any weights all we're doing is just
computing this this expression ok and
then after this we're going to have our
weighted combination
so what I've shown here is that we
multiply a 1 by v1
alright so this means that we multiply
them then we add them to the product of
a 2 by V 2 we add this to the product of
a3 times v3 and so on and then
this produces the attention value ok so
the attention value is just a sum over I
of a_i v_i ok so I guess you see this is a
general scheme where we have some query
we have some keys and then we're going
to produce an output where the output is
really a linear combination of some
values where the weights come from some
notion of similarity between our query
and and the keys okay any questions
regarding this yeah
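(To make the scheme above concrete, here is a minimal NumPy sketch of the attention computation just described: a scaled dot product as the similarity function, a softmax to get the weights, and a weighted combination of the values. The dimensions and example numbers are arbitrary, chosen only for illustration.)

```python
import numpy as np

def attention(query, keys, values):
    """Weighted retrieval: similarity -> softmax weights -> weighted sum of values.

    query:  (d,)   embedding of the query q
    keys:   (n, d) one key vector k_i per item in memory
    values: (n, m) one value vector v_i per key
    """
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)        # s_i = q . k_i / sqrt(d)  (scaled dot product)
    weights = np.exp(scores - scores.max())   # softmax: a_i = exp(s_i) / sum_j exp(s_j)
    weights = weights / weights.sum()
    return weights @ values                   # attention value = sum_i a_i v_i

# toy example with 4 keys/values of dimension 3
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=3), rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
print(attention(q, K, V))
```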
right so here the W matrix should span
the space that we care about now
we don't specify W right these are
variables that are going to get
optimized by the neural network itself
so that's the beauty of this right so
here again W in general indicates
weights that are parameters or variables in
the neural network and then whatever the
task right we're going to do back
propagation and then these weights are
going to get adjusted so the neural
network is going to learn on its own
what might be a good space to project Q
into so that then we can take a dot
product with K yeah oh good question
okay yeah the a_i's are scalars so I guess
okay so the k_i's are vectors
okay
the s_i's are scalars the a_i's are
scalars and then the v_i's are vectors
yeah so here these scalars or our
weights and then there's going to be a
weight like this for every possible
word like if we're doing machine
translation and now we're about to
produce an output and we want to compute
the attention of respect to the input
words then we're going to have
essentially one one yeah one weight per
output word and then the VI is are
essentially going to be the hidden
vectors associated with each input word
okay so in fact this is what I've got
here on this slide right so as a
concrete example we've talked about
machine translation in a previous set of
slides and here if I simply use as a
query s_i the hidden vector for the i-th
output word and then for the keys and
the values and this particular setting
I'm gonna have the same thing so both
the keys and the values are going to be
the h_j's, these are the hidden vectors
for the input word that allows me to
essentially compare my my hidden vector
for an output word to each one of the
hidden vectors for an input word and
then essentially combine them together
to produce a context vector that
reflects what are the words that I'm
interested in in decoding next or
translating next yeah
okay great question yeah so we haven't
talked about yet what is a transformer
all I'm doing so far is just explaining
in a general form what is the attention
mechanism but I believe it's coming up
in a few slides all right so we've
discussed all kinds of networks and more
recently it's been the focus has been on
sequential data we've seen hidden Markov
models then the recurrent neural network
and now we're talking about transformers
so I want to go back and discuss in more
detail the transformer network that was
presented in 2017 so this network is
special because as we discussed it gets
rid of recurrence so this was a major
thing so recurrence means that the
optimization tends to take longer for
two reasons the number of iterations so
the number of steps in gradient descent
will be higher and also recurrence means
that we have several operations that are
going to be sequential and then we
cannot parallelize them as easily so the
beauty of having a GPU is that in
principle you can parallelize lots of
operations but if these operations are
going to be sequential then you cannot
right so so we would like to reduce as
much as possible recurrence okay so on
this slide here we've got a picture of
the transformer network that was
proposed in 2017 and this network has
now displaced pretty much recurrent
neural networks for sequential data so
this was a major shift in in the
thinking that people have now with
respect to sequential data okay so if we
take the example of machine translation
so this network even though it's not
obvious has two parts
the first part that corresponds to an
encoder the second part that corresponds
to a decoder and here you see in machine
translation you would use the encoder to
encode your initial sentence and then
you would use the decoder to produce a
translated sentence now this will
process an entire sentence in parallel
as opposed to a recurrent neural network
that will essentially process one word
at a time in the sentence so if you look
carefully you see the inputs here would
actually be the entire sequence of words
so we would feed them all in at once and
then they would get embedded and then
after this we would add a positional
encoding I'll come this I'll come back
to this later on in a few slides but
this is important essentially to make
sure that we can distinguish words that
occur in different positions within a
sentence the problem is that if we don't
have a positional encoding then we would
effectively have a model here that
treats the word as if it was a bag of
words as opposed to a sequence and we
all know that in languages the sequence
actually matters so the ordering matters
so we need to still capture this
information and then so the positional
encoding achieves that ok so now the
important part of a transformer network
is essentially this block here it
consists of two sub parts so does the
multi-head attention and then here a
feed-forward neural network
now the multi-head attention is is
really where all the good stuff happens
so here the ideas that we feed in again
a vector that would consist of sub
vectors of all the the words that we
have in our sentence and then the
multi-head attention is going to compute
the attention between every position and
every other position so we have vectors
that embed the words in
one of those positions and now we simply
carry out an attention computation that
will essentially treat each word as a
query and then find some keys that
correspond to the other words in the
sentence and then take a convex
combination of the corresponding value
so here the values are going to be the
same as the keys and then take a dot
product of that to produce a better
embedding so the idea is that this
multi-head attention will essentially
take every word combine it with some of
the other words through the attention
mechanism to essentially produce a
better embedding that that merges
together information from pairs of words
now when we do this in one block we
essentially look at pairs of words
together but now if we repeat this
multiple times so this n times here
means that we're going to have this
block that's going to be repeated n
times we're going to have n stacks of
those blocks and now you see in the
first block we look at pairs and the
second block we're looking at pairs of
pairs and then the third block we're
going to look at pairs of pairs of pairs
so essentially we're combining now more
than just two words but groups of words
that that get larger and larger and
larger okay so that's what the
multi-head attention does then we have
on top here another layer and that's
called Add & Norm this is essentially
adding a residual connection that takes
the original input to what comes out of
the multi-head attention and then it
normalizes this so here Norm is
essentially a layer normalization we'll
come back to this in a second but it
essentially means that we we take all of
our entries and then we normalize them
to have zero mean as well as variance
one and then we feed this into a
feed-forward Network there's again a
residual connection
and then a normalization so this block
is repeated n times so that then we can
combine not just pairs of words but
pairs of pairs of words and and so on so
that eventually you see you can combine
together all the words in in the
sentence
okay so the output of this is going to
be again a sequence of embeddings
there's going to be one embedding per
position and intuitively the embedding in
that position captures the original word
at that position but also information
from the other words that it attended to
throughout the network okay so you can
think of this is essentially just a
large embedding of all those words
corresponding to each position so that's
our encoding of the input sentence then
after this we have the decoder which
will do something similar but obviously
the main purpose of the decoder is to
produce some output not just to embed
but produce some output so that's what
we're gonna have some additional stuff
on top here where we have a softmax that
produces some probabilities for let's
see outputting a label in each position
okay now inside the block so this this
block will also repeat n times what we
do is we have first a multi-head
attention that looks at simply combining
output words with previous output words
and then there's another block of
multi-head attention that now combines
output words with input words and then
finally a feed-forward Network again
okay so here you see we have two layers
of attention the first layer is really
just self attention between the output
words so now the problem though with
output words is that when you generate a
sequence as output you can only generate
the next word based on the previous
words so when you do the attention you
need to be careful
to make sure that you only attend to
previous words and that's why this one
is called a masked multi-head attention
because we can mask the future words so
that each word is only attending to the
previous words the second multi-head
attention here is now combining or is is
I guess that making sure that each
position in the output is attending to
positions in the input so this is where
a bit like in machine translation
whenever you want to produce an output
and it's good if you can kind of peek
and look back at what was your input
sentence and here we're gonna look at
the embeddings of each position in the
input so that's why you see we've got
these arrows that go in okay and this
will be repeated and times again so that
the ideas that we gradually build up
combinations and then get better and
better embeddings
until we produce an output and here the
output can be a distribution over the
words in the dictionary alright so for
every position there is a word that
we're trying to generate and then we're
gonna compute some distribution over the
words and in the dictionary any
questions regarding this slide okay
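(As a rough picture of the encoder block just described, the following NumPy sketch, with a single attention head and random untrained weights, shows only the data flow rather than the actual implementation from the paper: self-attention, a residual connection with layer normalization, then a position-wise feed-forward network with another residual connection and normalization. All sizes are made up for illustration.)

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # normalize each position's embedding to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def self_attention(X, Wq, Wk, Wv):
    # every position attends to every other position of the same sequence
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # (n, n) pairwise similarities
    return softmax(scores) @ V

def encoder_block(X, params):
    # attention (one head here) + residual connection + normalization
    h = layer_norm(X + self_attention(X, params["Wq"], params["Wk"], params["Wv"]))
    # position-wise feed-forward network + residual connection + normalization
    ff = np.maximum(0, h @ params["W1"]) @ params["W2"]
    return layer_norm(h + ff)

n, d, d_ff = 5, 8, 16                              # sentence length, model dim, hidden dim
rng = np.random.default_rng(0)
params = {k: rng.normal(scale=0.1, size=s) for k, s in
          [("Wq", (d, d)), ("Wk", (d, d)), ("Wv", (d, d)),
           ("W1", (d, d_ff)), ("W2", (d_ff, d))]}
X = rng.normal(size=(n, d))                        # embedded input sentence
print(encoder_block(X, params).shape)              # (5, 8): one new embedding per position
```

Stacking this block n times is what lets the network combine pairs of words, then pairs of pairs, and so on.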
good alright so in the transformer
Network perhaps the most important part
is this multi-head attention so I'm
gonna draw on the board what this
corresponds to now mathematically the
multi-head attention is essentially this
expression that decomposes according to
these operations okay so for the
multi-head attention as we talked about
last class whenever we want to design an
attention mechanism the general way of
thinking about it is that we have some
key value pairs just like in a database
and then there's a query that we're
going to compare to each keys and and
then the keys that have the greatest
similarity are going to have the highest
weights and now we can take a weighted
combination of the corresponding values
to produce the output so we're going to
feed V, K and Q into some linear layer
then after this we're going to compute a
scaled dot-product attention
then we're going to concatenate these
outputs then we'll have another linear
layer and then the output of this is
going to be our multi-head attention
okay now this is called multi-head
because in reality and I haven't drawn
this yet on the board we're going to
compute multiple attentions so here when
we take a linear combination perhaps
there's different we can think of this
linear combination as really being a
projection of the values V same thing
for K, same thing for Q and we could
consider several projections so here I'm
gonna use this to indicate that I could
compute three different types of
projections by kinda looking at three
different linear combinations of the
values same thing for K
same thing for Q so that I get
essentially three different projections
now each one of them I can now compute a
scaled dot-product attention so I will
get three scaled dot-product attentions
and the way to think about these
different linear combinations these
different scaled dot-product attentions it's
a bit like feature maps and
convolutional neural network so we saw
the in convolutional neural network you
can compute multiple feature Maps simply
by having different filters so here
these linear combination are a bit like
different filters although in this case
you can think of them as more like
projecting or simply changing the space
in which the values reside okay so so
here this will give us different
projection on different spaces a bit
like multiple filters into a
convolutional neural net and then when
we compute the scale that product
attention then for each projection is
going to be a different one different
scale dot product attention so we're
going to get multiple of them and that
corresponds more or less to having
multiple feature Maps that's the same
intuition so then this contact layer is
going to concatenate these different
skel product product attention and in
founding we take a linear combination of
them and this gives us a multi-head
attention because we essentially
computed multiple attentions so here
there's three of them so we can think of
these as h so this would be the number
of heads in multi-head attention okay
so the idea is that there's one head per
linear combination so here there's three
of them and in general there's going to
be h of them and that's where the idea
of the name multi-head comes from
okay any questions regarding this good
ok
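(Here is a small NumPy sketch of the multi-head computation drawn on the board: separate linear projections of V, K and Q for each head, a scaled dot-product attention per head, concatenation, and a final linear layer. The number of heads and all dimensions are made-up values for illustration.)

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # (n_q, n_k) similarities
    return softmax(scores) @ V                  # (n_q, d_v)

def multi_head_attention(Q, K, V, proj_q, proj_k, proj_v, W_o):
    heads = []
    for Wq, Wk, Wv in zip(proj_q, proj_k, proj_v):
        # each head projects Q, K, V into its own subspace, like a separate "filter"
        heads.append(scaled_dot_product_attention(Q @ Wq, K @ Wk, V @ Wv))
    concat = np.concatenate(heads, axis=-1)     # concat layer
    return concat @ W_o                         # final linear layer

# toy setup: 3 heads, model dimension 6, per-head dimension 2
rng = np.random.default_rng(1)
n, d, h, d_k = 4, 6, 3, 2
proj_q = [rng.normal(size=(d, d_k)) for _ in range(h)]
proj_k = [rng.normal(size=(d, d_k)) for _ in range(h)]
proj_v = [rng.normal(size=(d, d_k)) for _ in range(h)]
W_o = rng.normal(size=(h * d_k, d))
X = rng.normal(size=(n, d))                     # self-attention: Q = K = V = X
print(multi_head_attention(X, X, X, proj_q, proj_k, proj_v, W_o).shape)  # (4, 6)
```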
all right so besides just a regular
multi-head attention we also have in the
decoder a masked multi-head attention and
here the idea behind the masked multi-head
attention is that some of the values
should be masked meaning that the
probabilities should be nullified so
that we don't create any combinations so
for instance in the decoder when we
produce the output let's say we're doing
machine translation so we have a
sentence let's say in English then we're
translating that into French we start
producing the words in French when we
produce a word right then it's okay for
that word to depend on the previous
words in the translation because we're
generating them sequentially but it
doesn't make sense for that word to
depend on the future words because we
haven't produced them yet right so so
here what we need to do is to
essentially change our attention
mechanism so that we we would nullify or
effectively remove links that would
create dependencies on words that we
haven't generated yet and so this is
what we call a masked multi-head attention
okay so here the main difference is that
in the attention mechanism normally we
just compute a softmax according to this
expression but now instead with a masked
attention we're going to add a mask here
that will effectively produce some
probabilities that are zero for the the
terms that we don't want to attend to
because it would be future terms so here
you see in a soft max what I normally do
is take the exponential divided by the
sum of exponential so now if I add a
mask which is a matrix of zeros and
minus infinity so adding minus infinity
when I take the exponential of minus
infinity it gives me zero so this has
the effect of ensuring that the
probabilities of certain items here are
going to be zero because I'll take the
exponential of minus infinity and this
will have the same effect as removing
connections so at some level you can
think of this as a form of dropout but
here it's not a dropout in the same
sense that we saw at the beginning of
the course
right so dropout for regularization you
would drop out some connections or
remove some connections at random
according to some distribution here we
remove connections that are essentially
pointing at words that we haven't
produced yet so this is more like a
deterministic type of dropout if you
wish because we we would never have
those connections so here as opposed to
sampling those connections with a
distribution you see we use a mask and
then inside the softmax the mask will
have the effect of nullifying some of
the connections because the exponential
of minus infinity will be zero any
questions regarding this yeah
yeah okay good question so yeah here in
the paper they add a mask with values
that are minus infinity now perhaps the
more intuitive approach would simply be
to multiply the softmax by some sort of
Hadamard product with values that are 0
and 1 now when we do it outside with
values that are zero in one what happens
is that the softmax produces a
distribution that adds up to 1 now we're
going to nullify some of those
probabilities and then the sum of
the probabilities that are left is not
gonna add up to 1 whereas if we do it
inside right by adding a matrix that
might have values that are minus
infinity it means that when we take the
softmax these are gonna have zero
probability but then all the other
values are gonna have probabilities that
still sum up to one so this ensures that
we still have a proper distribution
okay and if we just go back to this
slide you see what happens is that when
we produce an output let's say I produce
my first word here then that first word
could be fed as input for the next
position right so when I want to produce
a word for a certain position it's okay
to look at the previous words and this
is where the masked multi-head attention
will will apply so we're gonna have a
mask it's a matrix that will essentially
be lower triangular that has essentially
values that are zero in the lower
triangular part and minus infinity in
the upper triangular part to essentially
nullify everything that happens in the
future and the other thing that might
not be obvious is that it looks like
here when we've got some output it gets
fed back in as as input here and that
looks like it's creating a recurrence so
here this is not creating a recurrence
per se simply because there is a method
for training known as teacher forcing
where the idea is that when you train
the network you have both what is the
input sentence and the output sentence
and then you can simply say well let me
assume that my outputs are correct
everywhere so I'm going to feed that as
input here so I'm going to feed in what
are the correct output words for the
previous positions and then I'll simply
try to predict what is the next word
based on that okay so with this scheme
that is known as teacher forcing then
you can essentially decouple the output
here from the input here and we don't
have any recurrence relation in training
now at test time then what happens that
you really have to execute this network
with the recurrence relation but that's
okay so where there's really again is a
training time so training is what takes
a long time and if we can remove all
recurrence relations so that we can do
all the computation in parallel this
will be a lot faster and then through
this teacher forcing trick then what we
do is we simply assume that we have the
correct output words for the previous
words and then we feed them in here as
if they were given to us and then we
simply try to predict what is the next
output any questions regarding this okay
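(A minimal sketch of the masking trick just described: a matrix with zeros on and below the diagonal and minus infinity above it is added to the scores before the softmax, so each position puts zero probability on future positions while the remaining weights still sum to one. The shapes are illustrative.)

```python
import numpy as np

def causal_mask(n):
    # 0 on and below the diagonal, -inf strictly above it (the future positions)
    return np.triu(np.full((n, n), -np.inf), k=1)

def masked_attention_weights(scores):
    # scores: (n, n) raw similarities between output positions
    masked = scores + causal_mask(scores.shape[0])
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))   # exp(-inf) = 0
    return e / e.sum(axis=-1, keepdims=True)                  # rows still sum to 1

scores = np.random.default_rng(0).normal(size=(4, 4))
A = masked_attention_weights(scores)
print(np.round(A, 2))   # row i has non-zero weight only on positions 0..i, e.g. row 0 is [1, 0, 0, 0]
```

With teacher forcing, the correct previous output words are fed in as if already generated, so all rows of this masked attention can be computed in parallel during training.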
let's continue okay so the other
important layers are the normalization
layer and also the positional embedding
so the normalization layer is actually
quite important and quite interesting
so it's this layer that we saw right
here so on top of every multi-head
attention and feed-forward network
there's here a normalization layer so
what this does is that it helps to
essentially reduce the number of steps
needed by gradient descent to optimize
the network here whenever we've got a
network with multiple layers we've got
weights in each layer and then those
weights are going to be trained by
gradient descent but when you look at
the formula for the gradient it's it's
often that the case that to compute the
gradient of one set of weights then it
depends on the output of the layers
below and also what is being computed in
the layers above right so depends on on
what is being computed below and above
now the problem is that if we're still
adjusting the weights but below and
above now when we compute the gradient
then things are not going to be stable
at some level it's like we'd rather
wait till all these
layers have stabilized and then we can
optimize the gradient in the middle
properly the problems we can't do this
because we have to optimize essentially
all of those layers simultaneously and
then there's this effect where you see
you change some weights that affects the
other layers then you change those
weights that affects the layer that you
just changed and and so on and then so
it it makes the the convergence quite
slow because we've got all these inter
dependencies now there's no way of
completely getting rid of the inter
dependencies because if we did that
would mean that we're essentially
breaking our network into parts that are
not connected anymore but one thing we
can do is to do some normalization when
we normalize what this does is that it
ensures that the output of that layer
regardless of how we set the weights are
going to be normalized they're going to
have a mean of 0 and a variance of 1 so
the scale of these outputs is going to
be the same ok so so now to obtain the
same scale what we can do is you see for
each hidden unit we can subtract from it
the mean so the mean would be just the
average the empirical average like this
and then we can divide by the standard
deviation which is here the square root
of the empirical variance and then
there's also a variable G which is known
as the gain that's added to essentially
compensate for the fact that we've just
normalized but the idea is that with
this approach then we can ensure that
you see if G is is set to 1 then this
would ensure that H is always normalized
with zero mean and variance 1 and
therefore you see if there's some
gradient computation that depends on the
output of that layer the outputs of that
layer are always going to be the same
scale they're going to vary but these
are going to remain on the same scale
and then
as a result the other gradients when we
compute them they don't have to adjust
simply because we were changing the
scale of those outputs so that reduces
the dependencies between the layers and
it tends to make the the convergence
faster okay any questions regarding
normalization okay
perhaps one thing I should see as well
some of you might have heard about batch
normalization so this is closely related
to batch normalization but the main
difference is that we're doing the
normalization at the level of a layer
whereas batch normalization would do it
for one hidden unit but by normalizing
across a batch of inputs the advantage
of layer normalization is that we don't
need to worry about how large our batch
is so batch normalization only works well if
you have fairly large batches whereas
here we can feed in one data point at a
time we can have mini batches that are
very small in fact we can be in an
online or streaming setting where we
just feed in one data point at a time
and we can still do the normalization
and it still has the same effect as batch
normalization in terms of decoupling how
the gradients evolve in different layers
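(As a sketch of the formula just described, layer normalization of a single hidden vector subtracts the empirical mean over the hidden units, divides by the empirical standard deviation, and rescales by a gain g; note that, unlike batch normalization, it needs only one data point. The example vector is made up.)

```python
import numpy as np

def layer_norm(h, g=1.0, eps=1e-5):
    """Normalize the hidden vector of one layer for one data point.

    h: (d,) outputs of the hidden units in that layer
    g: gain that rescales the normalized outputs
    """
    mu = h.mean()                      # empirical mean over the hidden units
    sigma = np.sqrt(h.var() + eps)     # empirical standard deviation
    return g * (h - mu) / sigma

h = np.array([2.0, -1.0, 0.5, 3.5])
out = layer_norm(h)
print(out.mean(), out.std())           # ~0 and ~1 regardless of the scale of h
```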
okay so the other part that is important
is the positional embedding here so if I
just go back we introduced a positional
embedding right after the input
embedding the idea is that with the
attention mechanism it doesn't care
what's the position of the words so the
words could be all shuffled we could
consider them as a bag of words and if
it wasn't for the positional embedding
right we would get the same answer and
at some level that's not good because
sentences the ordering of the words is
important in order to tell us
what's the meaning right so the ordering
does carry some meaning so we need to
still capture some of that information
and here this is really an engineering
hack okay so it's not clear that this is
really the best way to to still capture
the ordering but the idea is that we're
gonna you see we have already an
embedding here that is supposed to
capture information about each word now
let's just make that embedding capture
information about the word and also its
position so we're simply going to add a
vector that is known as the positional
encoding and that vector is going to be
different depending on what is the
position so it's just a vector that
embeds the position which is an integer
and then we add that to the embedding of
the word okay so what's the precise
formula for the the positional embedding
it's given here the idea is that we we
have a position which is an integer and
then we embed it into a vector so it's a
vector what with multiple entries and
now each entry is going to be computed
according to the sine of the position
divided by 10,000 to the power of 2i over d
or the cosine of the position divided by
10,000 to the power of 2i over d okay so
here just to illustrate let me draw
something on the board
so we have the position
we're going to compute these on this a
position embedding so here this is a
scalar and this is a vector okay so the
idea is that we already have an
embedding for the word and this
embedding is a vector now we want to
encode as well the position so obviously
there's multiple ways in which we could
do this the simplest could just be to
have an integer for the position maybe
append that or concatenate that with the
embedding of the word but in in this
work the authors simply chose to add to
this a vector and then it's going to be
a vector of the same dimensionality as
the embedding of the word so often these
vectors are going to be hundreds long
and now how do we go from a scalar to a
vector well this is where the formula
gives us a way to obtain a different
value for each entry of the vector so
you see for the even entries we're going
to compute the sine of this and then for
the odd entries we're going to compute
the cosine of this and here really this
is something that's debatable okay we
could consider perhaps different ways of
coming up with a positional embedding
but at least the key is that it carries
information about the position which
allows us to distinguish each word so
that then the our sentence still retains
ordering information yeah
yeah very good points we are just adding
the positional embedding to the word
embedding could maybe affect the
information that is included in the word
embedding and yeah my gut feeling as
well is that it might just be better to
concatenate this so that we don't lose
the information but in any case that's
what the author's chose to do and it
seems to work relatively fine okay
all right so if we compare now a
transformer network with a recurrent
network or convolutional neural network
we get the following complexity
estimates so here the transformer
network is the one called self attention
in a self attention Network what happens
is that in each layer here I guess a
layer would consist of n positions so
if we've got a sentence of size n
so we're going to have n positions and
now when we compute an embedding for
each position it would have
dimensionality D so the order of
computation the complexity of the
computation in one layer is going to be
order n square simply because for every
position we're going to try to attend to
every other position so all the pairs is
n square and then for each pair we're
going to compute an embedding of
dimensionality D so that's that's the
complexity we get now the benefit is
that if we want to capture long-range
dependencies then the maximal path
between any two positions is just one
simply because we have our attention
mechanism that that combines in one
operation essentially every pair of
words so now information can flow
between pairs of words in one step so so
then we don't have to worry about
information being lost like in the
recurrent neural network where the first
word that we process you know gets
embedded but then after that when we
process additional words then this
embedding changes and eventually it
loses information from the first word
all right so so here it allows us to
essentially combine together information
about pairs of words immediately so
that's a good thing so here a path
length of size one is great and the
other important aspect
is that here there's no sequential
operations since we have a sentence
of size n but now we essentially
process all of the words simultaneously
we're going to look at every pair of
words simultaneously so I guess this is
where we have this n square that creeps
in but on the other hand all of this can
be parallelized so today with GPUs we want to
exploit parallelization and it's better to
in fact not have to process the word
sequentially but then to do everything
in parallel so even though we have a
factor n square this n square in
practice might not be so bad simply
because we're going to do a lot of those
operations in parallel in contrast a
recurrent neural network will have this
complexity so in each so I guess here
the idea of a recurrent neural net
having multiple layers the way to think
about it is that normally we think of a
layer as being every word that we
process but then another way of thinking
about is that you could have stacks of
recurrent neural networks so we haven't
talked about this but it's actually
something common that people would do in
practice and that makes the network even
more complex and heavier to Train but if
you have stacks of recurrent neural
networks then here we will have some
computation let's say that we have n of
those stacked together so the computation
would be n times d squared in just one
layer in one stack ok and then the d
square comes from the fact that we have
an embedding of size d and after you
consider gru on LST m or even just a
linear unit that produces the next
embedding of size d then typically
you're going to have a matrix of weights
that's d by d that will essentially
multiply this hidden vector
to get the next hidden vector so that's
how the d square shows up it's we have
sequential operations because we have to
go through the entire sequence and then
the path length can be up to
size n because if you want to combine
information for the first and the last
word then you've got a long path to go
okay so in general this will be quite
advantageous it will help to reduce
computation and I guess improve
scalability quite a
bit any questions regarding this okay
now in the paper in 2017 regarding
transformers there was a comparison done
for machine translation here they
compared English to German, or rather
they did translation between English and
German as well as English and French and
then they compared a bunch of models and
their models are down here okay so
they've got a base model and then a
bigger transformer if you look at the
results they're not really outstanding I
mean they're improving at least in this
case for English German improving a
little bit the accuracy so here BLEU if
you recall is a measure of precision
where you look at the percentage of
words in the output translation that are
part of some human translation roughly
speaking so the higher the better is the
score and then so I guess it outperforms
a little bit here they came
close to the state of the art but now
what's beautiful is the fact that they
reduce the computation significantly so
here I mean those numbers look horrible
because when you see 10 to the 18 I mean
that's scary and then 10 to the 19 or
worse 10 to the 20, 10 to the 21
but now the difference between 10 to the
18 and
10 to the 19th that's a factor of 10
right so something that would take 10
days that might take one day and with
respect to 10 to 20 that's a reduction
of a factor of a hundred and for 10 to
the 21 that's a reduction by a factor of
a thousand so this is a major reduction
in terms of the training time and here
I'm not sure I don't recall from the
paper whether that takes into account
parallelism or not
but in any case yeah this gives you a
sense about the fact that really a big
advantage is that it reduces computation
while still achieving the state of the
art okay any questions regarding this
good yeah
okay so that's a good question yes so
the training cost is different for
different languages it might have to do
with how much data they use for training
but then presumably that would have an
effect here too I'm not sure because
they're presumably there should be a
difference here - I'm not sure we'll
have to look it up in the paper
okay so transformer was essentially the
starting point of a new class of neural
network that do not rely on recurrence
and then another important type of
transformer is known as GPT and then I
guess an improved version known as GPT-2
so these were proposed in 2018 here when
they were proposed the idea is that they
did unsupervised language modeling and
here language modeling is a general
task where you say I've got a sequence
of tokens a sequence of words and I'm
simply going to predict what the next
word is and it turns out that a lot of
tasks in natural language processing can
be formulated as some form of language
modeling so if you take machine
translation and you concatenate the
input sentence and the output sentence
into just one long sequence and now
let's say you've got a language model
that essentially simply predicts what is
the next word in the sequence and
furthermore let's say that this language
model doesn't care about whether it's
English or French or any language it
just predicts the next word in in the
sequence then you can train it to
essentially do translation right so you
just feed it with the inputs and then it
will predict what the next words are in
the other language so a lot of tasks
can be formulated this way where you
just create a sequence and then just by
the virtue that is gonna predict the
next thing if the next thing is what you
care about maybe it's a classification
maybe it's another sequence of words
then you can do do all of those tasks
with a language model so here they did
something really interesting where they
they trained some decoder transformer so
because they were only going to predict
the next word given the previous word
and then there was no need to really
separate some input sequence from an
output sequence then they did not really
need to have the encoder part so in the
transformer architecture they actually
got rid of the encoder and then they
worked only with the decoder so the
decoder attends to the previous outputs
which could be considered the input and
then it never attends to the future
output so it can just generate
sequentially right so so that's the main
thing compared to the transformer
network they essentially just got rid of
the the encoder the other thing they did
here is what they call a zero shot
learning so it means that they simply
took a very large corpus they trained on
this corpus to predict the next word in
the sequence irrespective of what the
task is and then they simply applied
this to different tasks where the
network was not tailored or was not
fine-tuned for that task
it was just trained generally speaking
to to predict what the next word is in
the sequence and then so they did this
for some tasks that correspond to
reading comprehension translation
summarization question answering and we
can see in blue their performance here
the performance improves as we increase
the number of pounders for the language
model and then the compare to
state-of-the-art techniques now their
approach was general it was not trained
specifically for a particular task
whereas in this case, for
instance, PGNet, DrQA, DrQA + PGNet and so
on are trained specifically for that
task so what's beautiful is that in a
completely unsupervised fashion without
really being tailored to that task they
managed to come close to the state of
the art and then it's fairly general
it can be used for many tasks right so
you can see the results it doesn't beat
the state of the art on most of those
tasks but it does beat at least
some techniques that were tailored and
then I guess if you keep on improving
those curves suggest it would lead
to further improvements okay any
questions regarding GPT okay so let's
continue now GPT was not the last one
there's another one called BERT that has
become quite popular and it was proposed
this year; BERT stands for Bidirectional
Encoder Representations from
transformers so it's another variant of
the transformer network and the main
advance that is being proposed here is
that instead of just predicting the next
words in the sequence
why don't we predict a word based on the
previous word and then the future words
so there's lots of tasks including
machine translation where if you think
about it if you're given a sentence as
input there's no reason why you have to
really produce the the sentences output
sequentially one word at a time you
could work on your translation by coming
up with some sections of the translation
and gradually building up your your your
translation but you don't have to do it
perfectly sequentially right so a lot of
tasks are actually like that and then
what it means is that you could take
advantage of what comes before and what
comes after so it's a bit like
bi-directional recurrent neural networks
that improve with respect to
unidirectional recurrent neural networks so
here it's a bi-directional transformer
and naturally it does better than GPT so
the they tried it on a bunch of on a
bunch of tasks in fact eleven tasks and
then here what they did is they did
unsupervised pre-training just
like GPT but then to really compete with
the state-of-the-art they did some
further fine-tuning with data
specifically for that task so the
proposal is we first train a
general network unsupervised with lots
of data and then we fine-tune the
parameters by doing some further training
with data specifically for that task and
when they did that then they obtained
those results and here this is quite
impressive because okay they improve the
state of the art on eleven tasks and if
you look at some of those tasks like for
instance this task here they improve the
state of the art from 45 to 60 okay so
this is a major improvement okay any
questions regarding BERT alright so BERT
is again not the last network there is
another network that was just made
public about a month ago called XLNet
and now XLNet beats BERT as
well okay I don't have a slide for it
but I can tell you roughly speaking that
the main difference is that BERT
essentially assumes that we've got I
guess everything in the window before
and after
whereas XLNet allows missing inputs
in a sense and then tends to generalize
better by looking at different subsets
of words before and after and as a
result it tends to generalize better and
then it improves again
on a lot of tasks I don't remember what
the number of tasks is but in general it
beats BERT across the board for most
tasks okay so this has been a fruitful
direction and then it's become quite
clear now that those transformer
networks can perform quite well both in
terms of accuracy and also in terms of
speed and and that becomes questionable
what the future of recurrent neural
networks will be okay any questions
regarding this alright okay so this
concludes this set of slides