Transformer Neural Networks, ChatGPT's foundation, Clearly Explained!!!
Summary
TLDR: This video clearly explains how Transformer neural networks work. Josh Starmer shows, step by step, how a Transformer translates a simple English sentence into Spanish. Word embedding converts the words into numbers, and positional encoding keeps track of word order. Self-attention and encoder-decoder attention capture the relationships among words, while residual connections let each subunit focus on its own part of the problem. By combining these techniques, the Transformer can translate the input phrase accurately, preserving the relationship between the input and output phrases without losing important words in translation.
Takeaways
- 🤖 Transformer neural networks are widely used for natural language processing tasks.
- 📈 Word embedding is a technique that converts words into numbers so they can be used as input to a neural network.
- 📊 Positional encoding is used to keep track of the order of the words in a sentence.
- 🔍 Self-attention is used to capture the relationships among the words within a sentence.
- 🔄 Encoder-decoder attention keeps track of the relationship between the input and output sentences, improving translation quality.
- 🔧 Residual connections are used to make complex neural networks easier to train.
- 🔢 Transformers are designed to take advantage of parallel computing so they can run fast.
- 📚 Stacking multiple self-attention cells lets a Transformer capture the relationships among words in complicated sentences and paragraphs.
- 📈 The training process uses backpropagation to determine the optimal weights.
- 🔧 To fit more complicated data, Transformers can include additional neural networks with hidden layers in the encoder and decoder.
- 📝 The original Transformer model handled a very large vocabulary (37,000 tokens) and long input and output phrases.
Q & A
What is a Transformer neural network?
-A Transformer neural network is a type of neural network used for natural language processing tasks such as translation and text generation.
What is word embedding?
-Word embedding is a technique that converts words and symbols into numbers, putting them into a form a neural network can work with. This lets an input sentence be turned into a sequence of numbers.
What is the purpose of positional encoding?
-Positional encoding is used to keep track of the order of the words in a sentence. It lets the Transformer retain information about word position so it can correctly interpret the meaning of the sentence.
What is self-attention?
-Self-attention is one of the Transformer's mechanisms; it captures how each word in a sentence relates to every other word. This lets the model understand which parts of the sentence a word refers to and choose appropriate words during translation or text generation.
What are the encoder and decoder?
-The encoder converts the input sentence into a numerical encoding, and the decoder generates the translated sentence from that encoding. A Transformer combines the two to perform translation tasks.
What are residual connections?
-Residual connections are bypasses that let each Transformer subunit (for example, self-attention) focus on solving its part of the problem while information from earlier stages is carried forward. This makes complex neural networks easier to train.
Explain how a Transformer performs a translation task.
-The Transformer first converts the input sentence into numbers with word embedding and adds positional encoding. The encoder then runs self-attention to capture the relationships among the words in the sentence. The decoder generates the translated sentence from the encoded information, using encoder-decoder attention to keep track of the relationship between the input and output sentences. Finally, residual connections let each subunit focus on its part of the problem so the translation task can be completed.
Why is positional encoding needed when a Transformer performs translation?
-Positional encoding provides the information needed to preserve the order of the words in a sentence. Because word order strongly affects a sentence's meaning, positional encoding helps the Transformer produce accurate translations.
What are the roles of the Transformer's encoder and decoder?
-The encoder converts the input sentence into a numerical encoding and captures the relationships among the words. The decoder generates the translated sentence from that encoding and keeps track of the relationship between the input sentence and the translation.
What are the advantages of the attention mechanism used in Transformers?
-The attention mechanism accurately captures how words in a sentence relate to one another, which gives high accuracy on translation and text-generation tasks. Attention can also be computed in parallel, allowing the Transformer to run fast.
What is backpropagation used for when training a Transformer?
-Backpropagation is used to optimize the neural network's weights. During training, it uses the difference between the model's predictions and the actual results to update the weights step by step and improve the model's performance.
Outlines
🤖 Transformer neural network basics
This section explains the basic workings of Transformer neural networks and their applications, for example how applications like ChatGPT use Transformers and how a Transformer translates a simple English sentence into Spanish, one step at a time.
📈 Turning words into numbers and positional encoding
This section explains word embedding, the method used to convert words into numbers, and positional encoding, which keeps track of word order. It also notes that reusing the same word embedding network for every input word and symbol gives the flexibility to handle input sentences of different lengths.
🔍 Self-attention and tracking word relationships
This section explains how a Transformer keeps track of the relationships among words. A mechanism called self-attention is used to understand how each word relates to the rest of the sentence and to encode each word accordingly.
🔄 How the encoder and decoder work
This section explains how the Transformer's encoder and decoder work. The encoder converts the input sentence into numbers, and the decoder generates the translated sentence from those numbers. The decoder, like the encoder, uses word embedding, positional encoding, and self-attention as the translation proceeds.
🔄 Encoder-decoder attention and translation
This section explains how the decoder keeps track of the relationship between the input and output sentences. Encoder-decoder attention ensures that significant words in the input are not ignored during translation, preserving accuracy.
📚 Extending and improving the Transformer
This section touches on extensions and improvements that adapt the Transformer to more complex data, for example normalizing values after each step, scaling the dot products, and adding extra neural networks with hidden layers to the encoder and decoder in order to handle larger vocabularies.
📖 StatQuest promotion
In this final section, the host points viewers to resources for learning statistics and machine learning and asks them to support the channel by subscribing, contributing on Patreon, buying merchandise, or donating.
Keywords
💡Transformer Neural Networks
💡Word Embedding
💡Positional Encoding
💡Self-Attention
💡Encoder-Decoder Attention
💡Backpropagation
💡Softmax Function
💡Residual Connections
💡Normalization
💡Dot Product
💡Fully Connected Layer
Highlights
Transformer neural networks are explained in detail, focusing on their ability to translate English sentences into Spanish.
Transformers are a type of neural network that can handle input and output values in the form of numbers, using word embeddings.
Word embeddings convert words into numbers, allowing neural networks to process linguistic data.
The process of converting words into numbers involves multiplying input values by weights and passing them through activation functions.
Positional encoding is used to maintain the order of words, which is crucial for understanding the meaning of sentences.
Positional encoding adds a set of numbers to the word embeddings that correspond to the word's position in the sentence.
Self-attention is a mechanism within Transformers that associates words with their context within the sentence.
Self-attention calculates similarities between each word and all other words in the sentence, including itself.
The softmax function is used to translate similarity scores into a distribution that represents the influence of each word on the encoding of a given word.
Transformers use residual connections to add the position-encoded word embeddings to the self-attention values, making complex networks easier to train.
The encoder-decoder architecture of Transformers allows for the translation of input phrases into output phrases in different languages.
Encoder-decoder attention helps the decoder keep track of significant words in the input sentence, ensuring accurate translation.
The decoder uses self-attention and encoder-decoder attention to generate the translated output, starting with the EOS token.
The translation process continues until the decoder outputs an EOS token, indicating the end of the translated sentence.
Transformers can be scaled to handle larger vocabularies and longer sentences by normalizing values and using scaled dot products for attention.
Additional neural networks with hidden layers can be added to the encoder and decoder to increase the model's complexity and improve its performance.
The original Transformer model used 37,000 tokens and demonstrated the ability to encode and decode long and complex phrases.
Transcripts
[Music]
translation it's done with a Transformer
StatQuest
hello I'm Josh Starmer and welcome to
StatQuest today we're going to talk
about Transformer neural networks and
they're going to be clearly explained
Transformers are more fun when you build
them in the cloud with lightning
bam
right now people are going bonkers about
something called chat GPT for example
our friend StatSquatch might type
something into chat GPT like
write an awesome song in the style of
StatQuest
translation it's done with a Transformer
anyway there's a lot to be said about
how chat GPT works but fundamentally it
is based on something called a
Transformer
so in this stat Quest we're going to
show you how a Transformer works one
step at a time
specifically we're going to focus on how
a Transformer neural network can
translate a simple English sentence
let's go into Spanish vamos
now since a Transformer is a type of
neural network and neural networks
usually only have numbers for input
values
the first thing we need to do is find a
way to turn the input and output words
into numbers
there are a lot of ways to convert words
into numbers but for neural networks one
of the most commonly used methods is
called word embedding the main idea of
word embedding is to use a relatively
simple neural network that has one input
for every word and symbol in the
vocabulary that you want to use in this
case we have a super simple vocabulary
that allows us to input short phrases
like let's go and to go
and we have an input for this symbol EOS
which stands for end of sentence or end
of sequence because the vocabulary can
be a mix of words word fragments and
symbols we call each input a token
the inputs are then connected to
something called an activation function
and in this example we have two
activation functions
and each connection multiplies the input
value by something called a weight
hey Josh where do these numbers come
from
great question Squatch and we'll answer
it in just a bit
for now let's just see how we convert
the word let's into numbers
first we put a 1 into the input for
let's and then put zeros into all of the
other inputs
now we multiply the inputs by their
weights on the connections to the
activation functions
for example the input for let's is one
so we multiply 1.87 by 1 to get 1.87
going to the activation function on the
left and we multiply 0.09 by 1 to get
0.09 going to the activation function on
the right
in contrast if the input value for the
word to is 0
then we multiply negative 1.45 by 0 to
get 0 going to the activation function
on the left
and we multiply 1.50 by 0 to get 0 going
to the activation function on the right
in other words when an input value is 0
then it only sends zeros to the
activation functions and that means to
go and the EOS symbol all just send
zeros to the activation functions
and only the weight values for let's end
up at the activation functions because
its input value is 1.
so in this case 1.87 goes to the
activation function on the left
and 0.09 goes to the activation function
on the right
in this example the activation functions
themselves are just identity functions
meaning the output values are the same
as the input values
in other words if the input value or
x-axis coordinate for the activation
function on the left is 1.87 then the
output value the y-axis coordinate will
also be 1.87
likewise because the input to the
activation function on the right is 0.09
the output is also 0.09
thus these output values 1.87 and 0.09
are the numbers that represent the word
let's bam
likewise if we want to convert the word
go into numbers
we set the input value for go to 1. and
all of the other inputs to zero
and we end up with negative 0.78 and
0.27 as the numbers that represent the
word go
and that is how we use word embedding to
convert our input phrase let's go into
numbers bam
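As a rough sketch of the idea just described, the word embedding step can be written as a one-hot input vector multiplied by a weight matrix; with identity activation functions the outputs are simply the weights connected to the token whose input is 1. The weights for let's, to, and go below are the numbers from the video, while the EOS row and all names are illustrative placeholders.

```python
import numpy as np

# Toy vocabulary of tokens (words, word fragments, and the EOS symbol)
vocab = ["let's", "to", "go", "<EOS>"]

# One weight per (token, activation) connection; 2 embedding values per token.
# The EOS row is made up -- in practice all of these come from training.
embedding_weights = np.array([
    [ 1.87,  0.09],   # let's
    [-1.45,  1.50],   # to
    [-0.78,  0.27],   # go
    [ 0.30, -0.60],   # <EOS> (placeholder)
])

def embed(token):
    """One-hot input times the weights, passed through identity activations."""
    one_hot = np.zeros(len(vocab))
    one_hot[vocab.index(token)] = 1.0
    return one_hot @ embedding_weights   # identity activation: output = input

print(embed("let's"))  # [1.87 0.09]
print(embed("go"))     # [-0.78  0.27]
```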
note there is a lot more to say about
word embedding so if you're interested
check out the quest
also note before we move on I want to
point out two things
first we reuse the same word embedding
Network for each input word or symbol
in other words the weights in the
network for let's
are the exact same as the weights in the
network for go
this means that regardless of how long
the input sentence is we just copy and
use the exact same word embedding
Network for each word or symbol
and this gives us flexibility to handle
input sentences with different lengths
the second thing I want to mention is
that all of these weights and all of the
other weights we're going to talk about
in this Quest are determined using
something called back propagation
to get a sense of what back propagation
does let's imagine we had this data and
we wanted to fit a line to it
back propagation would start with a line
that has a random value for the y-axis
intercept and a random value for the
slope
and then using an iterative process back
propagation would change the y-axis
intercept and slope one step at a time
until it found the optimal values
likewise in the context of neural
networks each weight starts out as a
random number
but when we train the Transformer with
English phrases and known Spanish
translations
back propagation optimizes these values
one step at a time and results in these
final weights
also just to be clear the process of
optimizing the weights is also called
training bam
note there is a lot more to be said
about training and back propagation so
if you're interested check out the
quests
now because the word embedding networks
are taking up the whole screen let's
shrink them down and put them in the
corner okay and now that we know how to
convert words into numbers let's talk
about word order
for example if Norm said Squatch eats
pizza then Squatch might say yum
in contrast if Norm said pizza eats
Squatch then Squatch might say yikes
so these two phrases
Squatch eats Pizza
and pizza eats Squatch use the exact same
words but have very different meanings
so keeping track of word order is super
important so let's talk about positional
encoding which is a technique that
Transformers use to keep track of word
order
we'll start by showing how to add
positional encoding to the first phrase
Squatch eats Pizza
note there are a bunch of ways to do
positional encoding but we're just going
to talk about one popular method
that said the first thing we do is
convert the words Squatch eats pizza into
numbers using word embedding
in this example we've got a new
vocabulary and we're creating four word
embedding values per word
however in practice people often create
hundreds or even thousands of embedding
values per word
now we add a set of numbers that
correspond to word order to the
embedding values for each word hey Josh
where do the numbers that correspond to
word order come from
in this case the numbers that represent
the word order come from a sequence of
alternating sine and cosine squiggles
each squiggle gives a specific position
values for each word's embeddings
for example the y-axis values on the
green squiggle give us position encoding
values for the first embeddings for each
word
specifically for the first word which
has an x-axis coordinate all the way to
the left of the green squiggle the
position value for the first embedding
is the y-axis coordinate zero
the position value for the second
embedding comes from the orange squiggle
and the y-axis coordinate on the orange
squiggle that corresponds to the first
word is one
likewise the blue squiggle which is more
spread out than the first two squiggles
gives us the position value for the
third embedding value which for the
first word is zero
lastly the red squiggle gives us the
position value for the fourth embedding
which for the first word is one
thus the position values for the first
word come from the corresponding y-axis
coordinates on the squiggles
now to get the position values for the
second word we simply use the y-axis
coordinates on the squiggles that
correspond to the x-axis coordinate for
the second word
lastly to get the position values for
the third word we use the y-axis
coordinates on the squiggles that
correspond to the x-axis coordinate for
the third word
note because the sine and cosine
squiggles are repetitive it's possible
that two words might get the same
position or y-axis values
for example the second and third words
both got negative 0.9 for the first
position value
however because the squiggles get wider
for larger embedding positions and the
more embedding values we have then the
wider the squiggles get
then even with a repeat value here and
there we end up with a unique sequence
of position values for each word
thus each input word ends up with a
unique sequence of position values
now all we have to do is add the
position values to the embedding values
and we end up with the word embeddings
plus positional encoding for the whole
sentence Squatch eats Pizza
yum
note if we reverse the order of the
input words to be Pizza eats Squatch then
the embeddings for the first and third
words get swapped but the positional
values for the first second and third
word stay the same and when we add the
positional values to the embeddings
we end up with new positional encoding
for the first and third words and the
second word since it didn't move stays
the same
thus positional encoding allows a
Transformer to keep track of word order
bam
now let's go back to our simple example
where we are just trying to translate
the English sentence let's go
and add position values to the word
embeddings
the first embedding for the first word
Let's gets zero and the second embedding
gets one and the first embedding for the
second word go gets a negative 0.9 and
the second embedding gets 0.4
and now we just do the math to get the
positional encoding for both words bam
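A minimal sketch of this kind of sine/cosine positional encoding, written in the style of the original Transformer paper's formula; the function name and sizes are illustrative, and the exact numbers it produces differ from the simplified squiggles in the video (though the first word still gets 0 and 1).

```python
import numpy as np

def positional_encoding(num_positions, d_model):
    """Alternating sine and cosine squiggles; the squiggles get wider
    for larger embedding positions so each word gets a unique pattern."""
    pos = np.arange(num_positions)[:, None]            # word positions 0, 1, 2, ...
    i = np.arange(d_model)[None, :]                    # embedding index 0 .. d_model-1
    angle = pos / (10000 ** (2 * (i // 2) / d_model))  # wider squiggles as i grows
    enc = np.zeros((num_positions, d_model))
    enc[:, 0::2] = np.sin(angle[:, 0::2])              # even indices: sine
    enc[:, 1::2] = np.cos(angle[:, 1::2])              # odd indices: cosine
    return enc

embeddings = np.array([[ 1.87, 0.09],    # "let's"
                       [-0.78, 0.27]])   # "go"
# Add the position values to the embedding values
encoded = embeddings + positional_encoding(num_positions=2, d_model=2)
print(encoded)
```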
now because we're going to need all the
space we can get let's consolidate the
math in the diagram and let the sine and
cosine and plus symbols represent the
positional encoding
now that we know how to keep track of
each word's position let's talk about
how a Transformer keeps track of the
relationships among words
for example if the input sentence was
this the pizza came out of the oven and
it tasted good then this word it could
refer to pizza or potentially it could
refer to the word oven Josh I've heard
of good tasting pizza but never a good
tasting oven I know Squatch that's why
it's important that the Transformer
correctly Associates the word it with
pizza the good news is that Transformers
have something called self-attention
which is a mechanism to correctly
associate the word it with the word
Pizza
in general terms self-attention works by
seeing how similar each word is to all
of the words in the sentence including
itself for example self-attention
calculates the similarity between the
first word the and all of the words in
the sentence including itself
and self-attention calculates these
similarities for every word in the
sentence
once the similarities are calculated
they are used to determine how the
Transformer encodes each word
for example if you looked at a lot of
sentences about pizza and the word it
was more commonly associated with pizza
than oven
then the similarity score for pizza will
cause it to have a larger impact on how
the word it is encoded by the
Transformer
bam
and now that we know the main ideas of
how self-attention Works let's look at
the details
so let's go back to our simple example
where we had just added positional
encoding to the words let's and go
the first thing we do is multiply the
position encoded values for the word
let's by a pair of weights and we add
those products together to get Negative
1.0
then we do the same thing with a
different pair of weights to get 3.7
we do this twice because we started out
with two position encoded values that
represent the word let's and after doing
the math two times we still have two
values representing the word let's
Josh I don't get it
if we want two values to represent let's
why don't we just use the two values we
started with
that's a great question Squatch and
we'll answer it in a little bit grr
anyway for now just know that we have
these two new values to represent the
word let's and in Transformer
terminology we call them query values
and now that we have query values for
the word let's use them to calculate the
similarity between itself and the word
go
and we do this by creating two new
values just like we did for the query to
represent the word let's
and we create two new values to
represent the word go
both sets of new values are called key
values
and we use them to calculate
similarities with the query for let's
one way to calculate similarities
between the query and the keys is to
calculate something called a DOT product
for example in order to calculate the
dot product similarity between the query
and key for let's
we simply multiply each pair of numbers
together
and then add the products to get 11.7
likewise we can calculate the dot
product similarity between the query for
let's and the key for go
by multiplying the pairs of numbers
together
and adding the products to get negative
2.6
the relatively large similarity value
for let's relative to itself
11.7 compared to the relatively small
value for let's relative to the word go
negative 2.6
tells us that let's is much more similar
to itself than it is to the word go
that said if you remember the example
where the word it could relate to pizza
or oven
the word it should have a relatively
large similarity value with respect to
the word Pizza since it refers to pizza
and not oven
note there's a lot to be said about
calculating similarities in this context
and the dot product so if you're
interested check out the quests anyway
since let's is much more similar to
itself than it is to the word go
then we want let's to have more
influence on its encoding than the word
go
and we do this by first running the
similarity scores through something
called a softmax function
the main idea of a softmax function is
that it preserves the order of the input
values from low to high and translates
them into numbers between 0 and 1 that
add up to one
so we can think of the output of the
softmax function as a way to determine
what percentage of each input word we
should use to encode the word let's
in this case because let's is so much
more similar to itself than the word go
we'll use one hundred percent of the
word let's to encode let's
and zero percent of the word go to
encode the word let's
note there's a lot more to be said about
the softmax function so if you're
interested check out the quest
anyway because we want 100% of the word
let's to encode let's
we create two more values that we'll
cleverly call values to represent the
word let's
and scale the values that represent
let's by 1.0
then we create two values to represent
the word go
and scale those values by 0.0
lastly we add the scaled values together
and these sums which combine separate
encodings for both input words let's and
go relative to their similarity to Let's
are the self-attention values for let's
bam
now that we have self-attention values
for the word let's it's time to
calculate them for the word go
the good news is that we don't need to
recalculate the keys and values instead
all we need to do is create the query
that represents the word go
and do the math
by first calculating the similarity
scores between the new query and the
keys and then run the similarity scores
through a softmax
and then scale the values
and then add them together
and we end up with the self-attention
values for go
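Here is a compact sketch of the self-attention computation just walked through: queries, keys, and values come from the position-encoded values via shared weights, similarities are dot products, a softmax turns them into percentages, and those percentages scale and sum the values. The position-encoded inputs are the video's numbers for let's and go; the weight matrices are random stand-ins rather than the trained weights.

```python
import numpy as np
rng = np.random.default_rng(0)

# Position-encoded values for the two input tokens, "let's" and "go"
x = np.array([[ 1.87, 1.09],
              [-1.68, 0.67]])

d = 2
# One shared set of weights each for queries, keys, and values,
# reused for every input token (random stand-ins here).
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

Q = x @ W_q                       # queries for every token at once
K = x @ W_k                       # keys
V = x @ W_v                       # values
similarities = Q @ K.T            # dot-product similarity of each query with each key
weights = softmax(similarities)   # what percentage of each word to use
self_attention = weights @ V      # scale the values and add them up
print(self_attention)             # one row of self-attention values per token
```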
note before we move on I want to point
out a few details about self-attention
first the weights that we use to
calculate the self-attention queries are
the exact same for let's and go
in other words this example uses one set
of weights for calculating
self-attention queries regardless of how
many words are in the input
likewise we reuse the sets of weights
for calculating self-attention keys and
values for each input word
this means that no matter how many words
are input into the Transformer
we just reuse the same sets of weights
for self-attention queries keys and
values
the other thing I want to point out is
that we can calculate the queries keys
and values for each word at the same
time
in other words we don't have to
calculate the query key and value for
the first word first before moving on to
the second word
and because we can do all of the
computation at the same time
Transformers can take advantage of
parallel Computing and run fast
now that we understand the details of
how self-attention Works let's shrink it
down so we can keep building our
Transformer bam Josh you forgot
something if we want two values to
represent let's why don't we just use
the two position encoded values we
started with
first the new self-attention values for
each word contain input from all of the
other words and this helps give each
word context and this can help establish
how each word in the input is related to
the others
also if we think of this unit with its
weights for calculating queries keys and
values as a self-attention cell
then in order to correctly establish how
words are related in complicated
sentences and paragraphs
we can create a stack of self-attention
cells each with its own sets of Weights
that we apply to the position encoded
values for each word to capture
different relationships among the words
in the manuscript that first describes
Transformers they stacked eight
self-attention cells and they called
this multi-head attention
why eight instead of 12 or 16 I have no
idea
bam
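A hedged sketch of stacking self-attention cells (multi-head attention): several heads, each with its own query/key/value weights, are applied to the same position-encoded values, and their outputs are combined. The concatenation plus final learned projection follows the original paper; all weights and sizes below are placeholders.

```python
import numpy as np
rng = np.random.default_rng(1)

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    return softmax(Q @ K.T) @ V

d, n_heads = 2, 8                      # the original Transformer stacked 8 heads
x = rng.normal(size=(2, d))            # position-encoded values for 2 tokens (stand-ins)

heads = []
for _ in range(n_heads):               # each self-attention cell has its own weights
    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
    heads.append(attention_head(x, W_q, W_k, W_v))

W_o = rng.normal(size=(n_heads * d, d))          # final projection, as in the paper
multi_head = np.concatenate(heads, axis=-1) @ W_o
print(multi_head.shape)                          # back to (2 tokens, d values per token)
```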
okay going back to our simple example
with only one self-attention cell
there's one more thing we need to do to
encode the input
we take the position encoded values
and add them to the self-attention
values these bypasses are called
residual connections and they make it
easier to train complex neural networks
by allowing the self-attention layer to
establish relationships among the input
words without having to also preserve
the word embedding and positional
encoding information
bam
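Putting the encoder pieces together, a residual connection is just an addition: the position-encoded values bypass the self-attention layer and are added to its output. A minimal, self-contained sketch (all numbers are random stand-ins):

```python
import numpy as np
rng = np.random.default_rng(2)

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    return softmax(Q @ K.T) @ V

d = 2
pos_encoded = rng.normal(size=(2, d))   # word embeddings + positional encoding (stand-ins)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

# Residual connection: add the bypassed position-encoded values to the
# self-attention output, so self-attention only has to learn relationships
# among the words, not also preserve the embedding and position information.
encoded = pos_encoded + self_attention(pos_encoded, W_q, W_k, W_v)
print(encoded)   # the encoder output for this simple Transformer
```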
and that's all we need to do to encode
the input for this simple Transformer
double bam note this simple Transformer
only contains the parts required for
encoding the input word embedding
positional encoding self-attention and
residual connections
these four features allow the
Transformer to encode words into numbers
encode the positions of the words encode
the relationships among the words and
relatively easily and quickly train in
parallel that said there are lots of
extra things we can add to a Transformer
and we'll talk about those at the end of
this Quest bam so now that we've encoded
the English input phrase let's go it's
time to decode it into Spanish
in other words the first part of a
transformer is called an encoder and now
it's time to create the second part A
decoder
the decoder just like the encoder starts
with word embedding however this time we
create embedding values for the output
vocabulary which consists of the Spanish
words ir vamos y and the EOS end of
sequence token
now because we just finished encoding
the English sentence let's go
the decoder starts with embedding values
for the EOS token in this case we're
using the EOS token to start the
decoding because that is a common way to
initialize the process of decoding the
encoded input sentence
however sometimes you'll see people use
SOS for start of sentence or start of
sequence to initialize the process
Josh starting with SOS makes more sense
to me then you can do it that way
Squatch I'm just saying a lot of people
start with EOS
anyway we plug in 1 for Eos and zero for
everything else
and do the math
and we end up with 2.70 and negative
1.34 as the numbers that represent the
EOS token bam
now let's shrink the word embedding down
to make more space
so that we can add the positional
encoding
note these are the exact same sine and
cosine squiggles that we used when we
encoded the input and since the EOS
token is in the first position with two
embeddings we just add those two
position values
and we get 2.70 and negative 0.34 as the
position and word embedding values
representing the EOS token bam now let's
consolidate the math in the diagram
and before we move on to the next step
let's review a key concept from when we
encoded the input
one key concept from earlier was that we
created a single unit to process an
input word
and then we just copied that unit for
each word in the input
and if we had more words we just make
more copies of the same unit
by creating a single unit that can be
copied for each input word the
Transformer can do all of the
computation for each word in the input
at the same time
for example we can calculate the word
embeddings on different processors at
the same time
and then add the positional encoding at
the same time
and then calculate the queries keys and
values at the same time
and once that is done we can calculate
the self-attention values at the same
time
and lastly we can calculate the residual
connections at the same time
doing all of the computations at the
same time rather than doing them
sequentially for each word
means we can process a lot of words
relatively quickly on a chip with a lot
of computing cores like a GPU Graphics
Processing Unit or multiple chips in the
cloud
well likewise when we decode and
translate the input we want a single
unit that we can copy for each
translated word for the same reasons we
want to do the math quickly
so even though we're only processing the
EOS token so far we add a self-attention
layer so that ultimately we can keep
track of related words in the output
now that we have the query key and value
numbers for the EOS token we calculate
its self-attention values just like before
and the self-attention values for the
EOS token are negative 2.8 and negative
2.3 note the sets of Weights we used to
calculate the decoder self-attention
query key and value are different from
the sets we used in the encoder
now let's consolidate the math and add
residual connections just like before
bam
now so far we've talked about how
self-attention helps the Transformer
keep track of how words are related
within a sentence
however since We're translating a
sentence we also need to keep track of
the relationships between the input
sentence and the output
for example if the input sentence was
don't eat the delicious looking and
smelling pizza then when translating
it's super important to keep track of
the very first word don't
if the translation focuses on other
parts of the sentence and omits the
don't then we'll end up with
eat the delicious looking and smelling
Pizza
and these two sentences have completely
opposite meanings
so it's super important for the decoder
to keep track of the significant words
in the input
so the main idea of encoder decoder
attention is to allow the decoder to
keep track of the significant words in
the input
now that we know the main idea behind
encoder decoder attention here are the
details
first to give us a little more room
let's consolidate the math and the
diagrams
now just like we did for self-attention
we create two new values to represent
the query for the EOS token in the
decoder then we create keys for each
word in the encoder and we calculate the
similarities between the EOS token in
the decoder and each word in the encoder
by calculating the dot products just
like before
then we run the similarities through a
softmax function and this tells us to
use one hundred percent of the first
input word and zero percent of the
second when the decoder determines what
should be the first translated word
now that we know what percentage of each
input word to use when determining what
should be the first translated word
we calculate values for each input word
and then scale those values by the
softmax percentages
and then add the pairs of scaled values
together to get the encoder decoder
attention values
bam
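A rough sketch of encoder-decoder attention under the same illustrative setup: the query comes from the token in the decoder, while the keys and values come from the encoder's output for each input word, using a separate set of weights from self-attention. All numbers are stand-ins.

```python
import numpy as np
rng = np.random.default_rng(3)

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d = 2
encoder_output = rng.normal(size=(2, d))  # encoder values for the 2 input words ("let's", "go")
decoder_state  = rng.normal(size=(1, d))  # values representing the EOS token in the decoder

# Encoder-decoder attention has its own sets of weights,
# different from the self-attention weights.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

Q = decoder_state @ W_q          # query from the decoder token
K = encoder_output @ W_k         # keys from every input word
V = encoder_output @ W_v         # values from every input word

percentages = softmax(Q @ K.T)         # how much of each input word to use
enc_dec_attention = percentages @ V    # scaled values added together
print(enc_dec_attention)
```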
now to make room for the next step let's
consolidate the encoder decoder
attention in our diagram
note the sets of Weights that we use to
calculate the queries keys and values
for encoder decoder attention are
different from the sets of Weights we
use for self-attention
however just like for self-attention the
sets of Weights are copied and reused
for each word
this allows the Transformer to be
flexible with the length of the inputs
and outputs
and also we can stack encoder-decoder
attention just like we can stack
self-attention to keep track of words in
complicated phrases bam
now we add another set of residual
connections
that allow the encoder decoder attention
to focus on the relationships between
the output words and the input words
without having to preserve the
self-attention or word and position
encoding that happened earlier then we
consolidate the math and the diagram
lastly we need a way to take these two
values that represent the EOS token in
the decoder and select one of the four
output tokens ir vamos y or EOS
so we run these two values through a
fully connected layer that has one input
for each value that represents the
current token so in this case we have
two inputs
and one output for each token in the
output vocabulary which in this case
means four outputs
note a fully connected layer is just a
simple neural network with weights
numbers we multiply the inputs by
and biases numbers we add to the sums of
the products
and when we do the math we get four
output values
which we run through a final softmax
function to select the first output word
vamos
bam
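The final step just described is an ordinary fully connected layer (weights plus biases) followed by a softmax over the output vocabulary. A minimal sketch with made-up numbers:

```python
import numpy as np
rng = np.random.default_rng(4)

output_vocab = ["ir", "vamos", "y", "<EOS>"]

decoder_values = rng.normal(size=(2,))        # the 2 values representing the current token
W = rng.normal(size=(2, len(output_vocab)))   # one input per value, one output per token
b = rng.normal(size=(len(output_vocab),))     # biases added to the sums of the products

logits = decoder_values @ W + b               # fully connected layer
probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # final softmax over the vocabulary
print(output_vocab[int(np.argmax(probs))])    # the selected output token
```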
note vamos is the Spanish translation
for Let's Go triple bam no not yet
so far the translation is correct but
the decoder doesn't stop until it
outputs an EOS token so let's
consolidate our diagrams
and plug the translated word vamos into
a copy of the decoder's embedding layer
and do the math
first we get the word embeddings for
vamos
then we add the positional encoding
now we calculate self-attention values
for vamos using the exact same weights
that we used for the EOS token
now add the residual connections
and calculate the encoder decoder
attention using the same sets of Weights
that we used for the EOS token
now we add more residual connections
lastly we run the values that represent
vamos through the same fully connected
layer and softmax that we used for the
EOS token and the second output from the
decoder is eos so we are done decoding
triple bam
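The overall decoding procedure is an autoregressive loop: the decoder starts from the EOS token, and each generated token is fed back in until the decoder outputs EOS again. A schematic sketch; `decoder_step` is a hypothetical stand-in for all of the embedding, attention, and output-layer math above.

```python
# Schematic only: decoder_step stands in for the real embedding +
# self-attention + encoder-decoder attention + fully connected output layer.
def greedy_decode(decoder_step, encoder_output, max_len=20):
    tokens = ["<EOS>"]                     # decoding is initialized with the EOS token
    while len(tokens) < max_len:
        next_token = decoder_step(tokens, encoder_output)
        tokens.append(next_token)
        if next_token == "<EOS>":          # stop once the decoder outputs EOS again
            break
    return tokens[1:]                      # e.g. ["vamos", "<EOS>"] for "let's go"
```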
at long last we've shown how a
Transformer can encode a simple input
phrase let's go
and decode the encoding into the
translated phrase of vamos
in summary
Transformers use word embedding to
convert words into numbers
positional encoding to keep track of
word order
self-attention to keep track of word
relationships within the input and
output phrases
encoder decoder attention to keep track
of things between the input and output
phrases to make sure that important
words in the input are not lost in the
translation
and residual connections to allow each
subunit like self-attention to focus on
solving just one part of the problem
now that we understand the main ideas of
how Transformers work let's talk about a
few extra things we can add to them
in this example we kept things super
simple
however if we had larger vocabularies
and the original Transformer had 37 000
tokens and longer input and output
phrases
then in order to get their model to work
they had to normalize the values after
every step
for example they normalize the values
after positional encoding and after
self-attention in both the encoder and
the decoder also when we calculated
attention values we used the dot product
to calculate the similarities but you
can use whatever similarity function you
want
in the original Transformer manuscript
they calculated the similarities with a
DOT product divided by the square root
of the number of embedding values per
token just like with scaling the values
after each step they found that scaling
the dot product helped encode and decode
long and complicated phrases
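That scaling is just a division inside the attention step, a one-line change to the earlier sketches, shown here for clarity:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Similarities are dot products divided by the square root of the
    number of embedding values per token, as in the original manuscript."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax
    return weights @ V
```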
lastly to give a Transformer more
weights and biases to fit to complicated
data you can add additional neural
networks with hidden layers to both the
encoder and decoder bam
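A minimal sketch of that kind of add-on: a small fully connected network with one hidden layer applied to each token's values. The ReLU activation, the wider hidden layer, and the extra residual connection around it follow the original paper's position-wise feed-forward sublayer; the sizes and numbers here are placeholders.

```python
import numpy as np
rng = np.random.default_rng(5)

d_model, d_hidden = 2, 8                   # hidden layer wider than the token representation
W1, b1 = rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d_model)), np.zeros(d_model)

def feed_forward(x):
    """Extra neural network with a hidden layer, applied to each token's values."""
    hidden = np.maximum(0, x @ W1 + b1)    # ReLU hidden layer
    return x + (hidden @ W2 + b2)          # with its own residual connection (per the paper)

tokens = rng.normal(size=(2, d_model))     # values for 2 tokens after attention (stand-ins)
print(feed_forward(tokens))
```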
now it's time for some
Shameless self-promotion if you want to
review statistics and machine learning
offline check out the statquest PDF
study guides and my book The StatQuest
Illustrated Guide to Machine Learning at
statquest.org there's something for
everyone
hooray we've made it to the end of
another exciting StatQuest if you like
this StatQuest and want to see more
please subscribe and if you want to
support StatQuest consider contributing
to my patreon campaign becoming a
channel member buying one or two of my
original songs or a t-shirt or a hoodie
or just donate the links are in the
description below alright until next
time Quest on