Transformer Neural Networks, ChatGPT's foundation, Clearly Explained!!!

StatQuest with Josh Starmer
23 Jul 2023 · 36:15

Summary

TLDR: This video explains in detail how Transformer neural networks work. Josh Starmer walks through, step by step, how a Transformer translates a simple English sentence into Spanish. Word embedding converts the words into numbers, and positional encoding keeps track of word order. Self-attention and encoder-decoder attention capture the relationships among words, and residual connections let each subunit focus on its own part of the problem. By combining these techniques, the Transformer can translate the input phrase accurately, keeping track of the relationship between the input and output phrases without ignoring important words.

Takeaways

  • 🤖 Transformer neural networks are widely used for natural language processing tasks.
  • 📈 Word embedding is a technique that converts words into numbers, which are used as the inputs to the neural network.
  • 📊 Positional encoding is used to keep track of the order of the words in a sentence.
  • 🔍 Self-attention is used to capture the relationships among the words within a sentence.
  • 🔄 Encoder-decoder attention keeps track of the relationship between the input and output sentences, improving translation quality.
  • 🔧 Residual connections are used to make complex neural networks easier to train.
  • 🔢 Transformers are designed to take advantage of parallel computing and run fast.
  • 📚 By stacking multiple self-attention cells, a Transformer model can capture the relationships among words in complicated sentences and paragraphs.
  • 📈 The training process uses backpropagation to determine the optimal weights.
  • 🔧 To fit more complicated data, a Transformer can include additional neural networks with hidden layers in the encoder and decoder.
  • 📝 The original Transformer model handled a very large vocabulary (37,000 tokens) and long input and output phrases.

Q & A

  • What is a Transformer neural network?

    -A Transformer neural network is a type of neural network used for natural language processing tasks, with applications such as translation and text generation.

  • What is word embedding?

    -Word embedding is a technique that converts words and symbols into numbers, a format that a neural network can work with. This makes it possible to turn an input sentence into a sequence of numbers.
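
The weight values for "let's", "to", and "go" below are the ones shown in the video; the rest of the setup (the <EOS> row and the vocabulary as a Python list) is a minimal illustrative sketch, not the video's actual code.

```python
import numpy as np

vocab = ["let's", "to", "go", "<EOS>"]

# One row of weights per token, two embedding values per token.
# The first three rows use the weights shown in the video; the <EOS> row is a placeholder.
embedding_weights = np.array([
    [ 1.87,  0.09],   # let's
    [-1.45,  1.50],   # to
    [-0.78,  0.27],   # go
    [ 0.50, -0.30],   # <EOS> (made up)
])

def embed(token):
    # A one-hot input times the weight matrix just picks out that token's row.
    one_hot = np.zeros(len(vocab))
    one_hot[vocab.index(token)] = 1.0
    return one_hot @ embedding_weights

print(embed("let's"))  # [ 1.87  0.09]
print(embed("go"))     # [-0.78  0.27]
```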

  • What is the purpose of positional encoding?

    -Positional encoding is used to keep track of the order of the words in a sentence. It lets the Transformer retain information about each word's position so that it can correctly interpret the sentence's meaning.
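
A minimal sketch of one popular positional encoding scheme, the alternating sine and cosine "squiggles" described in the video. The exact frequencies below follow the common formulation from the original Transformer paper and are an assumption here (the video shows the squiggles but not a formula), so the numbers will not exactly match the ones shown on screen.

```python
import numpy as np

def positional_encoding(num_positions, d_model):
    # pe[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    # pe[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    pe = np.zeros((num_positions, d_model))
    positions = np.arange(num_positions)[:, None]        # one row per word position
    div = 10000 ** (np.arange(0, d_model, 2) / d_model)  # later squiggles get wider
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

# Two words, two embedding values per word; the position values are simply
# added to the word embeddings.
word_embeddings = np.array([[ 1.87, 0.09],    # "let's"
                            [-0.78, 0.27]])   # "go"
position_encoded = word_embeddings + positional_encoding(2, 2)
```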

  • What is self-attention?

    -Self-attention is one of the Transformer's core mechanisms; it captures how strongly each word in a sentence is related to the other words, including itself. This helps the model work out which parts of the sentence each word refers to and choose appropriate words during translation or text generation.
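
A minimal sketch of the self-attention computation summarized above: each word gets a query, a key, and a value; dot products between queries and keys give similarity scores; a softmax turns the scores into percentages; and those percentages weight the values. The position-encoded inputs follow the numbers used in the video ("let's" = 1.87 + 0 and 0.09 + 1; "go" = -0.78 - 0.9 and 0.27 + 0.4), but the query/key/value weight matrices are random placeholders, not the trained weights from the video.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    q = x @ w_q                # queries, one row per word
    k = x @ w_k                # keys
    v = x @ w_v                # values
    scores = q @ k.T           # dot-product similarity of every word with every word
    weights = softmax(scores)  # what percentage of each word to use
    return weights @ v         # weighted sum of the values = self-attention output

x = np.array([[ 1.87, 1.09],   # "let's" + positional encoding
              [-1.68, 0.67]])  # "go"    + positional encoding
rng = np.random.default_rng(0)
w_q, w_k, w_v = (rng.normal(size=(2, 2)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v))
```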

  • What are the encoder and the decoder?

    -The encoder is the part that converts the input sentence into a numerical encoding, and the decoder is the part that generates the translated sentence from that encoding. A Transformer combines the two to carry out the translation task.

  • What are residual connections?

    -Residual connections are a mechanism that lets each Transformer subunit (for example, self-attention) focus on solving its specific part of the problem while the information from earlier stages is carried forward around it. This makes complex neural networks much more efficient to train.
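
In code, a residual connection is just an addition: the input to a subunit is passed around it and added to the subunit's output, so the subunit only has to learn its own piece of the problem. A minimal sketch (self_attention stands in for any subunit, such as the one sketched above):

```python
def with_residual(sublayer, x):
    # Bypass (residual) connection: add the input back onto the sublayer's output.
    return x + sublayer(x)

# e.g. encoded = with_residual(lambda x: self_attention(x, w_q, w_k, w_v), x)
```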

  • How does a Transformer carry out a translation task?

    -The Transformer first converts the input sentence into numbers with word embedding and adds positional encoding. Next, the encoder applies self-attention to capture the relationships among the words in the input. The decoder then generates the translated sentence from the encoded information, using encoder-decoder attention to keep track of the relationship between the input and output sentences. Residual connections throughout let each subunit focus on its own part of the problem as the translation is carried out.

  • Why is positional encoding necessary when a Transformer performs translation?

    -Positional encoding provides the information needed to preserve the order of the words in a sentence. Because word order has a large effect on a sentence's meaning, positional encoding is what allows the Transformer to produce an accurate translation.

  • What are the roles of the Transformer's encoder and decoder?

    -The encoder converts the input sentence into a numerical encoding and captures the relationships among the input words. The decoder generates the translated sentence from that encoding and keeps track of the relationship between the input sentence and the translation.

  • What are the advantages of the attention mechanism used in Transformers?

    -The attention mechanism captures the relationships among words precisely, which makes high accuracy possible on translation and text-generation tasks. Attention can also be computed in parallel, which lets Transformers run fast.

  • What is backpropagation used for during Transformer training?

    -Backpropagation is used to optimize the neural network's weights. During training, it uses the difference between the model's predictions and the known answers to update the weights step by step, improving the model's performance.
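
To make this concrete, here is a minimal sketch of the line-fitting example used in the video: start with random values for the intercept and slope, then repeatedly nudge them to reduce the squared error. The data points and learning rate are made up; backpropagation in a real Transformer applies the same idea to all of the network's weights.

```python
import numpy as np

# Made-up data we want to fit a line to.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 1.9, 3.2, 3.9])

rng = np.random.default_rng(0)
intercept, slope = rng.normal(size=2)   # start with random values
lr = 0.01                               # step size

for _ in range(5000):
    error = (intercept + slope * x) - y
    intercept -= lr * 2 * error.mean()        # gradient of the mean squared error
    slope -= lr * 2 * (error * x).mean()      # with respect to each parameter

print(intercept, slope)   # close to the best-fitting intercept and slope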

Outlines

00:00

🤖 Transformer neural network basics

This section introduces the basic mechanics of Transformer neural networks and what they are used for: for example, how applications like ChatGPT are built on Transformers, and how a Transformer translates a simple English sentence into Spanish, explained step by step.

05:00

📈 Turning words into numbers and positional encoding

This section explains word embedding, the method used to convert words into numbers, and positional encoding, which keeps track of word order. It also covers how reusing the same word embedding network for every input word or symbol gives the flexibility to handle input sentences of different lengths.

10:01

🔍 Self-attention and tracking word relationships

This section explains how a Transformer keeps track of the relationships among words. Using a mechanism called self-attention, it works out how each word relates to the other words in the sentence and uses those relationships to encode each word.

15:01

🔄 How the encoder and decoder work

This section describes how the Transformer's encoder and decoder operate. The encoder converts the input sentence into numbers, and the decoder generates the translated sentence from those numbers. The decoder, like the encoder, uses word embedding, positional encoding, and self-attention as the translation proceeds.

20:02

🔄 Encoder-decoder attention and translation

This section explains how the decoder keeps track of the relationship between the input and output sentences. Encoder-decoder attention makes sure that important words in the input are not ignored during translation, which keeps the translation accurate.

25:05

📚 Extending and improving the Transformer

This section covers extensions that let a Transformer handle more complex data: for example, normalizing the values after each step, scaling the dot products, and adding extra neural networks with hidden layers to the encoder and decoder.

30:07

📖 StatQuest promotion

In this final section, the host points viewers to resources for learning statistics and machine learning and asks for support through channel subscriptions, Patreon, merchandise purchases, or donations.

Keywords

💡Transformer Neural Networks

Transformer neural networks are an advanced architecture used for natural language processing tasks. This video explains in detail how a Transformer translates a simple English sentence into Spanish. A Transformer converts words into numbers and then keeps track of word order and word relationships using word embedding, positional encoding, self-attention, and encoder-decoder attention.

💡Word Embedding

Word embedding is a technique for converting natural language into numbers that a computer can work with. The video explains how the Transformer converts each word into numbers, calls each word or symbol a token, and treats the resulting numbers as input values. For example, to translate the phrase "let's go", each word is converted into numbers and fed into the neural network.

💡Positional Encoding

Positional encoding is a technique that lets a neural network understand the order of the words in a sentence. In a Transformer, numbers representing each word's position are added to the word's embedding so that word order is preserved. For example, for the phrase "Squatch eats pizza", positional encoding values are computed for each word and added to its embeddings to keep track of the order.

💡Self-Attention

Self-attention is one of the Transformer's core mechanisms and is used to capture the relationships among the words in a sentence. It compares each word with every other word (and with itself) and computes similarity scores that measure how related the words are. For example, if the word "it" is more strongly associated with "pizza" than with "oven", that similarity score strengthens the influence of "pizza" when the Transformer encodes "it".

💡Encoder-Decoder Attention

Encoder-decoder attention is used in translation tasks to keep track of the relationship between the input and output sentences. It lets the decoder consider how important each input word is when generating each output word, so the translation accurately reflects the meaning of the original. For example, the decoder has to keep proper track of the word "don't" so that it is not dropped from the translation.

💡Backpropagation

Backpropagation is the algorithm used to train neural networks. The video explains that when the Transformer learns from English phrases and their known Spanish translations, backpropagation is used to optimize the network's weights. The process starts with random weight values and adjusts them iteratively until it finds good values.

💡Softmax Function

The softmax function is widely used in neural networks to convert a set of input values into values between 0 and 1 that add up to 1, like a probability distribution. In the video, the similarity scores computed by self-attention are run through a softmax to decide how much influence each word should have on a given word's encoding. For example, because "let's" is far more similar to itself than to "go", the softmax assigns essentially 100% of the influence to "let's" and 0% to "go" when encoding "let's".
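
Using the similarity scores from the video (11.7 for "let's" with itself, -2.6 for "let's" with "go"), a quick check shows why the softmax output is essentially 100% and 0%:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([11.7, -2.6])))
# -> approximately [0.9999994, 0.0000006], i.e. ~100% "let's" and ~0% "go"
```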

💡Residual Connections

Residual connections are a technique used to train complex neural networks more efficiently. In the video, residual connections are used around each encoding and decoding subunit so that, for example, the self-attention layer does not also have to preserve the word embedding and positional encoding information. This makes the network easier to train and helps it produce accurate translations.

💡Normalization

Normalization is used to stabilize the values after each step of the network. In the video, the original Transformer normalizes the values after positional encoding and after attention, which improves its ability to encode and decode longer sentences and more complicated phrases, and helps keep training stable.

💡Dot Product

The dot product is a mathematical operation used to measure the similarity between vectors. In the video, self-attention uses dot products to compute the similarity between words: the query values for one word are multiplied, pair by pair, with the key values for another word, and the products are summed. The results are passed through a softmax function to determine how much weight each word gets in the encoding.
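
A tiny worked example of the dot product as a similarity score (the two vectors here are made up, not the actual query and key values from the video):

```python
query = [2.0, 1.0]
key = [1.5, -0.5]
similarity = sum(q * k for q, k in zip(query, key))   # 2.0*1.5 + 1.0*(-0.5) = 2.5
```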

💡Fully Connected Layer

A fully connected layer is a basic neural network building block in which every input is connected to every output. In the video, the decoder uses a fully connected layer to select the final output word: the inputs are multiplied by weights, biases are added to the sums, and the results are run through a softmax function to determine the final output.
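
A minimal sketch of this final step: a fully connected layer maps the two values that represent the current token to one score per token in the output vocabulary, and a softmax picks the winner. Only the shapes follow the video (two inputs, four outputs); the weights, biases, and input values are placeholders.

```python
import numpy as np

output_vocab = ["ir", "vamos", "y", "<EOS>"]

W = np.random.default_rng(1).normal(size=(2, 4))   # placeholder weights
b = np.zeros(4)                                    # placeholder biases

def choose_next_token(decoder_values):
    logits = decoder_values @ W + b        # multiply inputs by weights, add biases
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                   # softmax over the output vocabulary
    return output_vocab[int(np.argmax(probs))]

print(choose_next_token(np.array([2.2, -1.5])))   # picks one of the four output tokens
```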

Highlights

Transformer neural networks are explained in detail, focusing on their ability to translate English sentences into Spanish.

Transformers are a type of neural network, and since neural networks take numbers as input and output values, words are first converted into numbers using word embeddings.

Word embeddings convert words into numbers, allowing neural networks to process linguistic data.

The process of converting words into numbers involves multiplying input values by weights and passing them through activation functions.

Positional encoding is used to maintain the order of words, which is crucial for understanding the meaning of sentences.

Positional encoding adds a set of numbers to the word embeddings that correspond to the word's position in the sentence.

Self-attention is a mechanism within Transformers that associates words with their context within the sentence.

Self-attention calculates similarities between each word and all other words in the sentence, including itself.

The softmax function is used to translate similarity scores into a distribution that represents the influence of each word on the encoding of a given word.

Transformers use residual connections to add the position-encoded word embeddings to the self-attention values, which makes complex networks easier to train.

The encoder-decoder architecture of Transformers allows for the translation of input phrases into output phrases in different languages.

Encoder-decoder attention helps the decoder keep track of significant words in the input sentence, ensuring accurate translation.

The decoder uses self-attention and encoder-decoder attention to generate the translated output, starting with the EOS token.

The translation process continues until the decoder outputs an EOS token, indicating the end of the translated sentence.

Transformers can be scaled to handle larger vocabularies and longer sentences by normalizing values and using scaled dot products for attention.

Additional neural networks with hidden layers can be added to the encoder and decoder to increase the model's complexity and improve its performance.

The original Transformer model used a vocabulary of 37,000 tokens and demonstrated the ability to encode and decode long and complex phrases.

Transcripts

00:00
[Music] "Translation? It's done with a Transformer!" StatQuest! Hello, I'm Josh Starmer, and welcome to StatQuest. Today we're going to talk about Transformer neural networks, and they're going to be clearly explained. Transformers are more fun when you build them in the cloud with Lightning. BAM!

00:25
Right now people are going bonkers about something called ChatGPT. For example, our friend StatSquatch might type something into ChatGPT like "Write an awesome song in the style of StatQuest." "Translation? It's done with a Transformer!" Anyway, there's a lot to be said about how ChatGPT works, but fundamentally it is based on something called a Transformer. So in this StatQuest we're going to show you how a Transformer works, one step at a time. Specifically, we're going to focus on how a Transformer neural network can translate a simple English sentence, "let's go", into Spanish: "vamos".

01:17
Now, since a Transformer is a type of neural network, and neural networks usually only have numbers for input values, the first thing we need to do is find a way to turn the input and output words into numbers. There are a lot of ways to convert words into numbers, but for neural networks one of the most commonly used methods is called word embedding. The main idea of word embedding is to use a relatively simple neural network that has one input for every word and symbol in the vocabulary that you want to use. In this case we have a super simple vocabulary that allows us to input short phrases like "let's go" and "to go", and we have an input for the symbol EOS, which stands for end of sentence, or end of sequence. Because the vocabulary can be a mix of words, word fragments, and symbols, we call each input a token.

02:17
The inputs are then connected to something called an activation function, and in this example we have two activation functions. Each connection multiplies the input value by something called a weight. "Hey Josh, where do these numbers come from?" Great question, Squatch, and we'll answer it in just a bit. For now, let's just see how we convert the word "let's" into numbers.

02:43
First we put a 1 into the input for "let's" and put zeros into all of the other inputs. Now we multiply the inputs by their weights on the connections to the activation functions. For example, the input for "let's" is 1, so we multiply 1.87 by 1 to get 1.87 going to the activation function on the left, and we multiply 0.09 by 1 to get 0.09 going to the activation function on the right. In contrast, if the input value for the word "to" is 0, then we multiply -1.45 by 0 to get 0 going to the activation function on the left, and we multiply 1.50 by 0 to get 0 going to the activation function on the right. In other words, when an input value is 0, it only sends zeros to the activation functions, and that means "to", "go", and the EOS symbol all just send zeros to the activation functions. Only the weight values for "let's" end up at the activation functions, because its input value is 1. So in this case, 1.87 goes to the activation function on the left and 0.09 goes to the activation function on the right.

04:12
In this example the activation functions themselves are just identity functions, meaning the output values are the same as the input values. In other words, if the input value, or x-axis coordinate, for the activation function on the left is 1.87, then the output value, the y-axis coordinate, will also be 1.87. Likewise, because the input to the activation function on the right is 0.09, the output is also 0.09. Thus these output values, 1.87 and 0.09, are the numbers that represent the word "let's". BAM! Likewise, if we want to convert the word "go" into numbers, we set the input value for "go" to 1 and all of the other inputs to 0, and we end up with -0.78 and 0.27 as the numbers that represent the word "go". And that is how we use word embedding to convert our input phrase "let's go" into numbers. BAM!

05:24
Note: there is a lot more to say about word embedding, so if you're interested, check out the Quest. Also, before we move on, I want to point out two things. First, we reuse the same word embedding network for each input word or symbol. In other words, the weights in the network for "let's" are the exact same as the weights in the network for "go". This means that regardless of how long the input sentence is, we just copy and use the exact same word embedding network for each word or symbol, and this gives us the flexibility to handle input sentences with different lengths.

06:04
The second thing I want to mention is that all of these weights, and all of the other weights we're going to talk about in this Quest, are determined using something called backpropagation. To get a sense of what backpropagation does, let's imagine we had this data and we wanted to fit a line to it. Backpropagation would start with a line that has a random value for the y-axis intercept and a random value for the slope, and then, using an iterative process, backpropagation would change the y-axis intercept and slope one step at a time until it found the optimal values. Likewise, in the context of neural networks, each weight starts out as a random number, but when we train the Transformer with English phrases and known Spanish translations, backpropagation optimizes these values one step at a time and results in the final weights. Also, just to be clear, the process of optimizing the weights is also called training. BAM! Note: there is a lot more to be said about training and backpropagation, so if you're interested, check out the Quests.

07:19
Now, because the word embedding networks are taking up the whole screen, let's shrink them down and put them in the corner. Okay, now that we know how to convert words into numbers, let's talk about word order. For example, if Norm said "Squatch eats pizza", then Squatch might say "yum". In contrast, if Norm said "Pizza eats Squatch", then Squatch might say "yikes". So these two phrases, "Squatch eats pizza" and "Pizza eats Squatch", use the exact same words but have very different meanings. Keeping track of word order is super important, so let's talk about positional encoding, which is a technique that Transformers use to keep track of word order.

08:14
We'll start by showing how to add positional encoding to the first phrase, "Squatch eats pizza". Note: there are a bunch of ways to do positional encoding, but we're just going to talk about one popular method. That said, the first thing we do is convert the words "Squatch eats pizza" into numbers using word embedding. In this example we've got a new vocabulary and we're creating four word embedding values per word; however, in practice people often create hundreds or even thousands of embedding values per word. Now we add a set of numbers that correspond to word order to the embedding values for each word. "Hey Josh, where do the numbers that correspond to word order come from?"

09:02
In this case, the numbers that represent the word order come from a sequence of alternating sine and cosine squiggles. Each squiggle gives specific position values for each word's embeddings. For example, the y-axis values on the green squiggle give us position encoding values for the first embeddings for each word. Specifically, for the first word, which has an x-axis coordinate all the way to the left of the green squiggle, the position value for the first embedding is the y-axis coordinate, 0. The position value for the second embedding comes from the orange squiggle, and the y-axis coordinate on the orange squiggle that corresponds to the first word is 1. Likewise, the blue squiggle, which is more spread out than the first two squiggles, gives us the position value for the third embedding, which for the first word is 0. Lastly, the red squiggle gives us the position value for the fourth embedding, which for the first word is 1. Thus, the position values for the first word come from the corresponding y-axis coordinates on the squiggles.

10:16
Now, to get the position values for the second word, we simply use the y-axis coordinates on the squiggles that correspond to the x-axis coordinate for the second word. Lastly, to get the position values for the third word, we use the y-axis coordinates on the squiggles that correspond to the x-axis coordinate for the third word. Note: because the sine and cosine squiggles are repetitive, it's possible that two words might get the same position, or y-axis, values. For example, the second and third words both got -0.9 for the first position value. However, because the squiggles get wider for larger embedding positions (the more embedding values we have, the wider the squiggles get), then even with a repeat value here and there we end up with a unique sequence of position values for each word. Thus, each input word ends up with a unique sequence of position values.

11:19
Now all we have to do is add the position values to the embedding values, and we end up with the word embeddings plus positional encoding for the whole sentence "Squatch eats pizza". Yum! Note: if we reverse the order of the input words to be "Pizza eats Squatch", then the embeddings for the first and third words get swapped, but the positional values for the first, second, and third words stay the same. When we add the positional values to the embeddings, we end up with new positional encodings for the first and third words, and the second word, since it didn't move, stays the same. Thus, positional encoding allows a Transformer to keep track of word order. BAM!

12:10
Now let's go back to our simple example, where we are just trying to translate the English sentence "let's go", and add position values to the word embeddings. The first embedding for the first word, "let's", gets 0 and the second embedding gets 1, and the first embedding for the second word, "go", gets -0.9 and the second embedding gets 0.4. Now we just do the math to get the positional encoding for both words. BAM! And because we're going to need all the space we can get, let's consolidate the math in the diagram and let the sine, cosine, and plus symbols represent the positional encoding.

12:54
Now that we know how to keep track of each word's position, let's talk about how a Transformer keeps track of the relationships among words. For example, if the input sentence was "The pizza came out of the oven and it tasted good", then the word "it" could refer to "pizza", or, potentially, it could refer to the word "oven". "Josh, I've heard of good-tasting pizza, but never a good-tasting oven." I know, Squatch, and that's why it's important that the Transformer correctly associates the word "it" with "pizza". The good news is that Transformers have something called self-attention, which is a mechanism to correctly associate the word "it" with the word "pizza".

13:43
In general terms, self-attention works by seeing how similar each word is to all of the words in the sentence, including itself. For example, self-attention calculates the similarity between the first word, "the", and all of the words in the sentence, including itself, and self-attention calculates these similarities for every word in the sentence. Once the similarities are calculated, they are used to determine how the Transformer encodes each word. For example, if you looked at a lot of sentences about pizza and the word "it" was more commonly associated with "pizza" than with "oven", then the similarity score for "pizza" will cause it to have a larger impact on how the word "it" is encoded by the Transformer. BAM!

14:33
Now that we know the main ideas of how self-attention works, let's look at the details. So let's go back to our simple example, where we had just added positional encoding to the words "let's" and "go". The first thing we do is multiply the position-encoded values for the word "let's" by a pair of weights, and we add those products together to get -1.0. Then we do the same thing with a different pair of weights to get 3.7. We do this twice because we started out with two position-encoded values that represent the word "let's", and after doing the math two times we still have two values representing the word "let's". "Josh, I don't get it. If we want two values to represent 'let's', why don't we just use the two values we started with?" That's a great question, Squatch, and we'll answer it in a little bit.

15:33
Anyway, for now just know that we have these two new values to represent the word "let's", and in Transformer terminology we call them query values. Now that we have query values for the word "let's", let's use them to calculate the similarity between itself and the word "go". We do this by creating two new values, just like we did for the query, to represent the word "let's", and we create two new values to represent the word "go". Both sets of new values are called key values, and we use them to calculate similarities with the query for "let's".

16:13
One way to calculate similarities between the query and the keys is to calculate something called a dot product. For example, in order to calculate the dot product similarity between the query and key for "let's", we simply multiply each pair of numbers together and then add the products to get 11.7. Likewise, we can calculate the dot product similarity between the query for "let's" and the key for "go" by multiplying the pairs of numbers together and adding the products to get -2.6. The relatively large similarity value for "let's" relative to itself (11.7), compared to the relatively small value for "let's" relative to the word "go" (-2.6), tells us that "let's" is much more similar to itself than it is to the word "go". That said, if you remember the example where the word "it" could relate to "pizza" or "oven", the word "it" should have a relatively large similarity value with respect to the word "pizza", since it refers to the pizza and not the oven. Note: there's a lot to be said about calculating similarities in this context and the dot product, so if you're interested, check out the Quests.

17:34
Anyway, since "let's" is much more similar to itself than it is to the word "go", we want "let's" to have more influence on its encoding than the word "go", and we do this by first running the similarity scores through something called a softmax function. The main idea of a softmax function is that it preserves the order of the input values, from low to high, and translates them into numbers between 0 and 1 that add up to 1. So we can think of the output of the softmax function as a way to determine what percentage of each input word we should use to encode the word "let's". In this case, because "let's" is so much more similar to itself than to the word "go", we'll use 100% of the word "let's" to encode "let's" and 0% of the word "go" to encode the word "let's". Note: there's a lot more to be said about the softmax function, so if you're interested, check out the Quest.

18:34
Anyway, because we want 100% of the word "let's" to encode "let's", we create two more values, that we'll cleverly call values, to represent the word "let's", and scale the values that represent "let's" by 1.0. Then we create two values to represent the word "go" and scale those values by 0.0. Lastly, we add the scaled values together, and these sums, which combine separate encodings for both input words, "let's" and "go", relative to their similarity to "let's", are the self-attention values for "let's". BAM!

19:15
Now that we have self-attention values for the word "let's", it's time to calculate them for the word "go". The good news is that we don't need to recalculate the keys and values. Instead, all we need to do is create the query that represents the word "go" and do the math: first calculate the similarity scores between the new query and the keys, then run the similarity scores through a softmax, then scale the values and add them together, and we end up with the self-attention values for "go".

19:53
Note: before we move on, I want to point out a few details about self-attention. First, the weights that we use to calculate the self-attention queries are the exact same for "let's" and "go". In other words, this example uses one set of weights for calculating self-attention queries, regardless of how many words are in the input. Likewise, we reuse the sets of weights for calculating self-attention keys and values for each input word. This means that no matter how many words are input into the Transformer, we just reuse the same sets of weights for self-attention queries, keys, and values. The other thing I want to point out is that we can calculate the queries, keys, and values for each word at the same time. In other words, we don't have to calculate the query, key, and value for the first word before moving on to the second word. And because we can do all of the computation at the same time, Transformers can take advantage of parallel computing and run fast.

21:01
Now that we understand the details of how self-attention works, let's shrink it down so we can keep building our Transformer. BAM! "Josh, you forgot something: if we want two values to represent 'let's', why don't we just use the two position-encoded values we started with?" First, the new self-attention values for each word contain input from all of the other words, and this helps give each word context and can help establish how each word in the input is related to the others. Also, if we think of this unit, with its weights for calculating queries, keys, and values, as a self-attention cell, then in order to correctly establish how words are related in complicated sentences and paragraphs, we can create a stack of self-attention cells, each with its own sets of weights that we apply to the position-encoded values for each word, to capture different relationships among the words. In the manuscript that first described Transformers, they stacked eight self-attention cells and called this multi-head attention. Why eight, instead of 12 or 16? I have no idea. BAM!
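
A minimal sketch of the "stack of self-attention cells" idea (multi-head attention): run the same self-attention computation several times, each with its own set of query/key/value weights, and collect the results. The weights and inputs below are random placeholders, and concatenating the per-cell outputs is one common way to combine them that is assumed here, since the video doesn't cover that detail.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # One self-attention cell: queries, keys, values, then a weighted sum.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return softmax(q @ k.T) @ v

def multi_head_attention(x, heads):
    # 'heads' is a list of (w_q, w_k, w_v) weight sets, one per self-attention cell.
    outputs = [self_attention(x, w_q, w_k, w_v) for (w_q, w_k, w_v) in heads]
    return np.concatenate(outputs, axis=-1)   # one common way to combine the cells

rng = np.random.default_rng(2)
x = rng.normal(size=(2, 2))                                 # position-encoded values for two words
heads = [tuple(rng.normal(size=(2, 2)) for _ in range(3))   # 8 cells, as in the original manuscript
         for _ in range(8)]
print(multi_head_attention(x, heads).shape)                 # (2, 16): two words, 8 cells x 2 values
```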

22:17
Okay, going back to our simple example with only one self-attention cell, there's one more thing we need to do to encode the input: we take the position-encoded values and add them to the self-attention values. These bypasses are called residual connections, and they make it easier to train complex neural networks by allowing the self-attention layer to establish relationships among the input words without having to also preserve the word embedding and positional encoding information. BAM! And that's all we need to do to encode the input for this simple Transformer. DOUBLE BAM!

23:02
Note: this simple Transformer only contains the parts required for encoding the input: word embedding, positional encoding, self-attention, and residual connections. These four features allow the Transformer to encode words into numbers, encode the positions of the words, encode the relationships among the words, and relatively easily and quickly train in parallel. That said, there are lots of extra things we can add to a Transformer, and we'll talk about those at the end of this Quest. BAM!

23:40
So now that we've encoded the English input phrase "let's go", it's time to decode it into Spanish. In other words, the first part of a Transformer is called an encoder, and now it's time to create the second part, a decoder. The decoder, just like the encoder, starts with word embedding; however, this time we create embedding values for the output vocabulary, which consists of the Spanish words "ir", "vamos", "y", and the EOS (end of sequence) token.

24:14
Now, because we just finished encoding the English sentence "let's go", the decoder starts with the embedding values for the EOS token. In this case we're using the EOS token to start the decoding because that is a common way to initialize the process of decoding the encoded input sentence; however, sometimes you'll see people use SOS, for start of sentence or start of sequence, to initialize the process. "Josh, starting with SOS makes more sense to me." Then you can do it that way, Squatch; I'm just saying a lot of people start with EOS. Anyway, we plug in 1 for EOS and 0 for everything else, do the math, and we end up with 2.70 and -1.34 as the numbers that represent the EOS token. BAM!

25:10
Now let's shrink the word embedding down to make more space so that we can add the positional encoding. Note: these are the exact same sine and cosine squiggles that we used when we encoded the input, and since the EOS token is in the first position, with two embeddings, we just add those two position values, and we get 2.70 and -0.34 as the position and word embedding values representing the EOS token. BAM! Now let's consolidate the math in the diagram.

25:45
Before we move on to the next step, let's review a key concept from when we encoded the input. One key concept from earlier was that we created a single unit to process an input word and then just copied that unit for each word in the input; if we had more words, we would just make more copies of the same unit. By creating a single unit that can be copied for each input word, the Transformer can do all of the computation for each word in the input at the same time. For example, we can calculate the word embeddings on different processors at the same time, then add the positional encoding at the same time, then calculate the queries, keys, and values at the same time, and once that is done we can calculate the self-attention values at the same time, and lastly we can calculate the residual connections at the same time. Doing all of the computations at the same time, rather than sequentially for each word, means we can process a lot of words relatively quickly on a chip with a lot of computing cores, like a GPU (graphics processing unit), or on multiple chips in the cloud.

27:03
Likewise, when we decode and translate the input, we want a single unit that we can copy for each translated word, for the same reason: we want to do the math quickly. So even though we're only processing the EOS token so far, we add a self-attention layer so that ultimately we can keep track of related words in the output. Now that we have the query, key, and value numbers for the EOS token, we calculate its self-attention values just like before, and the self-attention values for the EOS token are -2.8 and -2.3. Note: the sets of weights we used to calculate the decoder's self-attention query, key, and value are different from the sets we used in the encoder. Now let's consolidate the math and add residual connections, just like before. BAM!

28:01
So far we've talked about how self-attention helps the Transformer keep track of how words are related within a sentence. However, since we're translating a sentence, we also need to keep track of the relationships between the input sentence and the output. For example, if the input sentence was "Don't eat the delicious looking and smelling pizza", then when translating, it's super important to keep track of the very first word, "don't". If the translation focuses on other parts of the sentence and omits the "don't", then we'll end up with "Eat the delicious looking and smelling pizza", and these two sentences have completely opposite meanings. So it's super important for the decoder to keep track of the significant words in the input, and the main idea of encoder-decoder attention is to allow the decoder to do exactly that.

29:04
Now that we know the main idea behind encoder-decoder attention, here are the details. First, to give us a little more room, let's consolidate the math and the diagrams. Now, just like we did for self-attention, we create two new values to represent the query for the EOS token in the decoder. Then we create keys for each word in the encoder, and we calculate the similarities between the EOS token in the decoder and each word in the encoder by calculating the dot products, just like before. Then we run the similarities through a softmax function, and this tells us to use 100% of the first input word and 0% of the second when the decoder determines what should be the first translated word. Now that we know what percentage of each input word to use when determining the first translated word, we calculate values for each input word, scale those values by the softmax percentages, and then add the pairs of scaled values together to get the encoder-decoder attention values. BAM!

30:19
Now, to make room for the next step, let's consolidate the encoder-decoder attention in our diagram. Note: the sets of weights that we use to calculate the queries, keys, and values for encoder-decoder attention are different from the sets of weights we use for self-attention; however, just like for self-attention, the sets of weights are copied and reused for each word. This allows the Transformer to be flexible with the length of the inputs and outputs. Also, we can stack encoder-decoder attention, just like we can stack self-attention, to keep track of words in complicated phrases. BAM!

31:01
Now we add another set of residual connections, which allow the encoder-decoder attention to focus on the relationships between the output words and the input words without having to preserve the self-attention or the word and position encoding that happened earlier. Then we consolidate the math and the diagram. Lastly, we need a way to take these two values that represent the EOS token in the decoder and select one of the four output tokens: "ir", "vamos", "y", or EOS. So we run these two values through a fully connected layer that has one input for each value that represents the current token (in this case, two inputs) and one output for each token in the output vocabulary (in this case, four outputs). Note: a fully connected layer is just a simple neural network with weights, numbers we multiply the inputs by, and biases, numbers we add to the sums of the products. When we do the math, we get four output values, which we run through a final softmax function to select the first output word: "vamos". BAM! Note: "vamos" is the Spanish translation for "let's go". TRIPLE BAM? No, not yet.

32:25
So far the translation is correct, but the decoder doesn't stop until it outputs an EOS token. So let's consolidate our diagrams, plug the translated word "vamos" into a copy of the decoder's embedding layer, and do the math. First we get the word embeddings for "vamos", then we add the positional encoding. Now we calculate self-attention values for "vamos" using the exact same weights that we used for the EOS token, then add the residual connections, and calculate the encoder-decoder attention using the same sets of weights that we used for the EOS token. Now we add more residual connections. Lastly, we run the values that represent "vamos" through the same fully connected layer and softmax that we used for the EOS token, and the second output from the decoder is EOS, so we are done decoding. TRIPLE BAM!
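
The decoding procedure just described, start from the EOS token, generate one word, feed it back in, and stop when the decoder outputs EOS again, can be summarized as a short loop. This is only a schematic sketch: decoder_step stands for the whole stack of embedding, positional encoding, self-attention, encoder-decoder attention, residual connections, and the fully connected layer, and is not defined here.

```python
def translate(encoded_input, decoder_step, max_len=20):
    output = ["<EOS>"]                    # decoding is initialized with the EOS token
    while len(output) <= max_len:
        next_token = decoder_step(encoded_input, output)   # pick the next output word
        if next_token == "<EOS>":         # stop once the decoder emits EOS again
            break
        output.append(next_token)
    return output[1:]                     # drop the initial EOS; e.g. ["vamos"]
```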

33:27
At long last, we've shown how a Transformer can encode a simple input phrase, "let's go", and decode the encoding into the translated phrase, "vamos". In summary, Transformers use word embedding to convert words into numbers, positional encoding to keep track of word order, self-attention to keep track of word relationships within the input and output phrases, encoder-decoder attention to keep track of the relationships between the input and output phrases so that important words in the input are not lost in the translation, and residual connections to allow each subunit, like self-attention, to focus on solving just one part of the problem.

34:15
Now that we understand the main ideas of how Transformers work, let's talk about a few extra things we can add to them. In this example we kept things super simple; however, if we had larger vocabularies (the original Transformer had 37,000 tokens) and longer input and output phrases, then in order to get the model to work, they had to normalize the values after every step. For example, they normalized the values after positional encoding and after self-attention, in both the encoder and the decoder. Also, when we calculated attention values we used the dot product to calculate the similarities, but you can use whatever similarity function you want. In the original Transformer manuscript, they calculated the similarities with a dot product divided by the square root of the number of embedding values per token. Just like with normalizing the values after each step, they found that scaling the dot product helped encode and decode long and complicated phrases.
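
As just described, the original Transformer divides each dot product by the square root of the number of embedding values per token before applying the softmax. A minimal sketch of that scaled version of the attention calculation:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    d = q.shape[-1]                                    # number of embedding values per token
    scores = q @ k.T / np.sqrt(d)                      # dot products, scaled by sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax
    return weights @ v
```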

35:19
Lastly, to give a Transformer more weights and biases to fit to complicated data, you can add additional neural networks, with hidden layers, to both the encoder and the decoder. BAM!

35:31
Now it's time for some Shameless Self-Promotion. If you want to review statistics and machine learning offline, check out the StatQuest PDF study guides and my book, The StatQuest Illustrated Guide to Machine Learning, at statquest.org. There's something for everyone. Hooray! We've made it to the end of another exciting StatQuest. If you like this StatQuest and want to see more, please subscribe. And if you want to support StatQuest, consider contributing to my Patreon campaign, becoming a channel member, buying one or two of my original songs or a t-shirt or a hoodie, or just donate. The links are in the description below. Alright, until next time: Quest on!
