Demystifying how GPT works: From Architecture to...Excel!?! 🚀

Spreadsheets are all you need
9 Oct 2023 · 09:57

Summary

TLDR: This video series shows how to implement GPT-2, an early ancestor of ChatGPT, as a large language model built entirely in a spreadsheet. Using GPT-2 small as the example, it walks through everything from splitting text into tokens and mapping each token to a list of numbers, to the structure of the model including multi-headed attention and the multi-layer perceptron, all with basic spreadsheet functions. The approach gives a deeper understanding of how modern AI technology actually works, and future videos will cover each of these steps in more detail.

Takeaways

  • 📊 The series implements a large language model (GPT-2) using nothing but basic spreadsheet functions.
  • 🔍 Text is split into tokens based on a predefined dictionary.
  • 🧮 Tokens are mapped to token IDs using an algorithm called byte pair encoding.
  • 📈 Each token is mapped to a list of 768 numbers that captures its meaning and position.
  • 🔄 The tokens-to-text embeddings reflect both a token's meaning and its position in the prompt.
  • 💡 Relationships between tokens are analyzed through multi-headed attention and a multi-layer perceptron (a kind of neural network).
  • 🔗 Each block's output becomes the next block's input, and GPT-2 repeats this process across 12 different layers.
  • 🎯 The attention mechanism identifies the important words in the sentence and how they relate.
  • 🀖 The multi-layer perceptron determines the most likely meaning of a word in the given context.
  • 📝 The final language head selects the most likely next token and appends it to the sentence.

Q & A

  • How is text processed in the spreadsheet implementation of GPT-2?

    - The text is first split into tokens. Each word is converted into tokens based on a predefined dictionary, and the spreadsheet's "prompt to tokens" tab maps them to the final token IDs using the byte pair encoding algorithm.

  • What are embeddings, and how are they used in GPT-2?

    - An embedding maps each token to a list of numbers. In GPT-2 small, each token is mapped to a list of 768 numbers that captures the token's meaning and position.

  • What is the purpose of position embeddings?

    - Position embeddings slightly alter the embedding values according to a token's position in the prompt, capturing where it sits. This lets the model distinguish the same word appearing at different positions and in different contexts.

  • What is the role of multi-headed attention?

    - Multi-headed attention works out how the words in a sentence relate to each other and identifies the important ones, capturing context. For example, it can recognize that "he" refers to "Mike".

  • What does the multi-layer perceptron do?

    - The multi-layer perceptron distinguishes between the multiple meanings a word can have and selects the most appropriate one based on context. This lets the model predict the next word or token more accurately.

  • What is the role of the language head?

    - The language head converts the output of the final block into a set of probabilities over the known tokens in the dictionary and selects the most likely token to complete the sentence.

  • How is the next token selected in the spreadsheet implementation of GPT-2?

    - In the spreadsheet, the most likely token is selected based on the probabilities generated from the final block's output. In this demo, the token with the highest probability is chosen.

  • What is the role of each block in GPT-2's iterative process?

    - Each GPT-2 block contains an attention mechanism and a perceptron; it takes an input, processes it, and produces the output for the next block. This process is repeated across 12 different layers, or blocks.

  • Can you give an example of how a token is mapped to an embedding?

    - For example, the word "Mike" is mapped to a token ID and then converted into a list of 768 numbers that represent the word's meaning and its position.

  • What does temperature zero mean?

    - Temperature zero means the model always selects only the single most likely token. This gives consistent output, while picking from a larger set of tokens adds variety (see the sketch below).
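For illustration, here is a small Python sketch (not the spreadsheet's formulas) of the difference between temperature zero and sampling at a higher temperature; the logits and candidate tokens are made up:

```python
import numpy as np

def next_token(logits, temperature=0.0, rng=np.random.default_rng()):
    """Pick a next-token index from raw logits.

    temperature == 0 reproduces the spreadsheet's MAX-style choice;
    higher temperatures flatten the distribution and sample from it.
    """
    if temperature == 0:
        return int(np.argmax(logits))              # always the single most likely token
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

# Made-up logits for five hypothetical candidate tokens.
logits = np.array([2.1, 1.3, 0.2, -0.5, -1.0])
print(next_token(logits))                          # temperature zero: deterministic
print(next_token(logits, temperature=0.8))         # sampling: can vary from run to run
```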

Outlines

00:00

📝 Overview of GPT-2 in a spreadsheet

This paragraph explains that the spreadsheet implements GPT-2's structure and processing flow: tokenizing the input text, generating the embeddings, and iterating through blocks that combine attention and an MLP.

05:03

🔍 Inside the blocks and the language head

The second paragraph looks inside the spreadsheet's 12 blocks, shows the attention values (for example, "he" attending most strongly to "Mike"), and explains how the language head turns the final block's output into probabilities and picks the next token at temperature zero.

Keywords

💡Tokenization

The process of splitting text into tokens. This is the first step that the model performs on the input text in order to convert it into a format that can be processed by the later stages of the model.
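Outside the spreadsheet, GPT-2's byte pair encoding can be reproduced with the open-source tiktoken library. This is a minimal sketch, assuming the package is installed; it is not part of the spreadsheet itself:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")            # GPT-2's byte-pair-encoding dictionary
ids = enc.encode("Mike is quick. He moves")
print(ids)                                      # the token IDs the model actually sees
print([enc.decode([i]) for i in ids])           # roughly ['Mike', ' is', ' quick', '.', ' He', ' moves']
```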

💡Embeddings

The numeric representations of tokens that capture semantic meaning as well as position. Each token is mapped to a 768-dimensional vector.
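Mechanically, the embedding step is just indexing rows of two tables and adding them. Here is a hedged NumPy sketch with random stand-in weights and hypothetical token IDs (GPT-2's real tables are learned during training):

```python
import numpy as np

vocab_size, n_positions, d_model = 50257, 1024, 768    # GPT-2 small dimensions
rng = np.random.default_rng(0)
wte = rng.normal(size=(vocab_size, d_model))    # token-embedding table (random stand-in)
wpe = rng.normal(size=(n_positions, d_model))   # position-embedding table (random stand-in)

token_ids = [16073, 318, 2068, 13, 679, 6100]   # hypothetical IDs for "Mike is quick. He moves"
x = wte[token_ids] + wpe[np.arange(len(token_ids))]   # meaning plus position, per token
print(x.shape)                                  # (6, 768): one 768-number list per token
```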

Highlights

The transcript walks through implementing GPT-2 in a spreadsheet using basic functions

The spreadsheet implements a smaller version called GPT-2 small but has the same architecture

Input text is split into tokens using byte-pair encoding

Tokens are mapped to lists of numbers called embeddings that capture meaning and position

There are 12 blocks with attention and multi-layer perceptron layers to refine predictions

Attention figures out which words are most relevant to refine the predictions

The final step predicts the most likely next token to complete the prompt

The spreadsheet picks the token with the highest probability for simplicity

The input text is parsed into tokens that map to IDs

Embeddings capture position as well as meaning of tokens

Attention identifies which words have the most influence on predictions

The blocks implement attention and neural network layers iteratively

Attention helps disambiguate meanings of words for the neural network

The final output predicts and selects the most likely next token

The spreadsheet uses the token with maximum probability for consistency

Transcripts

[00:00] Welcome to Spreadsheets Are All You Need: How GPT Works, where if you can read a spreadsheet, you can understand modern AI. That's because in this series we're walking through a spreadsheet that implements a large language model entirely in basic spreadsheet functions. And not just any large language model: we're implementing GPT-2, an early ancestor of ChatGPT. Now, because it is a spreadsheet, it can only support a smaller context length, and it implements the smallest form of GPT-2, known as GPT-2 small, but architecturally, for all intents and purposes, it's the same model that was breaking headlines just a few short years ago. Let's take a look under the hood at how it works. In subsequent videos we're going to go through each of these stages step by step, but for now I'm going to touch on each one lightly, as a kind of table of contents for future videos. In addition, I've added a final column here on the right that indicates what tab in the spreadsheet corresponds to what action inside GPT-2.

[01:09] Let's start at the beginning. After you input your text, it is split into a series of tokens. So, for example, let's take "Mike is quick. He moves." This would be split into tokens per a predefined dictionary. Now, you'll note that every single word here corresponds to a single token, but that is not always the case; in fact, it's not uncommon for a single word to be split into two, three, or even more tokens. Let's take a look at the spreadsheet. So here's where you input your prompt, and because of the way the parsing works, you have to put each word on a separate line, and you have to add the spaces as well as the punctuation. It then gets taken to this sheet, or tab, called "prompt to tokens", where it goes through an algorithm called byte pair encoding to map it to the final list of known token IDs you see right here.

[02:04] Now that we have the tokens, we need to map them to a series of numbers called an embedding. Every token is mapped to a long list of numbers; in the case of GPT-2 small it's a list of 768 numbers. These capture both the meaning and the position of each token in the prompt. Let's see how this works inside the spreadsheet.

[02:30] Okay, so here we are in the spreadsheet that implements this. It's the "tokens to text embeddings" tab, and there are two parts to it. At the top you'll see our prompt tokens, "Mike is quick. He moves", and these are the prompt IDs we saw in the earlier step. Then from column three onwards is the list of 768 numbers that represents the semantic meaning of the word "Mike". Let's go look at column 770, and we can see where this list ends; right here you can see the list ending. Let's go back to the beginning, and you'll notice there's another list here. The job of this list is to change the tokens from the list above to reflect their different positions in the prompt. Let me explain and demonstrate that by changing this word "moves" to the word "Mike", which is the first word in our prompt. We'll go through here, recalculate our tokens, and see we get "Mike" again. Then we go back to our tokens-to-text embeddings, recalculate the sheet, and you'll notice that "Mike" here has the same ID and the exact same embedding values as it does up here: row two and row seven are totally identical. That's because the only job of this first set of rows is to capture the semantic meaning. But when we take a look here at this part, where we have the position embeddings, you'll notice that the values of the embedding for "Mike" at position one are different from the values for "Mike" at position six. We've effectively altered the values of the embeddings for "Mike" slightly to reflect its different position in the prompt.

[04:28] Okay, now that we've captured both the meaning and the position of the tokens in the prompt, they pass on to a series of layers, or blocks. The first is multi-headed attention, and the second is what's known as a multi-layer perceptron, which is another name for a neural network. Let's consider our sentence again, "Mike is quick. He moves", where we want the Transformer, or GPT, to fill in the last word. The attention mechanism, the first phase, tries to figure out which are the most important words in the sentence and how they relate. So, for example, it might recognize the word "he" as referring to "Mike" earlier in the prompt, or it might realize that the words "moves" and "quick" probably relate. This information is important for the next layer, the multi-layer perceptron. Take, for example, the word "quick". It has multiple meanings in English: it can mean moving fast; it can mean bright, as in quick of wit; it can mean a body part, as in the quick of your fingernail; and in Shakespearean English it can even mean alive as opposed to dead, as in the phrase "the quick and the dead". The information from the attention layer that the word "moves" is there with the word "quick" helps the multi-layer perceptron disambiguate which of these four meanings is most likely in this sentence, and that it's most likely the first one: moving in physical space. It would use that to figure out the most likely next word to complete the prompt, like the word "quickly", or the word "fast", or the word "around", all of which are about fast movement in physical space.
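To make the attention phase concrete, here is a rough NumPy sketch of single-head scaled dot-product attention weights. The token list and the tiny random query/key vectors are illustrative stand-ins for what GPT-2 learns, so the printed numbers will not match the spreadsheet:

```python
import numpy as np

def attention_weights(Q, K):
    """softmax(Q K^T / sqrt(d)) with a causal mask."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Causal mask: a token may only attend to itself and earlier tokens.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

tokens = ["Mike", " is", " quick", ".", " he", " moves"]
rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 4))   # toy query vectors, one per token
K = rng.normal(size=(6, 4))   # toy key vectors, one per token

W = attention_weights(Q, K)
# Each row sums to 1. In the trained model, the row for " he" puts most of its
# weight on the "Mike" column (about 0.48 in the video's spreadsheet).
print(np.round(W[tokens.index(" he")], 2))
```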

[06:11] It's also important to note that this attention-then-perceptron, attention-then-perceptron process happens iteratively. In GPT-2 small it happens across 12 different layers, as it iteratively refines its prediction of what the next most likely word or token should be.

[06:31] Let's see how this is implemented in the spreadsheet. You'll notice in the spreadsheet there are these tabs: block zero, block one, block two, all the way to block 11. These are our 12 blocks, and the output of block zero becomes the input of block one, and the output of block one becomes the input of block two, so they're all chained together all the way through. Let's look inside one of these blocks. So here's the first block, and each block has about 16 steps in this implementation. Steps one through roughly ten are basically your attention mechanism, and from step ten through the remaining 16 is the multi-layer perceptron. We're going to go through this in a lot more detail in future videos, but I want to give you a sneak peek of something. Here, right at step seven, is the heart of the attention mechanism: it tells us where it's paying the most attention amongst the words. So let's look at the word "he". You'll notice the largest value here, 0.48, is highest right here, so it's taking the word "he" and realizing that it most likely refers to the word "Mike". 0.48 is larger than any of the other values, so "Mike" is going to influence the values passed to the multi-layer perceptron more than any of the other words; the other words get a much smaller influence on the output it passes along. Let's take the word "moves". Again, you'll notice that it's looking most at the word "Mike", and the next word it's looking at most is "quick", so it's going to use the information from those two words, which it passes to the next layer, to try and interpret the value or meaning of the word "moves".
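For intuition, here is a heavily simplified Python sketch of one block (attention, then a two-layer perceptron, each with a residual connection) and of chaining 12 of them. It omits layer norm, the split into multiple heads, and the learned weights, so it mirrors the structure rather than the spreadsheet's exact 16 steps:

```python
import numpy as np

def gelu(x):
    # Smooth activation used inside GPT-2's multi-layer perceptron.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def simple_attention(x):
    # Stand-in for multi-headed attention: each token takes a softmax-weighted
    # average of all tokens (the real model learns these weights per head).
    scores = x @ x.T / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def block(x, W1, b1, W2, b2):
    """One simplified block: attention, then a two-layer perceptron, with residuals."""
    x = x + simple_attention(x)
    x = x + gelu(x @ W1 + b1) @ W2 + b2
    return x

# Chain the blocks: each block's output becomes the next block's input.
rng = np.random.default_rng(0)
d, hidden, n_tokens = 768, 3072, 6
x = rng.normal(size=(n_tokens, d))               # embeddings from the previous stage
for _ in range(12):                               # GPT-2 small has 12 blocks
    W1, b1 = rng.normal(size=(d, hidden)) * 0.01, np.zeros(hidden)
    W2, b2 = rng.normal(size=(hidden, d)) * 0.01, np.zeros(d)
    x = block(x, W1, b1, W2, b2)
print(x.shape)                                    # still one 768-vector per token
```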

[08:12] Okay, we're almost at the end. The last step is the language head, which figures out what the actual next likely token is. It takes the output of the final block and converts it into a set of probabilities across all the known tokens in its dictionary, and then it picks, at random from amongst the most likely tokens, one of those tokens to complete the sentence. In this case it's picked simply the highest-probability token, which was "quickly", and fills that in. Let's take a look at the spreadsheet.

[08:45] Now, in the spreadsheet you'll see this is broken across three tabs: layer norm, which is a process we'll talk about in a future video, generating logits, and a softmax, again concepts we'll talk about later, to finally get our predicted token. Now, a true large language model that you've probably played with actually picks from amongst a set of the most likely tokens, but in order to simplify this sheet we simply pick the single most likely token, which gives a very consistent output. That's why we've got a MAX function: it's just taking the most likely output. This is what's known as having temperature zero. When you go outside of temperature zero, it starts picking from more than just the top token: it starts looking at the top 10 or 20 or 30 or more tokens and picks from them according to an algorithm.
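Here is a hedged sketch of what those three tabs compute, with toy sizes and random stand-in weights: layer-normalize the final block's output, turn it into one logit per known token (GPT-2 reuses the token-embedding matrix for this), take a softmax, and at temperature zero just take the maximum:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize the final hidden state, then scale and shift.
    return gamma * (x - x.mean()) / np.sqrt(x.var() + eps) + beta

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d_model, vocab = 768, 1000        # toy vocabulary; GPT-2's real dictionary has 50,257 tokens
rng = np.random.default_rng(0)
h = rng.normal(size=d_model)              # final block's output for the last token (stand-in)
W_e = rng.normal(size=(vocab, d_model))   # token-embedding matrix, reused as the language head

h = layer_norm(h, gamma=np.ones(d_model), beta=np.zeros(d_model))
logits = W_e @ h                           # one score per known token
probs = softmax(logits)                    # probabilities across all the known tokens
next_token_id = int(np.argmax(probs))      # temperature zero: the spreadsheet's MAX step
print(next_token_id, round(float(probs[next_token_id]), 4))
```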

[09:37] Okay, that's GPT-2 at a glance. We'll be going through each of these steps in future videos, but for now I hope that gives you a starting point as to what's going on under the hood, and where you can see it happening live for yourself inside the spreadsheet. Thank you.
