Demystifying how GPT works: From Architecture to...Excel!?! 🚀

Spreadsheets are all you need
9 Oct 2023 · 09:57

Summary

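TLDR This video series shows how to implement GPT-2, a large language model and an early ancestor of ChatGPT, in a spreadsheet. Using GPT-2 small as the example, it uses only basic spreadsheet functions to walk through everything from splitting text into tokens and mapping each token to a list of numbers, to the structure of the model itself, including multi-headed attention and the multi-layer perceptron. This approach gives a deeper understanding of how modern AI works. Future videos will cover each of these steps in detail.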

Takeaways

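  • 📊 The series implements GPT-2, a large language model, using only basic spreadsheet functions.
  • 🔍 Text is split into tokens according to a predefined dictionary.
  • 🧮 An algorithm called byte pair encoding maps the text to a final list of token IDs.
  • 📈 Each token is mapped to a list of 768 numbers that captures its meaning and position.
  • 🔄 The "tokens to text embeddings" tab reflects both the meaning of each token and its position in the prompt.
  • 💡 Multi-headed attention and a multi-layer perceptron (a kind of neural network) analyze the relationships between tokens.
  • 🔗 The output of each block becomes the input of the next, and GPT-2 small repeats this process across 12 layers.
  • 🎯 The attention mechanism identifies the important words in a sentence and how they relate.
  • 🤖 The multi-layer perceptron determines the most likely meaning of a word in the given context.
  • 📝 The final language head selects the most likely next token and appends it to the sentence.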

Q & A

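  • How is text processed in the spreadsheet implementation of GPT-2?

    -The text is first split into tokens. Each word is converted to tokens according to a predefined dictionary, and the "prompt to tokens" tab of the spreadsheet uses the byte pair encoding algorithm to map it to the final token IDs.

  • What is an embedding, and how is it used in GPT-2?

    -An embedding maps each token to a list of numbers. In GPT-2 small, each token is mapped to a list of 768 numbers that captures the token's meaning and position.

  • What is the purpose of the position embeddings?

    -Position embeddings capture where a token sits in the prompt by slightly changing its embedding values according to its position. This lets the model distinguish occurrences of the same word at different positions in the prompt.

  • What is the role of multi-headed attention?

    -Multi-headed attention works out how the words in a sentence relate to each other and which words are important, and so captures context. For example, it can recognize that "he" refers to "Mike".

  • What does the multi-layer perceptron do?

    -The multi-layer perceptron distinguishes between the multiple meanings of a word and selects the most appropriate one for the context. This lets the model predict the following word or token more accurately.

  • What is the role of the language head?

    -The language head converts the output of the final block into a set of probabilities and selects the most likely token from the known tokens in the dictionary to complete the sentence.

  • How is the next token selected in the spreadsheet implementation of GPT-2?

    -The spreadsheet selects the next token based on the probabilities generated from the output of the final block. In this demo, the token with the highest probability is selected.

  • What is the role of each block in GPT-2's iterative process?

    -Each GPT-2 block contains an attention mechanism and a perceptron. It takes an input, processes it, and produces the output that feeds the next block. This process is repeated across 12 layers, or blocks.

  • Can you give an example of how a token is mapped to an embedding?

    -For example, the word 'Mike' is mapped to a token ID, which is then converted to a list of 768 numbers. This list represents the meaning of the word and its position.

  • What does temperature zero mean?

    -Temperature zero means the model selects only the single most likely token. This gives consistent output, although choosing from among more tokens can add variety.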
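The last few answers (language head, next-token selection, temperature zero) can be made concrete with a small sketch. It is not the spreadsheet's formula; the four-word vocabulary, the logit values, and the function names `softmax` and `pick_next_token` are made up for illustration. It assumes the usual convention that temperature zero means "take the maximum", and that a non-zero temperature samples from the top few candidates.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next_token(logits, temperature=0.0, top_k=None, rng=random):
    """Return the index of the chosen token.

    temperature == 0 is treated as greedy selection (a plain max),
    which is what the spreadsheet demo does.
    """
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Otherwise sample among the top_k highest-scoring candidates.
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        candidates = candidates[:top_k]
    probs = softmax([logits[i] for i in candidates], temperature)
    return rng.choices(candidates, weights=probs, k=1)[0]


# Hypothetical vocabulary and logits, not values from the real model.
vocab = ["quickly", "fast", "around", "slowly"]
logits = [3.2, 2.9, 2.1, -1.0]

print(vocab[pick_next_token(logits)])  # greedy: always the top-scoring token
print(vocab[pick_next_token(logits, temperature=1.0, top_k=3)])  # varies per run
```

With temperature zero the result is the same on every run, which is why the demo output is so consistent. With a non-zero temperature, repeated calls can return any of the top candidates.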

Outlines

00:00

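📝Overview of GPT-2 in a spreadsheet

This section explains that the structure and processing flow of GPT-2 are implemented in a spreadsheet. It outlines the tokenization of the input text, the generation of embeddings, and the iterative processing through blocks that use attention and an MLP.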

Keywords

💡Tokenization

The process of splitting text into tokens. This is the first step that the model performs on the input text in order to convert it into a format that can be processed by the later stages of the model.
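As a rough illustration of how byte pair encoding can split one word into several tokens, here is a minimal sketch. The merge table and token IDs below are invented for this example and are far smaller than GPT-2's real tables; the function name `bpe` is also hypothetical. The idea it shows is the standard one: start from single characters and repeatedly merge the adjacent pair with the best (lowest) rank until no known pair remains.

```python
def bpe(word, merge_ranks):
    """Greedily merge adjacent symbol pairs, best-ranked pair first."""
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Lowest rank wins; pairs missing from the table rank as infinity.
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # nothing left that the table knows how to merge
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Toy merge table (pair -> rank) and toy dictionary; both are assumptions.
merge_ranks = {
    ("q", "u"): 0,
    ("c", "k"): 1,
    ("i", "ck"): 2,
    ("qu", "ick"): 3,
    ("l", "y"): 4,
}
token_ids = {"quick": 0, "ly": 1}

tokens = bpe("quickly", merge_ranks)
print(tokens)                          # ['quick', 'ly']
print([token_ids[t] for t in tokens])  # [0, 1]
```

In this toy table the single word "quickly" ends up as two tokens, which mirrors the point that a word does not always correspond to exactly one token.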

💡Embeddings

The numeric representations of tokens that capture semantic meaning as well as position. Each token is mapped to a 768-dimensional vector.
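The way meaning and position combine can be sketched as follows. In GPT-2 the position embedding is added element-wise to the token embedding. Everything else here is an assumption for illustration: the vectors have 4 numbers instead of 768, the table values come from simple made-up formulas rather than trained weights, and the token IDs are hypothetical.

```python
EMBED_DIM = 4   # GPT-2 small uses 768; kept tiny here for readability
VOCAB_SIZE = 10
MAX_POSITIONS = 8

# Made-up lookup tables standing in for the trained weights.
token_table = [
    [(tid * 10 + j) / 100 for j in range(EMBED_DIM)] for tid in range(VOCAB_SIZE)
]
position_table = [
    [(pos + 1) * 0.001 * (j + 1) for j in range(EMBED_DIM)]
    for pos in range(MAX_POSITIONS)
]


def embed(token_ids):
    """Look up each token's vector and add the vector for its position."""
    return [
        [t + p for t, p in zip(token_table[tid], position_table[pos])]
        for pos, tid in enumerate(token_ids)
    ]


# Hypothetical IDs: the same token (id 3) appears first and last.
prompt_ids = [3, 5, 7, 2, 3]
vectors = embed(prompt_ids)

print(token_table[prompt_ids[0]] == token_table[prompt_ids[4]])  # True: same meaning row
print(vectors[0] == vectors[4])                                  # False: position differs
```

This matches the demonstration in the video where the word Mike appears twice: the token rows are identical, but the final values differ slightly once position is taken into account.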

Highlights

The transcript walks through implementing GPT-2 in a spreadsheet using basic functions

The spreadsheet implements a smaller version called GPT-2 small but has the same architecture

Input text is split into tokens using byte-pair encoding

Tokens are mapped to lists of numbers called embeddings that capture meaning and position

There are 12 blocks with attention and multi-layer perceptron layers to refine predictions

Attention figures out which words are most relevant to refine the predictions

The final step predicts the most likely next token to complete the prompt

The spreadsheet picks the token with the highest probability for simplicity

The input text is parsed into tokens that map to IDs

Embeddings capture position as well as meaning of tokens

Attention identifies which words have the most influence on predictions

The blocks implement attention and neural network layers iteratively

Attention helps disambiguate meanings of words for the neural network

The final output predicts and selects the most likely next token

The spreadsheet uses the token with maximum probability for consistency
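To give a feel for how attention produces the per-word weights mentioned above, here is a minimal single-head sketch. It assumes the standard scaled dot-product form with a causal mask (each token can only look at itself and earlier tokens), which is how GPT-style models work. The query and key vectors are invented numbers; the real model derives them from the embeddings with learned weights and uses several heads per block.

```python
import math


def softmax(scores):
    """Normalize scores into weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(queries, keys):
    """One row of weights per token; later tokens get weight 0 (causal mask)."""
    dim = len(keys[0])
    weights = []
    for i, query in enumerate(queries):
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)  # only positions 0..i are visible
        ]
        row = softmax(scores) + [0.0] * (len(keys) - i - 1)
        weights.append(row)
    return weights


# Made-up 2-number query/key vectors for a three-token prompt.
queries = [[1.0, 0.0], [0.5, 0.5], [2.0, 0.1]]
keys = [[1.5, 0.2], [0.1, 1.0], [0.3, 0.3]]

for row in causal_attention_weights(queries, keys):
    print([round(w, 2) for w in row])
```

Each row sums to 1, so a value such as the 0.48 shown in the video can be read as the share of attention one word gives to another.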

Transcripts

play00:00

welcome to spreadsheets are all you need

play00:03

how GPT Works where if you can read a

play00:06

spreadsheet you can understand modern AI

play00:10

That's because in this series we're

play00:12

walking through a spreadsheet that

play00:14

implements a large language model

play00:16

entirely in basic spreadsheet functions

play00:18

and not just any large language model

play00:21

we're implementing gpt2 an early

play00:23

ancestor of chat GPT now because it is a

play00:27

spreadsheet it can only support a

play00:29

smaller context length and it does

play00:31

implement the smallest form of gpt2

play00:34

known as gpt2 small but architecturally

play00:38

for all intents and purposes it's the

play00:41

same model that was breaking headlines

play00:43

just a few short years ago let's take a

play00:47

look under the hood at how it works now in

play00:49

subsequent videos we're going to go

play00:51

through each of these stages step by

play00:53

step but for now I'm going to touch on

play00:55

each one lightly as a kind of table of

play00:58

contents for future videos

play01:00

in addition I've added a final column

play01:02

here on the right that indicates what

play01:04

tab in the spreadsheet corresponds to

play01:07

what action inside

play01:09

gpt2 let's start at the beginning after

play01:12

you input your text it is split into a

play01:14

series of tokens so for example let's

play01:17

take Mike is quick he moves this would

play01:20

be split into tokens per a predefined

play01:24

dictionary now you'll note that every

play01:26

single word here corresponds to a single

play01:29

token but that is not always the case in

play01:31

fact it's not uncommon for a single word

play01:33

to be split into two three or even more

play01:37

tokens let's take a look at the

play01:39

spreadsheet so here's where you input

play01:41

your prompt and because of the way the

play01:43

parsing works you have to put each word

play01:45

on a separate line and you have to add the

play01:47

spaces as well as the punctuation it

play01:50

then gets taken to this sheet which is

play01:52

or tab called prompt to tokens where it

play01:55

goes through an algorithm called byte

play01:56

pair encoding to map it to a final list

play01:59

of known token IDs you see right

play02:04

here now that we have the tokens we need

play02:07

to map them to a series of numbers

play02:09

called an embedding every token is

play02:12

mapped to a long list of numbers in the

play02:15

case of gpt2 small it's a list of

play02:18

768 numbers these capture both the

play02:21

meaning as well as the position of each

play02:24

token in the prompt let's see how this

play02:27

works inside the

play02:28

spreadsheet

play02:30

okay so here we are in the spreadsheet

play02:32

that implements this it's tokens to text

play02:35

embeddings Tab and there's two parts to

play02:37

it at the top you'll see our prompt

play02:39

tokens Mike is quick he moves and these

play02:42

are those prompt IDs we saw from the

play02:44

earlier step and then from column three

play02:46

onwards is the list of 768 numbers

play02:50

that represent the semantic meaning of

play02:52

the word Mike let's go look at column

play02:55

770 and we can see where this list

play02:58

ends right here you can see the list

play03:00

ending let's go back to the

play03:05

beginning and you'll notice there's

play03:07

another list here the job of this list

play03:10

is to actually change the tokens from

play03:13

the list above to reflect their

play03:15

different positions in the prompt let me

play03:19

explain and demonstrate that here by

play03:21

changing this word moves to the word

play03:25

Mike which is the first

play03:28

word in our prompt we'll go through

play03:32

here we'll recalculate our

play03:36

tokens we'll see we get Mike again then

play03:39

we go back to our tokens to text embeddings

play03:42

we'll calculate the sheet and you'll

play03:44

notice that Mike here has the same ID

play03:47

and has the exact same embedding values

play03:50

as it does up here right row two and

play03:53

row seven are totally identical that's

play03:56

because the only job of this first set

play03:58

of rows is to capture the semantic

play04:01

meaning but when we take a look here at

play04:04

this part where we have the position

play04:05

embeddings you'll notice that the values

play04:08

of the embedding for Mike at position

play04:10

one are different than the values for

play04:12

Mike at position six we've effectively

play04:15

altered the values of the embeddings for

play04:18

Mike slightly to reflect its different

play04:21

position in the

play04:28

prompt okay now that we've captured both

play04:31

the meaning and the position of the tokens

play04:34

in the prompt they pass on to a series

play04:36

of layers or blocks the first is

play04:39

multi-headed attention and then the

play04:40

second is what's known as a multi-layer

play04:42

perceptron that's another name for a

play04:44

neural network let's consider our

play04:46

sentence again Mike is quick he moves

play04:49

where we want the Transformer or GPT to

play04:52

fill in the last word the attention

play04:55

mechanism the first phase tries to

play04:57

figure out what are the most important

play04:59

words in the sentence and how they

play05:02

relate so for example the word he it

play05:06

might recognize as referring to Mike

play05:08

earlier in the prompt or it might

play05:12

realize that the word moves and quick

play05:14

probably relate this information is

play05:16

important for the next layer the

play05:18

multi-layer

play05:19

perceptron so take for example this word

play05:22

quick it has multiple meanings in

play05:24

English it can mean moving fast it can

play05:26

mean bright as in quick of wit it can

play05:30

mean a body part as in the quick of your

play05:32

fingernail and in Shakespearean English

play05:34

it can even mean alive as opposed to

play05:36

dead as in the phrase the quick and the

play05:39

dead the information from the attention

play05:42

layer that the word moves is there with

play05:44

the word quick helps the multi-layer

play05:47

perceptron disambiguate which of these

play05:49

four meanings is most likely in this

play05:51

sentence and that it's most likely the

play05:53

first one moving in physical space and

play05:56

it would use that to figure out what the

play05:58

most likely next word to complete the

play06:00

prompt is like the word quickly or the

play06:03

word fast or the word around all of

play06:06

which are about fast movement in

play06:08

physical

play06:11

space it's also important to note that

play06:14

this attention then perceptron attention

play06:17

then perceptron process happens

play06:20

iteratively in gpt2 small it happens

play06:22

across 12 different layers as it

play06:24

iteratively refines its prediction of

play06:26

what the next most likely word or token

play06:28

should be

play06:31

let's see how this is implemented in the

play06:33

spreadsheet so you'll notice in the

play06:35

spreadsheet there are these tabs block

play06:37

zero block one block two all the way to

play06:39

block 11 these are our 12 blocks and the

play06:42

output of block zero becomes the input

play06:44

of block one and the output of block one

play06:46

becomes the input of block two so

play06:48

they're all chained together all the way

play06:50

through let's look inside one of these

play06:53

blocks so here's the first block and

play06:57

each block has about 16 steps in this

play07:00

implementation steps one all the way to

play07:02

around step 10 are basically your

play07:06

attention mechanism and from Step 10 all

play07:08

the way to the remaining 16 is the

play07:10

multi-layer perceptron we're going to go

play07:12

through this in a lot more detail in

play07:14

future videos but I want to give you a

play07:15

sneak peek of something so

play07:18

here right at step seven is the heart of

play07:20

the attention mechanism it tells us

play07:23

where it's paying the most attention to

play07:25

amongst the words so let's look at the

play07:27

word he you'll notice the largest

play07:29

value here

play07:31

0.48 is highest right here so it's

play07:33

taking the word he and it's realizing

play07:35

that most likely is referring to the

play07:37

word Mike 0.48 is larger than any of the

play07:40

other values so it's going to influence

play07:43

the values it passes to the multi-layer

play07:45

perceptron more than any of the other

play07:47

words the other words are getting

play07:50

a much smaller influence on the output

play07:52

it passes along let's take the word

play07:54

moves again you'll notice that it's

play07:56

looking most at the word Mike and then

play07:58

the next other word it's looking most at

play08:00

is quick so it's going to use the

play08:01

information from those two words again

play08:03

that it passes to the next layer to try

play08:05

and interpret the value or meaning of

play08:08

the word

play08:12

moves okay we're almost at the end the

play08:15

last step is the language head which

play08:18

figures out what the actual next likely

play08:20

token is what it does is it takes the

play08:23

output of the final block and converts

play08:25

it into a set of probabilities across

play08:27

all the known tokens in its dictionary

play08:30

and then it picks from amongst the most

play08:32

likely tokens randomly one of those

play08:35

tokens to complete the

play08:37

sentence in this case it's picked simply

play08:40

the highest probability token which was

play08:42

quickly and fills that in let's take a

play08:44

look at the

play08:45

spreadsheet now in the spreadsheet you'll see

play08:48

this is broken across three tabs layer

play08:51

Norm which is a process we'll talk about

play08:53

in a future video generating logits and

play08:56

a softmax again concepts we'll talk about

play08:58

later to finally get our predicted

play09:00

token now in a true large language model

play09:03

that you've probably played with it

play09:05

actually picks from amongst a set of the

play09:07

most likely tokens but in order to

play09:09

simplify this sheet we just simply pick

play09:12

the very most likely token which

play09:14

gives a very consistent output that's

play09:16

why we've got a Max function it's just

play09:18

simply taking the most likely output

play09:20

this is what's known as having

play09:21

temperature zero when you go outside of

play09:23

temperature zero it starts picking from

play09:25

more than just the top token and it

play09:27

starts looking at the top 10 or 20 or 30

play09:30

or more tokens and it picks from them

play09:32

according to an

play09:37

algorithm okay that's gpt2 at a glance

play09:41

we'll be going through each of these

play09:42

steps in future videos but for now I

play09:44

hope that gives you a starting point as

play09:46

to what's going on under the hood and

play09:49

where you can see it happening live for

play09:51

yourself inside the spreadsheet thank

play09:55

you
