Demystifying how GPT works: From Architecture to...Excel!?!
Summary
TLDR: This video series shows how to implement GPT-2, a large language model and early ancestor of ChatGPT, in a spreadsheet. Using GPT-2 small as the example, it walks through everything from splitting text into tokens and mapping each token to a list of numbers, up to the model's structure including multi-headed attention and the multi-layer perceptron, all with basic spreadsheet functions. This approach gives a deeper understanding of how modern AI actually works. Future videos will explain each of these steps in detail.
Takeaways
- This series implements the large language model GPT-2 using only basic spreadsheet functions.
- Text is split into tokens based on a predefined dictionary.
- Tokens are mapped to token IDs using an algorithm called byte-pair encoding.
- Each token is mapped to a list of 768 numbers that capture its meaning and position.
- The embedding of a token reflects both its meaning and its position within the prompt.
- Relationships between tokens are analyzed through multi-headed attention and a multi-layer perceptron (a kind of neural network).
- The output of each block is used as the input to the next; GPT-2 repeats this process across 12 different layers.
- The attention mechanism identifies the important words in the sentence and how they relate.
- The multi-layer perceptron determines a word's most likely meaning in the given context.
- The final language head selects the most likely next token and adds it to the sentence (see the sketch after this list).
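To make these takeaways concrete, here is a minimal end-to-end sketch of the pipeline in Python. It is an illustration only: the weight matrices are random stand-ins rather than real GPT-2 weights, the token IDs are hypothetical, and the block body is stubbed out (the video implements all of this in spreadsheet formulas).

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D_MODEL, N_CTX, N_BLOCKS = 50257, 768, 1024, 12  # GPT-2 small sizes

wte = rng.normal(size=(VOCAB, D_MODEL)) * 0.02  # token embeddings (random stand-ins)
wpe = rng.normal(size=(N_CTX, D_MODEL)) * 0.02  # position embeddings

def block(x):
    # placeholder for one attention + multi-layer perceptron block;
    # both pieces are sketched in the Q & A section below
    return x

token_ids = np.array([101, 7, 42, 9, 311])           # hypothetical IDs, "Mike is quick he moves"
x = wte[token_ids] + wpe[np.arange(len(token_ids))]  # meaning + position
for _ in range(N_BLOCKS):                            # 12 chained blocks
    x = block(x)
logits = x[-1] @ wte.T                               # language head: score every known token
next_id = int(np.argmax(logits))                     # temperature zero: pick the top token
```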
Q & A
How is text processed in the spreadsheet implementation of GPT-2?
- The text is first split into tokens. Each word is converted to tokens based on a predefined dictionary, and the spreadsheet's prompt-to-tokens tab maps them to final token IDs with the byte-pair encoding algorithm.
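The spreadsheet performs byte-pair encoding with formulas in the prompt-to-tokens tab. Outside a spreadsheet, the same GPT-2 tokenization can be reproduced with the tiktoken library, assuming it is installed; this is a stand-in for the spreadsheet's method, not part of the video:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")          # GPT-2's byte-pair-encoding vocabulary
ids = enc.encode("Mike is quick he moves")   # text -> list of token IDs
print(ids)
print([enc.decode([i]) for i in ids])        # the text fragment behind each token
```

Note that a single word is not always one token; rarer words are split into two, three, or more tokens.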
What are embeddings, and how are they used in GPT-2?
- Embedding is the process of mapping each token to a list of numbers. In GPT-2 small, each token is mapped to a list of 768 numbers that capture the token's meaning and position.
What is the purpose of position embeddings?
- Position embeddings capture where a token sits in the prompt by slightly altering its embedding values according to that position. This lets the model distinguish the same word when it appears at different positions in the prompt.
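A minimal sketch of how token and position embeddings combine, with random matrices standing in for the learned weights and hypothetical token IDs:

```python
import numpy as np

rng = np.random.default_rng(0)
wte = rng.normal(size=(50257, 768)) * 0.02   # one 768-number row per vocabulary token
wpe = rng.normal(size=(1024, 768)) * 0.02    # one 768-number row per position

token_ids = [101, 7, 42, 9, 311]             # hypothetical IDs for "Mike is quick he moves"
positions = np.arange(len(token_ids))

# The same token at different positions gets slightly different values,
# because a position-dependent row is added to its meaning row.
x = wte[token_ids] + wpe[positions]
print(x.shape)                               # (5, 768): one embedding per prompt token
```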
What is the role of multi-headed attention?
- Multi-headed attention captures context by identifying the important words in a sentence and working out how they relate to one another, for example recognizing that 'he' refers to 'Mike'.
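A single-head sketch of how attention weights are computed; GPT-2 runs several such heads in parallel, and the projection matrices here are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
d_head = 64                                   # per-head dimension
x = rng.normal(size=(5, 768))                 # embeddings for "Mike is quick he moves"
Wq = rng.normal(size=(768, d_head)) * 0.02    # random stand-in projections
Wk = rng.normal(size=(768, d_head)) * 0.02

queries, keys = x @ Wq, x @ Wk
scores = queries @ keys.T / np.sqrt(d_head)   # how strongly each token attends to each other
mask = np.triu(np.ones((5, 5), dtype=bool), k=1)
scores[mask] = -np.inf                        # causal mask: no attending to later tokens
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
print(weights.round(2))                       # row i shows where token i pays attention
```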
What is the function of the multi-layer perceptron?
- The multi-layer perceptron distinguishes among the multiple meanings a word can have and selects the one that best fits the context. This lets the model predict the next word or token more accurately.
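A sketch of the two-layer perceptron inside each block. GPT-2 small expands 768 dimensions to 3072 and back and uses a GELU activation; the weights below are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden = 768, 4 * 768                     # 768 -> 3072 -> 768
W1, b1 = rng.normal(size=(d, hidden)) * 0.02, np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, d)) * 0.02, np.zeros(d)

def gelu(v):
    # tanh approximation of GELU, the activation GPT-2 uses
    return 0.5 * v * (1 + np.tanh(np.sqrt(2 / np.pi) * (v + 0.044715 * v**3)))

x = rng.normal(size=(5, d))                  # attention output for the 5 prompt tokens
out = gelu(x @ W1 + b1) @ W2 + b2            # refine each token's representation
print(out.shape)                             # (5, 768)
```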
What is the role of the language head?
- The language head converts the output of the final block into a set of probabilities over the known tokens in the dictionary and selects the most likely token to complete the sentence.
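A sketch of the language head. GPT-2 reuses the token-embedding matrix to score every known token, and a softmax turns the scores into probabilities; the inputs here are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
wte = rng.normal(size=(50257, 768)) * 0.02   # token embeddings, reused as output weights
x_last = rng.normal(size=768)                # final block's output for the last token

logits = wte @ x_last                        # one score per token in the dictionary
probs = np.exp(logits - logits.max())        # softmax, shifted for numerical stability
probs /= probs.sum()
print(int(np.argmax(probs)))                 # ID of the most likely next token
```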
How is the next token selected in the spreadsheet implementation of GPT-2?
- The spreadsheet selects the most likely token based on the probabilities generated from the final block's output. In this demo, the token with the single highest probability is always chosen.
What is the role of each block in GPT-2's iterative process?
- Each GPT-2 block contains an attention mechanism and a multi-layer perceptron; it receives input, processes it, and produces the output that feeds the next block. This process repeats across 12 different layers, or blocks.
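A sketch of how the 12 blocks chain together, with the attention and perceptron sub-steps stubbed out. Real GPT-2 blocks also apply layer normalization and residual additions, shown here in simplified form:

```python
import numpy as np

def layer_norm(x):
    # normalize each token's vector (the real model adds learned scale and shift)
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def attention(x):
    return x  # stub: see the attention sketch above

def mlp(x):
    return x  # stub: see the perceptron sketch above

def block(x):
    x = x + attention(layer_norm(x))  # residual: add the refinement to the input
    x = x + mlp(layer_norm(x))
    return x

x = np.random.default_rng(0).normal(size=(5, 768))
for _ in range(12):                   # the output of each block feeds the next
    x = block(x)
```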
Can you give an example of how a token is mapped to an embedding?
- For example, the word 'Mike' is mapped to a token ID and then converted into a list of 768 numbers. Together these numbers represent the word's meaning and its position.
What does temperature zero mean?
- Temperature zero means the model always selects the single most likely token, which gives consistent output. Sampling from a wider set of tokens instead introduces variety.
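A sketch contrasting the spreadsheet's temperature-zero choice (a plain argmax, the MAX function in the sheet) with temperature sampling; the logits are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=50257)              # stand-in scores from the language head

# Temperature zero: always the single most likely token (consistent output).
greedy_id = int(np.argmax(logits))

# Temperature > 0: rescale the scores, then sample (varied output).
temperature = 0.8
scaled = logits / temperature
probs = np.exp(scaled - scaled.max())
probs /= probs.sum()
sampled_id = int(rng.choice(len(probs), p=probs))
```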
Outlines
Overview of GPT-2 in a spreadsheet
This paragraph explains that the structure and processing flow of GPT-2 are implemented using a spreadsheet. It outlines tokenizing the input text, generating embeddings, and iteratively processing blocks that use attention and a multi-layer perceptron.
Keywords
Tokenization
Embeddings
Highlights
The transcript walks through implementing GPT-2 in a spreadsheet using basic functions
The spreadsheet implements a smaller version called GPT-2 small but has the same architecture
Input text is split into tokens using byte-pair encoding
Tokens are mapped to lists of numbers called embeddings that capture meaning and position
There are 12 blocks with attention and multi-layer perceptron layers to refine predictions
Attention figures out which words are most relevant to refine the predictions
The final step predicts the most likely next token to complete the prompt
The spreadsheet picks the token with the highest probability for simplicity
The input text is parsed into tokens that map to IDs
Embeddings capture position as well as meaning of tokens
Attention identifies which words have the most influence on predictions
The blocks implement attention and neural network layers iteratively
Attention helps disambiguate meanings of words for the neural network
The final output predicts and selects the most likely next token
The spreadsheet uses the token with maximum probability for consistency
Transcripts
welcome to spreadsheets are all you need
how GPT Works where if you can read a
spreadsheet you can understand modern AI
That's because in this series we're
walking through a spreadsheet that
implements a large language model
entirely in basic spreadsheet functions
and not just any large language model
we're implementing gpt2 an early
ancestor of chat GPT now because it is a
spreadsheet it can only support a
smaller context length and it does
implement the smallest form of gpt2
known as gpt2 small but architecturally
for all intents and purposes it's the
same model that was breaking headlines
just a few short years ago let's take a
look under the hood how it works now in
subsequent videos we're going to go
through each of these stages step by
step but for now I'm going to touch on
each one lightly as a kind of table of
contents for future videos
in addition I've added a final column
here on the right that indicates what
tab in the spreadsheet corresponds to
what action inside
gpt2 let's start at the beginning after
you input your text it is split into a
series of tokens so for example let's
take Mike is quick he moves this would
be split into tokens per a predefined
dictionary now you'll note that every
single word here corresponds to a single
token but that is not always the case in
fact it's not uncommon for a single word
to be split into two three or even more
tokens let's take a look at the
spreadsheet so here's where you input
your prompt and because of the way the
parsing works you have to put each word
on a separate line and you have to add the
spaces as well as the punctuation it
then gets taken to this sheet or
tab called prompt to tokens where it
goes through an algorithm called byte
pair encoding to map it to a final list
of known token IDs you see right
here now that we have the tokens we need
to map them to a series of numbers
called an embedding every token is
mapped to a long list of numbers in the
case of gpt2 small it's a list of
768 numbers these capture both the
meaning as well as the position of each
token in the prompt let's see how this
works inside the
spreadsheet
okay so here we are in the spreadsheet
that implements this it's tokens to text
embeddings Tab and there's two parts to
it at the top you'll see our prompt
tokens Mike is quick he moves and these
are those prompt IDs we saw from the
earlier step and then from columns three
onwards is the list of 768 numbers
that represent the semantic meaning of
the word Mike let's go look at column
770 and we can see where this list
ends right here you can see the list
ending let's go back to the
beginning and you'll notice there's
another list here the job of this list
is to actually change the tokens from
the list above to reflect their
different positions in the prompt let me
explain and demonstrate that here by
changing this word moves to the word
Mike which is the first
word in our prompt we'll go through
here we'll recalculate our
tokens we'll see we get Mike again then
we go back to our tokens to text embeddings
we'll calculate the sheet and you'll
notice that Mike here has the same ID
and has the exact same embedding values
as it does up here right row two and
row seven are totally identical that's
because the only job of this first set
of rows is to capture the semantic
meaning but when we take a look here at
this part where we have the position
embeddings you'll notice that the values
of the embedding for Mike at position
one are different than the values for
Mike at position six we've effectively
altered the values of the embeddings for
Mike slightly to reflect its different
position in the
prompt okay now that we've captured both
the meaning and the position of the tokens
in the prompt they pass on to a series
of layers or blocks the first is
multi-headed attention and then the
second is what's known as a multi-layer
perceptron that's another name for a
neural network let's consider our
sentence again Mike is quick he moves
where we want the Transformer or GPT to
fill in the last word the attention
mechanism the first phase tries to
figure out what are the most important
words in the sentence and how they
relate so for example the word he it
might recognize as referring to Mike
earlier in the prompt or it might
realize that the word moves and quick
probably relate this information is
important for the next layer the
multi-layer
perceptron so take for example this word
quick it has multiple meanings in
English it can mean moving fast it can
mean bright as in quick of wit it can
mean a body part as in the quick of your
fingernail and in Shakespearean English
it can even mean alive as opposed to
dead as in the phrase the quick and the
dead the information from the attention
layer that the word moves is there with
the word quick helps the multi-layer
perceptron disambiguate which of these
four meanings is most likely in this
sentence and that it's most likely the
first one moving in physical space and
it would use that to figure out what the
most likely next word to complete the
prompt is like the word quickly or the
word fast or the word around all of
which are about fast movement in
physical
space it's also important to note that
this attention then perceptron attention
then perceptron process happens
iteratively in gpt2 small it happens
across 12 different layers as it
iteratively refines its prediction of
what the next most likely word or token
should be
let's see how this is implemented in the
spreadsheet so you'll notice in the
spreadsheet there are these tabs block
zero block one block two all the way to
block 11 these are our 12 blocks and the
output of block zero becomes the input
of block one and the output of block one
becomes the input of block two so
they're all chained together all the way
through let's look inside one of these
blocks so here's the first block and
each block has about 16 steps in this
implementation steps one all the way to
around step 10 are basically your
attention mechanism and from Step 10 all
the way to the remaining 16 is the
multi-layer perceptron we're going to go
through this in a lot more detail in
future videos but I want to give you a
sneak peek of something so
here right at step seven is the heart of
the attention mechanism it tells us
where it's paying the most attention to
amongst the words so let's look at the
word he you'll notice the largest
value here
0.48 is right here so it's
taking the word he and it's realizing
that most likely is referring to the
word Mike 0.48 is larger than any of the
other values so it's going to influence
the values it passes to the multi-layer
perceptron more than any of the other
words the other words are getting
a much smaller influence on the output
it passes along let's take the word
moves again you'll notice that it's
looking most at the word Mike and then
the next other word it's looking most at
is quick so it's going to use the
information from those two words again
that it passes to the next layer to try
and interpret the value or meaning of
the word
moves okay we're almost at the end the
last step is the language head which
figures out what the actual next likely
token is what it does is it takes the
output of the final block and converts
it into a set of probabilities across
all the known tokens and its dictionary
and then it picks from amongst the most
likely tokens randomly one of those
tokens to complete the
sentence in this case it's picked simply
the highest probability token which was
quickly and fills that in let's take a
look at the
spreadsheet now in the spreadsheet you'll see
this is broken across three tabs layer
Norm which is a process we'll talk about
in a future video generating logits and
a softmax again concepts we'll talk about
later to finally get our predicted
token now in a true large language model
that you've probably played with it
actually picks from amongst a set of the
most likely tokens but in order to
simplify this sheet we just simply pick
from the very most likely token which
gives a very consistent output that's
why we've got a Max function it's just
simply taking the most likely output
this is what's known as having
temperature zero when you go outside of
temperature zero it starts picking from
more than just the top token and it
starts looking at the top 10 or 20 or 30
or more tokens and it picks from them
according to an
algorithm okay that's gpt2 at a glance
we'll be going through each of these
steps in future videos but for now I
hope that gives you a starting point as
to what's going on under the hood and
where you can see it happening live for
yourself inside the spreadsheet thank
you