Byte Pair Encoding in AI Explained with a Spreadsheet
Summary
TLDR: The video delves into the intricacies of tokenization and byte pair encoding (BPE), essential components in the operation of large language models like GPT-2. It explains how morphemes, the smallest units of meaning in a language, enable the understanding of even made-up words. It outlines the tokenization process, where text is broken down into tokens, and how BPE identifies common subword units to handle an extensive vocabulary efficiently. The video also addresses the limitations of character-based and word-based tokenization, which increase memory and compute requirements. It demonstrates the BPE algorithm's learning phase on a simplified example and shows its application in a spreadsheet, illustrating how a word like 'flavorize' is tokenized. It concludes by noting BPE's limitations, such as the 'Solid Gold Magikarp' effect and its English-centric nature, and mentions alternative tokenization methods and the flexibility of tokens to represent other types of data.
Takeaways
- 📚 **Tokenization**: The process of converting text into tokens, which are the subword units that a language model like GPT-2 understands and uses for processing.
- 🔍 **Byte Pair Encoding (BPE)**: An algorithm used for subword tokenization that learns common subword units from a corpus and then tokenizes input text into these units.
- 🔑 **Morphemes**: The smallest units of meaning in a language, which BPE aims to capture by breaking down words into meaningful parts.
- 📈 **Vocabulary Size**: GPT-2 uses around 50,000 tokens, a vocabulary size that balances the costs of character-based and word-based tokenization: small enough to keep the model compact, large enough to keep sequences short.
- 🧠 **Model Parameters**: The GPT-2 model has 124 million parameters, a count that would nearly double if word-based tokenization had to cover the entire English vocabulary.
- 🌐 **Corpus Learning**: BPE starts with a corpus of text and iteratively learns the most frequent character pairs to build its vocabulary of tokens.
- ✂️ **Tokenization Process**: Involves breaking down input text into tokens based on the learned vocabulary, with the algorithm prioritizing certain subword units over others.
- ⚙️ **Handling Unknown Words**: BPE can handle unknown or misspelled words better than a simple word-to-number mapping, although it may not always align with a native speaker's expectations.
- 📉 **Solid Gold Magikarp Effect**: A problem where certain strings are learned as tokens by the tokenization algorithm but appear so rarely in the training data that the model almost never outputs them, leading to unexpected responses.
- 🌐 **Language Centrism**: BPE is more effective with languages like English that have clear word separation, but it may not be as effective for languages with different linguistic structures.
- 🔄 **Flexibility in Tokenization**: Tokens are not limited to text and can be used to represent other types of data, such as audio or image patches, for processing through a Transformer model.
Q & A
What is the term 'funology' in the context of the video?
-The term 'funology' is a made-up word that combines 'fun' with the suffix '-ology', which typically denotes a field of study. In the context of the video, it's used to illustrate how people can extrapolate the meaning of such coined words even when they are not found in a dictionary.
What is tokenization in the context of language models?
-Tokenization is the process of converting text into a format that a language model can understand, which involves breaking down the text into its constituent parts or tokens. This is a crucial step as language models like GPT-2 only understand numbers, not text.
How does Byte Pair Encoding (BPE) work in tokenization?
-Byte Pair Encoding (BPE) is a subword tokenization algorithm used by models like GPT-2. It operates in two phases: first, it learns common subwords from a corpus of text to create a vocabulary, and second, it tokenizes new input text using this vocabulary. BPE identifies and merges the most frequently occurring pairs of symbols (letters, subwords, or words) into single tokens, which helps in handling large vocabularies efficiently.
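As a rough illustration of both phases, here is a toy sketch in Python. This is not GPT-2's actual tokenizer (which also works at the byte level and applies merges by rank); it just shows the learn-then-tokenize structure the answer describes:

```python
from collections import Counter

def merge(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_merges(words, num_merges):
    """Phase 1: repeatedly merge the most frequent adjacent symbol pair."""
    corpus = [list(w) for w in words]          # start from single characters
    merges = []
    for _ in range(num_merges):
        counts = Counter(p for w in corpus for p in zip(w, w[1:]))
        if not counts:
            break
        best = counts.most_common(1)[0][0]     # most frequent pair wins
        merges.append(best)
        corpus = [merge(w, best) for w in corpus]
    return merges

def tokenize(word, merges):
    """Phase 2: apply the learned merges, in learned order, to new input."""
    symbols = list(word)
    for pair in merges:
        symbols = merge(symbols, pair)
    return symbols

# The toy corpus from the video ('_' marks the end of a word):
words = ["low_"] * 5 + ["lower_"] * 2 + ["newest_"] * 6 + ["widest_"] * 3
merges = learn_merges(words, num_merges=10)
print(tokenize("lowest_", merges))             # ['low', 'est_']
```

Note how "lowest", a word never seen during learning, still comes out as two meaningful subword units.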
Why is BPE preferred over character-based or word-based tokenization?
-BPE is preferred because it strikes a balance between the two. Character-based tokenization creates longer sequences and puts more work on the training algorithm, while word-based tokenization may not handle unknown or misspelled words well and requires a larger model size to accommodate a full vocabulary.
What is the 'solid gold Magikarp' effect in language models?
-The 'solid gold Magikarp' effect refers to a situation where a language model fails to repeat back certain tokens or strings accurately. This can occur when a token is learned by the tokenization algorithm but has a low probability of being output due to its infrequent occurrence in the training data.
How does BPE handle complex or unknown words?
-BPE can break down complex or unknown words into known subword units or tokens based on the vocabulary it has learned. If a word or subword is not in its vocabulary, BPE will tokenize it into the closest matching subword units it does recognize.
What are embeddings in the context of language models?
-Embeddings are numerical representations of words or tokens that capture their semantic meaning. Each token is transformed into a high-dimensional vector space, where each dimension represents some aspect of the word's semantic meaning. These embeddings are used as inputs to the neural network within a language model.
Why is the vocabulary size in GPT-2 around 50,000 tokens?
-The vocabulary size of around 50,000 tokens in GPT-2 is a compromise between model efficiency and expressiveness. A larger vocabulary would increase the model size and computational requirements, while a smaller vocabulary might not capture enough nuances of the language.
What is the significance of the embedding dimension in language models?
-The embedding dimension refers to the size of the vector space used to represent each token. It is a hyperparameter that determines the richness of the representation. A higher embedding dimension can capture more nuances but also increases the computational complexity.
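For intuition, a token's embedding is just a row read out of that matrix. A toy sketch with NumPy, using GPT-2 small's shapes but a random stand-in matrix and a hypothetical token ID:

```python
import numpy as np

vocab_size, embedding_dim = 50257, 768      # GPT-2 small's wte shape
wte = np.random.randn(vocab_size, embedding_dim).astype(np.float32)

token_id = 5112                             # hypothetical token ID
embedding = wte[token_id]                   # one row = one token's vector
print(embedding.shape)                      # (768,)
```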
How does the BPE algorithm decide which pairs of characters to merge?
-The BPE algorithm decides which pairs to merge based on frequency. It identifies the most frequently occurring pairs of characters (or existing tokens) and merges them into a single token, thus gradually building up a vocabulary that represents common subwords in the language.
What are some limitations or challenges of using BPE?
-BPE has some limitations, including being more effective for languages with clear word separation and less effective for languages where word separation principles differ. It can also lead to issues like the 'solid gold Magikarp' effect and may not always perfectly align with a native speaker's expectations of word boundaries.
Outlines
😀 Understanding Funology and Tokenization
This paragraph introduces the concept of 'funology,' a made-up word that illustrates how morphemes help us understand new words. It sets the stage for discussing tokenization, a process that breaks text into tokens that a language model like GPT-2 can understand. The paragraph explains that tokenization doesn't just split sentences into words but can also break single words into multiple tokens, which is crucial for a language model to convert text into a numerical form it can process.
🧮 Tokenization and the GPT-2 Model
The second paragraph delves into the technical aspects of tokenization within the GPT-2 model. It discusses the challenges of converting words into numbers and the limitations of direct word-to-number mapping. The paragraph also highlights the trade-offs involved in using a large vocabulary size and the computational implications of such a model. The explanation includes a practical demonstration using a spreadsheet to show how the GPT-2 model uses a text embedding matrix to represent tokens, and the impact of expanding the vocabulary on the model's parameters.
🔍 Exploring Subword Tokenization
This paragraph explores the concept of subword tokenization as a middle ground between character-based and word-based tokenization. It explains the two phases of the Byte Pair Encoding (BPE) algorithm used by GPT-2: the learning phase, where common subwords are identified from a text corpus, and the tokenization phase, which processes user input into tokens. The paragraph uses an example to illustrate how BPE learns from a small corpus and builds a vocabulary that can tokenize a given text, emphasizing the algorithm's flexibility and efficiency.
📝 Implementing BPE Tokenization in a Spreadsheet
The fourth paragraph provides a detailed walkthrough of implementing BPE tokenization within a spreadsheet application. It demonstrates how to break down words into characters, form possible pairs, calculate scores for these pairs, and merge them based on the highest score. The process is shown step-by-step, including handling edge cases like blank characters and ensuring that tokens are correctly propagated through multiple passes of the algorithm.
🔄 Iterative Process of BPE Tokenization
This paragraph continues the explanation of the iterative process of BPE tokenization within the spreadsheet. It shows how the algorithm progresses through multiple passes, refining the tokenization of a word each time. The paragraph also points out that the algorithm converges on the correct tokenization relatively quickly and continues to propagate the result through subsequent passes. It emphasizes the algorithm's ability to handle different words and the nuances of its operation.
🚧 Caveats and Considerations of BPE
This paragraph discusses the limitations and trade-offs associated with BPE tokenization. It mentions issues like the 'Solid Gold Magikarp' effect, where certain strings are not repeated back correctly by the language model. The paragraph also addresses the English-centric nature of BPE and its potential shortcomings in other languages. It concludes by noting that tokenization is not limited to text and can be applied to other forms of data, such as audio or image patches, which can be translated into tokens for processing by a Transformer model.
📈 Future Coverage of the GPT-2 Model
The final paragraph briefly mentions that future videos will cover the remaining components of the GPT-2 model, including text and position embeddings. It serves as a transition, indicating that the current discussion on tokenization is part of a larger series exploring the intricacies of modern AI and language models.
Keywords
💡Funology
💡Tokenization
💡Byte Pair Encoding (BPE)
💡Embedding Matrix
💡Morphemes
💡Transformer Architecture
💡Subword Units
💡Corpus
💡Parameter
💡Solid Gold Magikarp Effect
💡Context Window
Highlights
Funology is a made-up word, but it can be understood by combining the word 'fun' with the suffix '-ology', demonstrating how morphemes help in understanding language.
Tokenization is the process of converting text into tokens, which can be words, subwords, or characters, and is a crucial step in language model implementation.
GPT-2, an early precursor to ChatGPT, uses a method called byte pair encoding (BPE) for tokenization, which breaks down words into subword units.
Byte pair encoding has two phases: learning the common subwords in a language and then using that vocabulary to tokenize input text.
BPE starts by counting character pairs and iteratively merges the most frequent pairs, building a vocabulary that the model can use for tokenization.
The tokenization process in BPE is not always perfect and may not align with a native speaker's expectations, but it captures common subword units that tend to have meaning.
Assigning each word a number is not practical due to the inability to handle unknown or misspelled words and the large vocabulary size increasing the model's memory and compute requirements.
The GPT-2 model has about 50,000 tokens, and increasing the vocabulary to include all English words would nearly double the model's parameters to 216 million.
Character-based tokenization creates longer sequences, and individual characters carry little semantic information, making it harder for the model to learn the language.
Subword tokenization, like BPE, strikes a balance between character and word tokenization, providing flexibility and efficiency in processing language.
The learning phase of BPE involves creating a corpus of text and iteratively merging the most frequent character pairs until a vocabulary is established.
The tokenization phase uses the established vocabulary to convert user input into tokens that can be processed by the language model.
BPE's tokenization algorithm prioritizes certain subword units over others, which can result in different tokenization outcomes for similar characters in different contexts.
BPE has its limitations, including issues like the 'Solid Gold Magikarp' effect where certain strings are not accurately repeated by the model.
BPE is very English-centric and may not work well for languages where word separation principles do not apply.
There are other tokenization algorithms and Transformer architectures that use character-based tokenization, which may address some of BPE's shortcomings.
Tokens are not limited to text and can represent any data type that can be translated into numbers, allowing for the processing of various data such as audio or images.
Transcripts
Suppose your friend told you they were an expert in funology. Now, this isn't a word you'll find in the dictionary, but because it combines the word "fun" with the suffix "-ology," you'd be able to extrapolate the meaning if they were trying to tell you they were an expert in making things enjoyable. Or suppose your friend told you they were going to flavorize a bland soup. Again, not a real word, but you'd be able to figure out they were telling you they could make their soup tasty. Or finally, let's suppose they told you they were going to chillify your party by dimming the lights and playing smooth jazz. Again, you'd be able to understand they were trying to make the ambiance more relaxed.

What all these examples have in common is that even though they're made-up words, you're able to figure them out thanks to what linguists call morphemes. These are subword units that carry meaning, and it turns out that when you're busy typing away into a large language model like ChatGPT, it's actually using a similar set of clues to figure out what you're saying as well. And that's the subject of today's video on tokenization and byte pair encoding.

Welcome to Spreadsheets Are All You Need. If you're just joining us, Spreadsheets Are All You Need is a series of tutorials on how modern artificial intelligence works through the lens of a spreadsheet. That's right: if you can read a spreadsheet, you can understand modern AI. That's because this spreadsheet that you see here, which you can download on the website, actually implements a large language model all the way from a prompt to getting a predicted token out. In fact, it doesn't just implement any large language model; it implements the entire GPT-2 architecture, an early precursor to ChatGPT that was state-of-the-art just a few years ago.
Now, for the purposes of today's episode, here's the problem we're trying to wrestle with. If you take a look at this spreadsheet, or if you look at the internals of GPT-2, what you'll notice is that, sure, we're typing in text here, but as we go deeper into the implementation we just see table after table after table of numbers. In essence, the Transformer architecture that's at the heart of GPT-2 only understands numbers, yet you're inputting text. What we need is a process that can convert text into numbers. There are two steps to that, and in this video we're going to talk about the first step, what's known as tokenization.

When I introduced tokenization in a previous video, I used the example "Mike is quick. He moves" and showed how it was broken into separate words. But tokenization doesn't just break a sentence into its words; it's not uncommon for a single word to be broken into one, two, three, or more separate tokens. Let's see this in action in the spreadsheet.
So here, this tab, Type Prompt, is where we enter our prompt, with each word or punctuation mark on a separate line, and note that we have to add the spaces in here manually. Then Prompt to Tokens is where that text gets broken out into separate tokens, as you can see here: "Mike," "is," "quick," period, "he," "moves," and underneath we'll see the token ID. For now, just think of the token ID as its position inside the dictionary of known tokens; we'll be talking more about that in later videos. For now, let's just try some of the examples we used in the introduction to see tokenization for more complex words, like "I will flavorize the soup." And then, because this is such a large spreadsheet, remember that we've got manual calculation turned on, so to calculate we need to actually hit this Calculate Sheet button and then just wait a little bit.

Okay, here you can see the word "flavorize" has been broken into two parts: the suffix "ize" as well as the word "flavor" with the space in front of it. Let's try one more example: "Let us chillify the party." And again, you see "chillify" has been broken into the suffix "ify" along with the word "chill."

Now, byte pair encoding isn't always perfect, so let's talk about an example where it fails: "The brace will prevent reinjury." To you and me, "reinjury" is the word "injury" with the prefix "re" in front of it, but as you see here, in byte pair encoding it's actually broken into "rein" and then the word "jury." So it doesn't always line up with a native English speaker's expectations, but it does do a good job of capturing the most common subword units that tend to have meaning.
meaning now you're probably wondering
why we don't just turn words into
numbers by just assigning each word a
number like dog as one cat as two and so
forth throughout the entire dictionary
well this has a couple problems the
first is that this isn't able to handle
unknown or misspelled Words which are
actually really common if you're say
scraping all the text on the internet
now that being said there are
Transformer models that do have a
special unknown token so it's not an
insurmountable problem another problem
is just supporting a large vocabulary
size increases the size of the
model gpd2 has about 50,000 tokens but
English has about three times that
170,000 words and so the end result is
that it creates more memory and more
compute needed to run and train the
model we can actually demonstrate this
inside the
spreadsheet so inside the spreadsheet is
a specific tab called Model wte what's
known as the text embedding Matrix the
important thing to know here is that
each row of this Matrix corresponds to a
single word so it's the entire
vocabulary size I in this case
50257 because they're 50,00
257 known tokens inside the gpt2 model
now the width of this Matrix is
something called the embedding Dimension
we'll learn about that in later videos
for now just know it's about 760
columns the key point is that the entire
gpt2 model we're playing with is 124
million parameters if we were to take
this Matrix and add 120,000 more rows
effectively what we would need to have
all the words in the English language in
a word Style embedding we're basically
actually nearly doubling the number of
parameters to 216 million that's a huge
increase in the amount of cute needed
just to calculate a token let's go see
this in the
spreadsheet so here we are in the sheet
So here we are in the sheet, and if you look at the list of all the tabs in this workbook, you'll notice a large number of them start with "model_"; these are basically tables of numbers like this one. When somebody talks to you about the number of parameters in a model, it's really just how many numbers are in every single one of these tabs across the entire model, and in this case, in this spreadsheet, there are 124 million of them. Now, the matrix we were talking about before is this matrix, wte, and here you can see it's got 768 columns and 50,257 rows, where each row corresponds to a single token.

So let's do some calculations; create a blank workbook. We know that the original matrix height is 50,257. We know that the original matrix width is 768. If we were going to add enough rows to accommodate all the words in the English language in a word-style tokenization, then we would need additional rows; we know the total is 170,000 rows for word tokenization. That means the additional rows, this minus that, come to about 120,000. So let's work out how many additional parameters that would be. Well, the additional parameters are simply the number of additional rows times the width of each row, which is 768. Let's expand this out so it's easier to see, add some commas so it's easier to read, and get rid of those extra zeros. And then remember that the entire model is about 124 million, so this is the original model size. So basically we're nearly doubling the size of the model: we're adding about 92 million more parameters just to accommodate all the words in the English language. Using byte pair encoding, we can do all of this a lot more flexibly, using only 50,000 tokens.
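The same back-of-the-envelope arithmetic from the spreadsheet, as a quick Python check (all figures as quoted in the video):

```python
original_rows = 50_257            # GPT-2's token vocabulary (wte height)
embedding_dim = 768               # width of each wte row
english_words = 170_000           # rough size of the English vocabulary

additional_rows = english_words - original_rows       # 119,743
additional_params = additional_rows * embedding_dim   # 91,962,624

print(f"{additional_params:,}")   # ~92 million new parameters on top of
                                  # the original 124 million: nearly double
```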
Now you might be wondering: why don't we go the other way and use character-based tokenization? Give each letter a number, so "a" equals 1, "b" equals 2, and at least in the English language there aren't that many characters, so you're not going to have that many rows. If you're familiar with the ASCII encoding or you come from a developer background, you're probably used to this kind of setup. Well, this has its own couple of problems.

The first is that it actually creates longer sequences. Let me show you. Let's close this and go to where we enter our prompt, right here. After we've turned the prompt into tokens, those tokens are turned into what are called embeddings. These are long vectors: each of these tokens goes into a row of 768 values. The key point, though, is that this matrix here is basically what you probably know as the context length: when you type into something like ChatGPT, it is as long as the number of tokens. Right now you can see it's only about six tokens long, but if each of these were split out by character, it would be a lot longer. So the context length we have to process, for something that's in this case only six words long, is going to be a lot longer, and this gets carried through: this length of six gets carried through not just here but through every stage in every layer of the entire Transformer pipeline. So in effect, even though with character-based tokenization the size of the model (that wte matrix) may be smaller, the length of the context window needed to process the same amount of text gets longer through the entire inference and training pass. That means, again, more memory and more compute.
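A quick sketch of that sequence-length point, using the prompt from earlier (the six-token figure is the one quoted in the video; the exact BPE split depends on the vocabulary):

```python
text = " Mike is quick. He moves"

char_tokens = list(text)     # character-level tokenization
print(len(char_tokens))      # 24 positions in the context window

# GPT-2's BPE covers the same prompt in roughly 6 tokens, and that
# roughly 4x-shorter sequence is what every layer of the Transformer
# has to process, during both training and inference.
```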
There's another problem that's more subtle, which is that there's low semantic correlation in the characters themselves. As an example of this, you've probably seen the email that went around a couple of years ago where they've jumbled all the letters in the words but it's still readable. The first sentence basically says: according to research at Cambridge University, it doesn't matter what order the letters in a word are; the only important thing is that the first and last letter be in the right place. By the way, it's technically not true that the first and last letter need to be in the right place, but the point still stands: the meaning of a word isn't conveyed by just the characters themselves but by how they're grouped together. And because the characters themselves don't carry that information, it puts more work on the training algorithm to learn English and the meanings of words than if it were actually using words to begin with. In the end, both of these problems mean more memory and more compute: the model becomes harder to train and use.

So if character tokenization has its own problems, and word tokenization at the other end has another set of problems, maybe there's a Goldilocks option in between the two, between small and big, that's just right. It turns out there is, and that is subword tokenization.
Now, there's more than one subword tokenization algorithm out there. The one that GPT-2 uses is called byte pair encoding, or BPE, and it's got two phases. The first phase is a learning phase, where it learns what the common subwords are in a particular language, and the second phase is the actual tokenization phase, which takes input and turns it into tokens to be processed by the large language model.

For the first phase, I want you to imagine you've gathered a large body of text, maybe by scraping all the English language on the internet; this is sometimes referred to as a corpus of text. You pass that into the BPE learning algorithm, which turns it into a known vocabulary of tokens; we'll walk through this in a little more detail in the next few slides. Then, in the tokenization phase, we take the input from the user and combine it with the vocabulary we got from the first phase to output tokens that are then used in the later processing stages of the model. I'm going to illustrate the learning phase through some slides and then implement the tokenization phase inside the sheet, so you can see how it actually works.
To show the learning phase, I'm actually going to use the same example that was used in the paper that introduced this algorithm to machine learning. It's a short, readable paper, and it actually has a Python script in it if you want to try it out; I'll just run through it in slides here for now. I want you to imagine the part on the left, the corpus, is the result of maybe scanning all the English language we could find on the internet. Obviously it's not that long; it's a toy example to illustrate the process, and it has, in this case, four words: the word "low" occurring five times in our scan, the word "lower" occurring twice, the word "newest" occurring six times, and the word "widest" occurring three times. Then we'll start on the right with our vocabulary of just the characters from a to z. In order to make the process clearer, we're going to rewrite the corpus a little bit: I'll use the dot symbol to show the separation between individual characters, and I'll use this underscore character to mark the boundary where a word ends and another one might begin.

The first step in learning the vocabulary is to look at all the adjacent pairs of characters and count up which one occurs most frequently. So, for example, the letter "e" next to the letter "s" occurs nine times: six times in the word "newest" and three times in the word "widest," for a total of nine. So we write down "e" paired with "s" with a frequency of nine, and then we do this for all the adjacent pairs of characters inside our corpus. Then we look at the pair with the most frequent occurrence. In this case it's "e" and "s," although we could have picked "s" and "t," or "t" and the end-of-word character; they all have nine. We take that pair and add it to our vocabulary.

We then reprocess our corpus with our new vocabulary: we take all occurrences of "e" paired with "s" and transform them into a new "es" symbol. So now "es" together is going to be treated as if it were a single character, even though to us it's two separate characters. Then we repeat the process. Now the most frequently occurring pair is this new "es" symbol with the "t" character, which occurs nine times. We add this to our vocabulary below the newly added entry from the previous pass, so now we have a vocabulary that's "e" paired with "s," and then "es" paired with "t." Again we reprocess our corpus so that "es" paired with "t" is now treated as a single subword unit, "est." This is what we have after two passes: a vocabulary of two new subword units, "es" and "est," and a corpus that's been tokenized into our vocabulary.

After multiple passes of this process, in this case after ten passes, we have something like what's shown here: a vocabulary with ten tokens (the number of tokens is equal to the number of passes of the algorithm), and on the left our corpus, segmented and tokenized according to this new vocabulary. Note that the most frequent words in our corpus, "low" and "newest," were actually tokenized into their own individual words. Meanwhile, "lower" and "widest" were broken into separate tokens, but you'll notice that BPE has already started to learn some of those morphemes we talked about earlier: in the case of "lower" and "low," it's understood that they both use the same subword unit "low," and in the case of "est" in "newest" and "widest," it's recognized that morpheme as its own independent token as well. While the BPE learning algorithm is actually fairly simple, it may take rewatching this a few times to get an intuition for how it works, or you can run the Python program I mentioned earlier and get an even better feel for it. Now let's see how the tokenization phase, phase two of the algorithm, works in detail inside the spreadsheet.
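For reference, the script in that paper boils down to something like the following (lightly adapted here to use the underscore end-of-word marker from the slides; the paper itself uses a '</w>' marker):

```python
import re
import collections

def get_stats(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by frequency."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, vocab_in):
    """Re-segment the corpus so the merged pair becomes a single symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq
            for word, freq in vocab_in.items()}

# The toy corpus from the walkthrough: low x5, lower x2, newest x6, widest x3.
vocab = {'l o w _': 5, 'l o w e r _': 2,
         'n e w e s t _': 6, 'w i d e s t _': 3}
for i in range(10):                     # ten passes -> ten vocabulary entries
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)    # most frequent pair this pass
    vocab = merge_vocab(best, vocab)
    print(i + 1, best)                  # pass 1: ('e', 's'); pass 2: ('es', 't')
```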
Before we get to the spreadsheet, I want to show you this file. This is the vocab.bpe file; you can get it from OpenAI when you download the full weights of GPT-2. You'll notice there are 50,000 tokens, and these are the merges we saw earlier, with a space character between the left and right halves. You're seeing a dot right there, but that's because of my text editor; in the actual text it's really just a space. As a result, to represent the space character itself, you see this special G with an accent on it (Ġ): that's actually standing in for a real space character, and we have to substitute it out. I've gone ahead and done that inside the spreadsheet, so that special Ġ character has been replaced with a true space. What we have here is the left half of a pair in column one and the right half of a pair in column two. I've added two new columns: one is called the rank, so we can understand which pair (or merge) has the highest priority in the parsing, and the other is the score, which is the inverse of the rank, just to make the math and the formulas a little easier; it's easier to find the max and to use zero to represent when there is no match.
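In code, building those rank and score columns might look like this (a sketch; it assumes the vocab.bpe file from OpenAI's GPT-2 release, which has a version header on its first line and one space-separated merge per line, highest priority first, with 'Ġ' standing in for a leading space):

```python
merge_ranks = {}
with open('vocab.bpe', encoding='utf-8') as f:
    lines = f.read().split('\n')[1:]        # skip the '#version' header line

for rank, line in enumerate(lines, start=1):
    if line:
        left, right = line.split(' ')
        # Substitute the special 'Ġ' marker with a true space, as in the sheet.
        merge_ranks[(left.replace('Ġ', ' '), right.replace('Ġ', ' '))] = rank

NUM_MERGES = len(merge_ranks)               # 50,000 merges for GPT-2

def score(left, right):
    """Inverse of rank, as in the spreadsheet: the highest-priority merge
    gets the biggest score, and 0 means the pair isn't in the vocabulary."""
    rank = merge_ranks.get((left, right))
    return 0 if rank is None else NUM_MERGES + 1 - rank
```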
Now let's go to where the parsing and tokenization take place. That's again in this sheet, Prompt to Tokens. Now, this sheet's a little complex, so what I'm going to do is create a new sheet that builds up to the same process you see here. Let's insert a new sheet and pick an input word; let's use "chillify," for example. Actually, let's use "flavorize." Note that I've used a space and then the word "flavorize," because in GPT-2's algorithm the tokens start with the space character rather than having the end-of-word character I showed in the previous example.

Now let's actually break this into characters: I'll label this row "input characters," use a number to indicate which pass we're on, and then take the word and split it into characters. You'll notice I'm using a "split into characters" function. That's not a normal Excel function; if you go to the Name Manager, you'll see it's really just a named function, and you can actually see its implementation right here.

Next we're going to create all the possible adjacent pairs. That would be this, which is the space character concatenated with that "f." Let's move that one over, and then if I hit Calculate Sheet you can see the result here. So here we have space with "f" as a possible pair, "f" with "l" as a possible pair, "l" with "a," and so forth. I'm going to mark this as our end: let's format these cells and put a border here so we can see where the end is, because we want to be able to handle blanks properly.

Next up, we want to understand the score of each of these possible pairs inside the vocabulary, so let's add a "score" row (I'll save the question marks for the Boolean rows). For this I have, again, a named function that is really just a standard FILTER with a little protection around it for when nothing matches. We input the left character and the right character of a pair to get its score. So this space with an "f" has a score of 49,979. We can then do that across the entire sequence, hit Calculate Sheet again, and see the scores of all the possible pairs.

Now we need to figure out which of these pairs has the highest score, so let's extract the max score out of this range. This is going to be MAX over the previous row, from column two all the way to column twelve. Hit Calculate Sheet, and they should all match up; there we go. Next we want to see which one has the max score: an "is max score" row, a Boolean that simply asks whether the max score equals the score for this pair. This one is obviously false; let's carry the formula through and hit Calculate Sheet. Here we can see the max score is 49,983, which belongs to the pair "or."

What we want to do for our output is say: if this is the max score, then take the pair; otherwise, just carry down the original character that was in the input. Let's carry that through, paste it, and calculate the sheet. Okay, here we see that of these columns, only the one with the maximum, the "o"-"r" pair, which had the highest rank, gets merged through into our next version of "flavorize."

Now, there's a little problem. Actually, there are two problems. The first is that we carried through this blank character; one way to fix that is to look for blanks and make sure they don't get propagated down. The second problem is a little more subtle: because "r" was used in "or," we need to make sure that this character doesn't get propagated down either, leaving an empty space right here. So we're going to add two new rows: the first is "is input blank," and the second is "is previous sibling maxed."

Let's solve the blank one first. All we have to do is check whether the input character itself is empty, and propagate that over; here you can see the last column is true, indicating that one is blank. Then, to stop the "r" from being propagated down even though it was part of "or," all we have to do is check whether the previous sibling was the max column, which just means referencing the "is max score" row right here and propagating that value over. At the very first column there's no previous sibling, so it's always false. Now let's calculate the sheet, and here we can see the column with "r" now has a true, which tells us this cell needs to be blank.

So we're going to change our output formula: if the previous sibling is at the max score, or the input character itself was blank, we return an empty result; otherwise, if this column is the max score for this pass, we take the pair; otherwise we take the original input character and propagate it through. Propagate this through and calculate the sheet. Okay, here we can see that where there was no input we get a blank, as expected, and where the "r" was merged into the pair "or," that column is now blank as well.
So now let's show what the next pass is going to do, and we'll use a formula so we can keep track of which pass we're on. This is our second pass, and the input for this pass is going to be the output of the previous pass, right here, but with any blanks sitting in between these characters removed. To do that I'm going to use another special function, "get non blanks in range." This function, which you can again find in the Name Manager, is really just a simple version of FILTER with some error checking on it, removing any blank cells in the range. Here we've got "flavorize" with the blank removed, and now we just repeat the same process we did before: take this row, copy it, paste it down, and rerun the algorithm. Here we can see that in the second pass the highest-scoring pair was the first one, the space character with the letter "f": that got merged down and left an empty cell here, and the rest of the characters came down as normal.

At this point we can just copy all these rows and continue repeating through the spreadsheet. So that's another pass, and then we'll go down some more, and that'll be another pass. Down one more; there we go. Then right here, and we keep going. Now, because we have ten characters, we're basically going to need about nine passes, one less than the number of characters in the word itself. That should be enough. Let's calculate the sheet.

Okay, here you can see "flavorize" has finally been broken down into the word "flavor" with the space in front of it and the suffix "ize." Let me point out a few things here. You'll notice that after just a few passes it had already figured out the right tokenization and was simply propagating it through the remaining passes. You'll also notice that some of these possible pairs, like this one right here, "a" with "i," have no score: that's because that token simply does not exist in the vocabulary for GPT-2's byte pair encoding, so it ends up as a blank.

The other thing worth pointing out is how this differs from, say, naive string matching, just looking for a particular set of characters. So let's take "ize" here: if I change this word to, say, "Eisenberg" and run the sheet again, you'll notice the final tokenization is a lot different. We don't break out "ize" together; in fact it's "Eisen" and "berg," and that's because of how the byte pair encoding tokenization algorithm works, prioritizing certain morphemes, or rather subword units, over others when it tokenizes the word into its pieces.

Okay, so that's, in essence, the algorithm for how byte pair encoding works. Now, you'll notice that Prompt to Tokens is laid out a little differently. It's got somewhat different formulas that help it work across multiple words, so you'll see multiple words here, and formulas that take the position of the cell into account, using modulo arithmetic to make copying and pasting across the columns work a lot easier. But in essence, the algorithm implementation works roughly the same as I've outlined in this sheet here.
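Outside the spreadsheet, the pass-by-pass procedure the sheet implements can be sketched in a few lines of Python (using the merge_ranks mapping sketched earlier; the expected output is the tokenization the sheet produced):

```python
def bpe_passes(word, merge_ranks):
    """One merge per pass: start from single characters and, on each pass,
    merge only the adjacent pair with the best (lowest) rank, stopping
    when every remaining pair scores blank."""
    parts = list(word)                        # pass 1 input: characters
    while len(parts) > 1:
        ranked = [(merge_ranks[(a, b)], i)
                  for i, (a, b) in enumerate(zip(parts, parts[1:]))
                  if (a, b) in merge_ranks]
        if not ranked:                        # no pair is in the vocabulary
            break
        _, i = min(ranked)                    # the max-score pair of this pass
        parts = parts[:i] + [parts[i] + parts[i + 1]] + parts[i + 2:]
    return parts

# With merge_ranks loaded from vocab.bpe as sketched earlier, we'd expect:
# bpe_passes(' flavorize', merge_ranks)  ->  [' flavor', 'ize']
```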
Now that you've seen the learning and tokenization algorithms for byte pair encoding, I just want to wrap up with a few caveats.
The first is that byte pair encoding is not a universal good; it has its own trade-offs, and this post from Andrej Karpathy highlights some of those problems. One of them is this effect called SolidGoldMagikarp. The effect is that if you ask one of these large language models to repeat back certain tokens, in this case "StreamerBot," it responds with "You're a jerk." Or if you ask it to repeat back this string, it responds with "You are not a robot" or "You are a banana." My favorite example is that if you ask how many letters are in this username, it can't repeat the string back to you. Now, part of the problem here really isn't byte pair encoding itself; it's the fact that there's one learning algorithm for byte pair encoding and a separate learning algorithm for the rest of the large language model. What has happened in this case is that "davidjl" is a very prolific username on Reddit, and that particular username occurs so often that it got learned by the tokenization algorithm, but it occurs so infrequently in the rest of the training data across the rest of the internet that it has a very low probability of ever being emitted as an output token. So you end up with a disconnect: the string is in the token space, but it's not in the probabilistic output space, and that creates these kinds of effects where the model gives you back some weirdness whenever you use these tokens.

Another problem with byte pair encoding is that it's very English-centric, and there are languages where the fundamental word-separation principles involved in byte pair encoding don't work. There are other encoding schemes: SentencePiece, for example, is one of them, and it's been used for English-to-Japanese translation. And there are plenty of other tokenization algorithms; the website Hugging Face has a great list, along with libraries for tokenization that you should take a look at.

I did say that both character- and word-based encoding don't work, but there are certainly examples of Transformer architectures that do use these types of tokenization. Here, actually, is one called Charformer that uses character-based tokenization, where the tokenization learning actually takes place at the same time as the rest of the machine learning model, which hopefully solves some of the other problems we've just talked about.

And finally, it's worth noting that tokens do not have to be limited to just characters or text. Anything that you can translate into numbers you can then put through the rest of the machine learning model, whether that's audio or, in this case, images: in this paper, they used patches of an image and turned those into tokens that they then put through the rest of the Transformer model.

Okay, so that's how tokenization, or byte pair encoding, works inside the GPT-2 architecture. In future videos we'll cover the rest of the model, with the text and position embeddings up next. Thank you.