Byte Pair Encoding in AI Explained with a Spreadsheet

Spreadsheets are all you need
27 Nov 2023 · 35:26

Summary

TLDR: The video delves into the intricacies of tokenization and byte pair encoding (BPE), essential components in the operation of large language models like GPT-2. It explains how morphemes, the smallest units of meaning in a language, enable the understanding of even made-up words. The script outlines the tokenization process, where text is broken down into tokens, and how BPE identifies common subword units to handle an extensive vocabulary efficiently. The video also addresses the limitations of character-based and word-based tokenization, highlighting their increased memory and computational requirements. It demonstrates the BPE algorithm's learning phase using a simplified example and shows its application in a spreadsheet, illustrating how text like 'flavorize' is tokenized. The script concludes by noting BPE's limitations, such as the 'Solid Gold Magikarp' effect and its English-centric nature, and mentions alternative tokenization methods and the flexibility of tokens to represent other types of data.

Takeaways

  • **Tokenization**: The process of converting text into tokens, the subword units that a language model like GPT-2 understands and processes.
  • **Byte Pair Encoding (BPE)**: A subword tokenization algorithm that learns common subword units from a corpus and then tokenizes input text into these units.
  • **Morphemes**: The smallest units of meaning in a language, which BPE tends to capture by breaking words into meaningful parts.
  • **Vocabulary Size**: GPT-2 uses about 50,000 tokens, a compromise between the memory and compute costs of character-based and word-based tokenization.
  • **Model Parameters**: The GPT-2 model shown has 124 million parameters; a word-based vocabulary covering all of English would grow it to roughly 216 million.
  • **Corpus Learning**: BPE starts with a corpus of text and iteratively merges the most frequent symbol pairs to build its vocabulary of tokens.
  • **Tokenization Process**: Input text is broken into tokens based on the learned vocabulary, with the algorithm prioritizing certain subword units over others.
  • **Handling Unknown Words**: BPE handles unknown or misspelled words better than a simple word-to-number mapping, although its splits do not always match a native speaker's expectations.
  • **Solid Gold Magikarp Effect**: A failure mode where certain strings become tokens during tokenizer training but appear so rarely in the model's training data that the model responds to them unpredictably.
  • **Language Centrism**: BPE works best for languages like English with clear word separation and may be less effective for languages with different structures.
  • **Flexibility in Tokenization**: Tokens are not limited to text; audio, image patches, or anything else that can be turned into numbers can be fed through a Transformer.

Q & A

  • What is the term 'funology' in the context of the video?

    -The term 'funology' is a made-up word that combines 'fun' with the suffix '-ology', which typically denotes a study or science. In the context of the video, it illustrates how people can extrapolate the meaning of such coined words from their parts even when they are not found in a dictionary.

  • What is tokenization in the context of language models?

    -Tokenization is the process of converting text into a format that a language model can understand, which involves breaking down the text into its constituent parts or tokens. This is a crucial step as language models like GPT-2 only understand numbers, not text.

  • How does Byte Pair Encoding (BPE) work in tokenization?

    -Byte Pair Encoding (BPE) is a subword tokenization algorithm used by models like GPT-2. It operates in two phases: first, it learns common subwords from a corpus of text to create a vocabulary, and second, it tokenizes new input text using this vocabulary. BPE identifies and merges the most frequently occurring pairs of symbols (letters, subwords, or words) into single tokens, which helps in handling large vocabularies efficiently.

  • Why is BPE preferred over character-based or word-based tokenization?

    -BPE is preferred because it strikes a balance between the two. Character-based tokenization creates longer sequences and puts more work on the training algorithm, while word-based tokenization may not handle unknown or misspelled words well and requires a larger model size to accommodate a full vocabulary.

  • What is the 'solid gold Magikarp' effect in language models?

    -The 'solid gold Magikarp' effect refers to a situation where a language model fails to repeat back certain tokens or strings accurately. This can occur when a token is learned by the tokenization algorithm but has a low probability of being output due to its infrequent occurrence in the training data.

  • How does BPE handle complex or unknown words?

    -BPE can break down complex or unknown words into known subword units based on the vocabulary it has learned. If a word or subword is not in its vocabulary, BPE tokenizes it into the closest matching subword units it does recognize (see the short tokenizer check after this Q&A section).

  • What are embeddings in the context of language models?

    -Embeddings are numerical representations of words or tokens that capture their semantic meaning. Each token is transformed into a high-dimensional vector space, where each dimension represents some aspect of the word's semantic meaning. These embeddings are used as inputs to the neural network within a language model.

  • Why is the vocabulary size in GPT-2 around 50,000 tokens?

    -The vocabulary size of around 50,000 tokens in GPT-2 is a compromise between model efficiency and expressiveness. A larger vocabulary would increase the model size and computational requirements, while a smaller vocabulary might not capture enough nuances of the language.

  • What is the significance of the embedding dimension in language models?

    -The embedding dimension refers to the size of the vector space used to represent each token. It is a hyperparameter that determines the richness of the representation. A higher embedding dimension can capture more nuances but also increases the computational complexity.

  • How does the BPE algorithm decide which pairs of characters to merge?

    -The BPE algorithm decides which pairs to merge based on frequency. It identifies the most frequently occurring pairs of characters (or existing tokens) and merges them into a single token, thus gradually building up a vocabulary that represents common subwords in the language.

  • What are some limitations or challenges of using BPE?

    -BPE has some limitations, including being more effective for languages with clear word separation and less effective for languages where word separation principles differ. It can also lead to issues like the 'solid gold Magikarp' effect and may not always perfectly align with a native speaker's expectations of word boundaries.
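
To see how GPT-2's tokenizer actually splits a made-up word, you can load the same vocabulary with a BPE library. This is a minimal check, assuming the `tiktoken` package is installed; the pieces shown in the comment are what the video reports, not a guaranteed output:

```python
import tiktoken  # pip install tiktoken; exposes OpenAI's GPT-2 BPE vocabulary

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode(" flavorize")           # leading space, as in the video's prompt sheet
pieces = [enc.decode([i]) for i in ids]  # map each token id back to its text
print(pieces)                            # expected to be something like [' flavor', 'ize']
```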

Outlines

00:00

Understanding Funology and Tokenization

This paragraph introduces the concept of 'funology,' a made-up word that illustrates how morphemes help us understand new words. It sets the stage for discussing tokenization, a process that breaks text into tokens that a language model like GPT-2 can understand. The paragraph explains that tokenization doesn't just split sentences into words but can also break single words into multiple tokens, which is crucial for a language model to convert text into a numerical form it can process.

05:00

Tokenization and the GPT-2 Model

The second paragraph delves into the technical aspects of tokenization within the GPT-2 model. It discusses the challenges of converting words into numbers and the limitations of direct word-to-number mapping. The paragraph also highlights the trade-offs involved in using a large vocabulary size and the computational implications of such a model. The explanation includes a practical demonstration using a spreadsheet to show how the GPT-2 model uses a text embedding matrix to represent tokens, and the impact of expanding the vocabulary on the model's parameters.

10:03

Exploring Subword Tokenization

This paragraph explores the concept of subword tokenization as a middle ground between character-based and word-based tokenization. It explains the two phases of the Byte Pair Encoding (BPE) algorithm used by GPT-2: the learning phase, where common subwords are identified from a text corpus, and the tokenization phase, which processes user input into tokens. The paragraph uses an example to illustrate how BPE learns from a small corpus and builds a vocabulary that can tokenize a given text, emphasizing the algorithm's flexibility and efficiency.
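
A minimal sketch of that learning loop in Python, following the toy corpus used in the video (word frequencies with an explicit end-of-word marker, as in the Sennrich et al. paper); GPT-2's real tokenizer is byte-level and trained on a far larger corpus, so treat this as illustration only:

```python
from collections import Counter

# Toy corpus from the video: symbol sequence -> frequency, with "_" marking end-of-word.
corpus = {("l", "o", "w", "_"): 5,
          ("l", "o", "w", "e", "r", "_"): 2,
          ("n", "e", "w", "e", "s", "t", "_"): 6,
          ("w", "i", "d", "e", "s", "t", "_"): 3}

def pair_counts(corpus):
    counts = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):   # every adjacent pair of symbols
            counts[(a, b)] += freq
    return counts

def merge(corpus, pair):
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # treat the pair as one symbol now
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

vocab = []                       # learned merges, in order
for _ in range(10):              # 10 passes -> 10 merge rules, as in the video
    best = pair_counts(corpus).most_common(1)[0][0]
    vocab.append(best)
    corpus = merge(corpus, best)

print(vocab)   # the first merges should include ("e", "s") and ("es", "t")
```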

15:04

Implementing BPE Tokenization in a Spreadsheet

The fourth paragraph provides a detailed walkthrough of implementing BPE tokenization within a spreadsheet application. It demonstrates how to break down words into characters, form possible pairs, calculate scores for these pairs, and merge them based on the highest score. The process is shown step-by-step, including handling edge cases like blank characters and ensuring that tokens are correctly propagated through multiple passes of the algorithm.
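
A sketch of the same pass-based logic in Python. The `merge_rank` table here is a made-up stand-in for the spreadsheet's vocab tab (GPT-2's real ranks come from OpenAI's vocab.bpe file and differ); lower rank means higher score, and this version merges one pair per pass rather than every cell that ties for the maximum, which is enough to reproduce the " flavor" + "ize" result:

```python
# Hypothetical merge ranks; the real table is GPT-2's vocab.bpe (lower rank = higher priority).
merge_rank = {(" ", "f"): 10, (" f", "l"): 20, ("a", "v"): 30, ("o", "r"): 40,
              ("av", "or"): 50, (" fl", "avor"): 60, ("i", "z"): 70, ("iz", "e"): 80}

def tokenize(word):
    symbols = list(word)                          # pass 1 input: one symbol per character
    while True:
        # score every adjacent pair that exists in the vocabulary
        candidates = [(merge_rank[pair], i)
                      for i, pair in enumerate(zip(symbols, symbols[1:]))
                      if pair in merge_rank]
        if not candidates:                        # no known pair left: tokenization is done
            return symbols
        _, i = min(candidates)                    # best pair = lowest rank = highest score
        symbols = symbols[:i] + [symbols[i] + symbols[i + 1]] + symbols[i + 2:]

print(tokenize(" flavorize"))                     # [' flavor', 'ize'] with this toy table
```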

20:05

Iterative Process of BPE Tokenization

This paragraph continues the explanation of the iterative process of BPE tokenization within the spreadsheet. It shows how the algorithm progresses through multiple passes, refining the tokenization of a word each time. The paragraph also points out that the algorithm converges on the correct tokenization relatively quickly and continues to propagate the result through subsequent passes. It emphasizes the algorithm's ability to handle different words and the nuances of its operation.

25:08

Caveats and Considerations of BPE

This paragraph discusses the limitations and trade-offs associated with BPE tokenization. It mentions issues like the 'Solid Gold Magikarp' effect, where certain strings are not repeated back correctly by the language model. The paragraph also addresses the English-centric nature of BPE and its potential shortcomings in other languages. It concludes by noting that tokenization is not limited to text and can be applied to other forms of data, such as audio or image patches, which can be translated into tokens for processing by a Transformer model.

30:11

Future Coverage of the GPT-2 Model

The final paragraph briefly mentions that future videos will cover the remaining components of the GPT-2 model, including text and position embeddings. It serves as a transition, indicating that the current discussion on tokenization is part of a larger series exploring the intricacies of modern AI and language models.


Keywords

Funology

A made-up term that combines 'fun' with the suffix '-ology' to suggest the study of fun. It's used in the video to illustrate how morphemes can help infer the meaning of non-existent words. The concept is introduced to explain the importance of morphemes in language understanding, which is crucial for the discussion on tokenization and byte pair encoding in AI.

Tokenization

Tokenization is the process of breaking text into smaller units, known as tokens, which are often words or subwords. In the context of the video, it's a critical step in preparing text for machine learning models, as it converts text into a numerical format that models can understand and process.

Byte Pair Encoding (BPE)

Byte Pair Encoding is a subword tokenization algorithm used by models like GPT-2 to handle a large vocabulary size more efficiently. It involves two phases: learning common subwords from a text corpus and then using those subwords to tokenize new text. BPE is central to the video's discussion on how language models encode textual information into a numerical format.

Embedding Matrix

The Embedding Matrix is the component within the GPT-2 model that represents each token as a vector of numbers. Each row corresponds to a single token in the vocabulary, and the width of the matrix is known as the embedding dimension. The video uses the Embedding Matrix to demonstrate the trade-offs between vocabulary size and model complexity.
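
A rough sketch of what that lookup looks like; the weights here are random stand-ins for the model_wte tab, and the token IDs are hypothetical:

```python
import numpy as np

vocab_size, embedding_dim = 50257, 768
wte = np.random.randn(vocab_size, embedding_dim)  # stand-in for GPT-2's real wte weights

token_ids = [11111, 22222]   # hypothetical IDs for two tokens of a prompt
vectors = wte[token_ids]     # each token ID selects one row of the matrix
print(vectors.shape)         # (2, 768): one 768-dimensional embedding per token
```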

Morphemes

Morphemes are the smallest meaningful units in a language. They can be entire words or parts of words. The video discusses how morphemes are used to understand the meaning of words and how they are important for language models to interpret text accurately.

Transformer Architecture

The Transformer Architecture is a type of deep learning model that is particularly effective for natural language processing tasks. It is the foundation for models like GPT-2 and is discussed in the video in relation to how it processes text through tokenization and embeddings.

Subword Units

Subword units are parts of words that are used to tokenize text in models like GPT-2. They are important because they allow the model to handle a large vocabulary without significantly increasing the model's size. The video explains that BPE breaks words into these subword units for processing.

Corpus

A corpus is a large, structured set of texts that is used to train a language model. In the video, the corpus is used to illustrate how BPE learns common subwords by analyzing the frequency of character pairs in the text data.

Parameter

In the context of machine learning, a parameter is a value that is learned from the data during training. The video discusses how the number of parameters in a model, such as GPT-2, directly relates to its complexity and the computational resources required to run it.

Solid Gold Magikarp Effect

A phenomenon where a string becomes its own token during tokenizer training but appears so rarely in the model's training data that the model assigns it a very low output probability, leading to unexpected or nonsensical responses. The video uses this effect to highlight some of the limitations and quirks of BPE and large language models.

Context Window

The context window refers to the amount of text that a model can consider at once. The video explains that character-based tokenization produces longer sequences for the same text, which increases memory and computational requirements during both inference and training.
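
A quick comparison using the video's own prompt; the BPE split shown is approximately what GPT-2 produces, and the point is only the difference in sequence length that the Transformer must then carry through every layer:

```python
prompt = "Mike is quick. He moves"

char_tokens = list(prompt)                                     # character-level tokenization
bpe_tokens = ["Mike", " is", " quick", ".", " He", " moves"]   # roughly GPT-2's BPE split

print(len(char_tokens), len(bpe_tokens))   # 23 vs 6 positions in the context window
```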

Highlights

Funology is a made-up word, but it can be understood by combining the word 'fun' with the suffix '-ology', demonstrating how morphemes help in understanding language.

Tokenization is the process of converting text into tokens, which can be words, subwords, or characters, and is a crucial step in language model implementation.

GPT-2, an early precursor to ChatGPT, uses a method called byte pair encoding (BPE) for tokenization, which breaks down words into subword units.

Byte pair encoding has two phases: learning the common subwords in a language and then using that vocabulary to tokenize input text.

BPE starts by counting character pairs and iteratively merges the most frequent pairs, building a vocabulary that the model can use for tokenization.

The tokenization process in BPE is not always perfect and may not align with a native speaker's expectations, but it captures common subword units that tend to have meaning.

Assigning each word a number is not practical due to the inability to handle unknown or misspelled words and the large vocabulary size increasing the model's memory and compute requirements.

The GPT-2 model has about 50,000 tokens, and increasing the vocabulary to include all English words would nearly double the model's parameters to 216 million.
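
The back-of-the-envelope arithmetic behind that figure, as worked through in the spreadsheet:

```python
vocab_size, embedding_dim = 50_257, 768
english_words = 170_000                        # rough word count quoted in the video

current_wte = vocab_size * embedding_dim       # ~38.6M parameters in the embedding matrix
extra_rows = english_words - vocab_size        # ~120k extra rows for word-level tokens
extra_params = extra_rows * embedding_dim      # ~92M additional parameters

print(extra_params)   # added to GPT-2 small's ~124M total, this nearly doubles the model
```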

Character-based tokenization creates longer sequences and has low semantic correlation of characters, making it more challenging for the model to learn the language.

Subword tokenization, like BPE, strikes a balance between character and word tokenization, providing flexibility and efficiency in processing language.

The learning phase of BPE involves creating a corpus of text and iteratively merging the most frequent character pairs until a vocabulary is established.

The tokenization phase uses the established vocabulary to convert user input into tokens that can be processed by the language model.

BPE's tokenization algorithm prioritizes certain subword units over others, which can result in different tokenization outcomes for similar characters in different contexts.

BPE has its limitations, including issues like the 'Solid Gold Magikarp' effect where certain strings are not accurately repeated by the model.

BPE is very English-centric and may not work well for languages where word separation principles do not apply.

There are other tokenization algorithms and Transformer architectures that use character-based tokenization, which may address some of BPE's shortcomings.

Tokens are not limited to text and can represent any data type that can be translated into numbers, allowing for the processing of various data such as audio or images.
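
For example, a minimal sketch of the image-patch idea (a ViT-style layout with 16×16 patches; the sizes are illustrative assumptions, not the referenced paper's exact configuration):

```python
import numpy as np

image = np.random.rand(224, 224, 3)            # stand-in for a real RGB image
p = 16                                         # patch size
grid = 224 // p                                # 14 x 14 grid of patches
patches = image.reshape(grid, p, grid, p, 3)   # carve rows and columns into 16-pixel blocks
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * 3)

print(patches.shape)   # (196, 768): 196 patch "tokens", each flattened to a 768-value vector
```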

Transcripts

play00:00

suppose your friend told you they were

play00:02

an expert in

play00:04

funology now this isn't a word you'll

play00:07

find in the dictionary but because it

play00:09

combines the word fun with the suffix

play00:12

ology you'd be able to extrapolate the

play00:14

meaning and infer they were trying to tell

play00:16

you they were an expert in making things

play00:19

enjoyable or suppose your friend told

play00:21

you they were going to flavorize a bland

play00:24

soup again not a real word but you'd be

play00:28

able to figure out they were telling you

play00:30

they were able to make their soup

play00:31

tasty or finally let's suppose they told

play00:34

you they were going to chillify your party

play00:37

by dimming the lights and playing smooth

play00:40

jazz again you'd be able to understand

play00:42

they were trying to make the Ambiance

play00:44

more

play00:45

relaxed what all these examples have in

play00:47

common is that even though they're

play00:49

madeup words you're able to figure them

play00:52

out thanks to what linguists call

play00:54

morphemes these are subword units that

play00:57

contain the meaning and it turns out

play01:00

when you're busy typing away into a

play01:02

large language model like chat GPT it's

play01:05

actually using a similar set of Clues to

play01:07

figure out what you're saying as

play01:09

well and that's the subject of today's

play01:12

video on tokenization and byte pair

play01:15

encoding welcome to spreadsheets are all

play01:17

you need if you're just joining us

play01:19

spreadsheets are all you need is a

play01:21

series of tutorials on how modern

play01:23

artificial intelligence Works through

play01:25

the lens of a spreadsheet that's right

play01:28

if you can read a spreadsheet you

play01:30

can understand modern AI That's because

play01:32

this spreadsheet that you see here which

play01:34

you can download on the website actually

play01:36

implements a large language model all

play01:38

the way from a prompt to getting a

play01:40

predicted token out in fact it doesn't

play01:43

just Implement any large language model

play01:46

it implements the entire gpt2

play01:48

architecture an early precursor to chat

play01:51

GPT that was state-of-the-art just a few

play01:53

years

play01:55

ago now for the purpose of today's

play01:58

episode the problem we're trying to

play01:59

wrestle with if you take a look at this

play02:02

spreadsheet or uh if you look at the

play02:04

internals of gpt2 what you'll notice is

play02:07

sure we're typing in text here but as we

play02:10

go deeper into the implementation we'll

play02:13

just see Table after Table after table

play02:17

of

play02:17

numbers in essence the Transformer

play02:20

architecture that's at the heart of

play02:23

gpt2 only understands numbers but you're

play02:26

inputting text what we need is a process

play02:29

that can convert text into numbers

play02:32

there's two steps to that and in this

play02:34

video we're going to talk about the

play02:35

first step what's known as

play02:38

tokenization when I introduced

play02:40

tokenization in a previous video I used

play02:43

this example Mike is quick period he

play02:46

moves and showed how it was broken into

play02:49

separate words but tokenization doesn't

play02:51

just break a sentence into its words

play02:54

it's not uncommon for a single word to

play02:56

be broken into one two three or more

play02:58

separate tokens let's see this in action

play03:01

in the

play03:02

spreadsheet so here this tab type prompt

play03:05

here is where we enter our prompt with

play03:07

each word or punctuation on a separate

play03:09

line and note that we have to add the

play03:11

spaces in here manually and then prompt

play03:14

to tokens is where that text gets broken

play03:16

out into separate tokens as you can see

play03:18

here Mike is quick period he moves and

play03:22

underneath we'll see the token ID now

play03:24

for now just think of the token ID as

play03:27

its position inside the dictionary of

play03:29

known tokens we'll be talking more about

play03:31

that in later videos for now let's just

play03:33

try some of the examples we used in the

play03:35

introduction to see tokenization for

play03:37

more complex words

play03:39

like I

play03:41

will

play03:43

flavorize the

play03:46

soup and then because this is such a

play03:49

large spreadsheet remember that we've

play03:51

got manual calculation turned on so to

play03:53

calculate we need to actually hit this

play03:55

calculate sheet button and then just

play03:57

wait a little

play03:58

bit

play04:00

okay here you can see the word flavorize

play04:02

has been broken into two parts the

play04:03

suffix ize as well as the word flavor

play04:06

with the space in front of it let's try

play04:08

one more

play04:11

example let

play04:14

us

play04:17

chillify the

play04:20

party and again you see chillify has been

play04:23

broken into the suffix ify along with

play04:25

the word chill now byte pair encoding isn't

play04:29

always perfect so let's talk about an

play04:33

example here where it

play04:35

fails the brace will prevent

play04:42

reinjury now to you and I reinjury is

play04:46

the word injury with prefix re in front

play04:50

of it but as you see here in byte pair

play04:53

encoding it's actually broken into the

play04:55

word re, in, and then the word jury so it

play05:00

doesn't always line up with you know a

play05:02

native English speaker expectation but

play05:04

it does do a good job of actually

play05:05

capturing the most common subword units

play05:08

that tend to have

play05:12

meaning now you're probably wondering

play05:14

why we don't just turn words into

play05:16

numbers by just assigning each word a

play05:18

number like dog as one cat as two and so

play05:21

forth throughout the entire dictionary

play05:23

well this has a couple problems the

play05:26

first is that this isn't able to handle

play05:28

unknown or misspelled Words which are

play05:30

actually really common if you're say

play05:31

scraping all the text on the internet

play05:33

now that being said there are

play05:35

Transformer models that do have a

play05:37

special unknown token so it's not an

play05:39

insurmountable problem another problem

play05:42

is just supporting a large vocabulary

play05:44

size increases the size of the

play05:46

model GPT-2 has about 50,000 tokens but

play05:50

English has about three times that

play05:53

170,000 words and so the end result is

play05:56

that it creates more memory and more

play05:58

compute needed to run and train the

play06:03

model we can actually demonstrate this

play06:06

inside the

play06:07

spreadsheet so inside the spreadsheet is

play06:10

a specific tab called Model wte what's

play06:14

known as the text embedding Matrix the

play06:17

important thing to know here is that

play06:19

each row of this Matrix corresponds to a

play06:21

single word so it's the entire

play06:23

vocabulary size I in this case

play06:27

50,257 because there are 50,257

play06:30

known tokens inside the GPT-2 model

play06:34

now the width of this Matrix is

play06:36

something called the embedding Dimension

play06:37

we'll learn about that in later videos

play06:39

for now just know it's about 768

play06:42

columns the key point is that the entire

play06:45

gpt2 model we're playing with is 124

play06:48

million parameters if we were to take

play06:50

this Matrix and add 120,000 more rows

play06:54

effectively what we would need to have

play06:56

all the words in the English language in

play06:58

a word Style embedding we're basically

play07:01

actually nearly doubling the number of

play07:04

parameters to 216 million that's a huge

play07:06

increase in the amount of compute needed

play07:08

just to calculate a token let's go see

play07:11

this in the

play07:12

spreadsheet so here we are in the sheet

play07:16

and if you look at the list of all the

play07:18

tabs in this workbook you'll notice

play07:20

there's a large number of them that

play07:21

start with model underscore and these

play07:24

are basically tables of numbers like

play07:27

this when somebody talks to you about

play07:29

the number of parameters in a model it's

play07:31

really basically how many numbers are in

play07:34

every single one of these tabs across

play07:36

this entire model and in this case in

play07:38

the spreadsheet there's 124 million of

play07:41

them now the model we were talking about

play07:44

before or rather the Matrix we were

play07:45

talking about before is this Matrix wte

play07:48

and here you can see it's got 768

play07:51

columns and it's got 50,257 rows each

play07:55

of these rows corresponds to a single

play07:56

token so let's do some calculations here

play08:03

create a blank workbook so we know that

play08:07

the

play08:08

original

play08:14

Matrix height is

play08:18

50,257

play08:20

we know that the original Matrix

play08:25

width is

play08:28

768 if we were going to add enough rows

play08:31

to accommodate all the words in the

play08:33

English language in a word style

play08:36

tokenization then we would need

play08:38

additional

play08:40

rows we know that's

play08:44

170,000 so actually that's the total

play08:49

rows for word

play08:53

encoding, or rather tokenization, I should

play08:57

say that means the additional

play09:03

rows is this minus that is about

play09:10

120,000 so let's add how many additional

play09:13

parameters that would be well that's

play09:15

additional

play09:17

parameters are just simply the number of

play09:20

additional rows times the width of each

play09:23

row that's

play09:25

768 let's expand this out so it's easier

play09:28

to see and then let's add some commas so

play09:30

it's easier to read and let's get rid of

play09:32

those extra arrows or sorry zeros rather

play09:36

and then remember that the entire model

play09:38

is about 124 million so this is the original

play09:42

model

play09:43

size so basically we're nearly doubling

play09:46

the size of the model we're adding 91

play09:48

million more parameters just to

play09:50

accommodate all the words in the English

play09:52

language using byte pair encoding we can

play09:55

do this All a lot more flexibly and

play09:57

using only 50,000 tokens

play10:03

now you might be wondering why don't we

play10:04

go the other way let's use

play10:05

character-based tokenization so give

play10:08

each letter a number so a equals 1 b

play10:10

equals 2 and at least in English

play10:12

language there aren't that many

play10:13

characters so you're not going to have

play10:14

that many rows and if you're familiar

play10:17

with the ASCII encoding or you come

play10:18

from a developer background you're

play10:19

probably used to this kind of setup as

play10:21

well this has its own couple of problems

play10:24

so the first one is it actually creates

play10:26

longer sequences let me show you that

play10:31

let's close this and let's go to where

play10:35

we enter in our

play10:38

prompt right

play10:43

here after we've turned these The Prompt

play10:46

into tokens those tokens are then turned

play10:49

into what are called embeddings These

play10:50

are long vectors this is each of these

play10:52

tokens goes into a row of 768 values the

play10:56

key point though is that this Matrix

play10:59

here this is basically what you're

play11:00

probably familiar with as the context

play11:02

length when you type into something like

play11:03

chat GPT is as long as the number of

play11:07

tokens right now you can see it's only

play11:10

well it's only about six tokens long but

play11:13

if each of these were split out by its

play11:15

character it's going to be a lot longer

play11:17

so our context length that we have to

play11:19

process for something that's only this

play11:21

case six words long is going to be a lot

play11:23

longer and this gets carried through

play11:25

this length of six gets carried through

play11:27

not just here but through actually every

play11:30

stage in every layer of the entire

play11:33

Transformer pipeline so in effect even

play11:36

though with the character-based

play11:37

tokenization the size of the model that

play11:39

WTE Matrix may be smaller the length of

play11:43

the context window in order to process

play11:45

the same amount of text gets longer

play11:46

through the entire inference and

play11:48

training

play11:49

pass that means again more memory and

play11:51

more compute there's another problem

play11:54

that's more subtle which is that there's

play11:55

low semantic correlation of characters

play11:58

themselves so as an example of this

play12:00

you've probably seen this email that

play12:02

went around uh a couple years ago and

play12:05

they've jumbled all the letters in the

play12:07

words but it's still readable so the

play12:09

first sentence basically says according

play12:10

to research at Cambridge University

play12:13

doesn't matter what order the letters in

play12:14

a word r the only important thing is

play12:16

that the first and last letter be in the

play12:18

right place by the way that's

play12:19

technically not true that the first and

play12:21

last letter need to be in the right

play12:22

place but the point still stands that

play12:25

the meaning of a word isn't conveyed in

play12:27

just the characters itself but in how

play12:29

they're grouped together and because the

play12:32

characters themselves don't carry that

play12:33

information that puts more work on the

play12:36

training algorithm to learn English and

play12:37

what the meanings of words are than it

play12:39

would if it was actually using words to

play12:41

begin with in the end both these

play12:43

problems mean more memory more compute

play12:45

it becomes harder to train and use the

play12:48

model so if character tokenization has

play12:51

its own problems and word tokenization

play12:53

at the other end has another set of

play12:55

problems maybe there's a Goldilocks in

play12:56

between the two small to big that's just

play12:58

just right and it turns out there is and

play13:00

that is subword

play13:02

tokenization now there's more than one

play13:04

subword tokenization algorithm out there

play13:06

the one that GPT-2 uses is called byte

play13:09

pair encoding or bpe and it's got two

play13:11

phases the first phase is a learning

play13:14

phase where it learns what the common

play13:15

subwords are in a particular language

play13:18

and then the second phase is the actual

play13:19

tokenization phase that takes input and

play13:22

turns it into tokens be processed by the

play13:25

large language

play13:27

model so for the first phase I want you

play13:30

to imagine you've gathered a large body

play13:32

of text maybe by scraping all the English

play13:34

language on the internet this sometimes

play13:36

referred to as a corpus of

play13:38

text and you pass that into the BPE

play13:41

learning algorithm which turns this into

play13:43

a known vocabulary of tokens and we'll

play13:45

walk through this in a little more

play13:46

detail in the next few slides then in

play13:49

the tokenization phase we take the input

play13:51

from the

play13:52

user and we combine it with the

play13:55

vocabulary that we got from the first

play13:57

phase to output tokens that are then used

play14:00

in later processing stages of the

play14:02

model I'm going to illustrate the

play14:04

learning phase through some slides and

play14:07

then implement the tokenization phase

play14:09

inside the sheet so you can see how it

play14:11

actually

play14:13

works to show the learning phase I'm

play14:16

actually going to use the same example

play14:18

that was used in this paper that

play14:20

introduced this algorithm to machine

play14:21

learning and it's a short readable paper

play14:24

it actually has a python script in it if

play14:27

you want to try it out I'll just run

play14:29

through it in slides here for now now I

play14:31

want you to imagine the part on the left

play14:33

the Corpus is the result of maybe

play14:35

scanning all the English language we

play14:37

could find on the internet obviously

play14:39

it's not that long it's a toy example to

play14:42

illustrate the process and it just has

play14:44

in this case five words the word low

play14:46

occurring five times in our scan the

play14:49

word lower occurring twice the word

play14:51

newest occurring six times and the word

play14:54

widest occurring three times and then

play14:57

we'll start on the right with our vocab

play14:59

of just the characters from A to Z now

play15:02

in order to make the process more clear

play15:04

we're going to rewrite the Corpus a

play15:05

little bit I'll use the dot symbol to

play15:08

show the separation between the

play15:09

individual characters I'll use this

play15:12

underscore character to illustrate the

play15:14

boundary of where a word ends and

play15:15

another one might

play15:17

begin the first step in learning the

play15:20

vocabulary is to look at all the

play15:22

adjacent pairs of characters and count

play15:24

up which one is the most frequently

play15:26

occurred so for example the letter e

play15:29

next to the letter s occurs nine times

play15:33

it occurs six times in the word newest

play15:37

and it occurs three times in the word

play15:39

widest so it occurs a total of nine

play15:41

times so we then write down e paired

play15:44

with s with a frequency of nine and then

play15:47

we do this for all the adjacent pairs of

play15:49

characters inside our Corpus and then we

play15:52

look at the one that has the most

play15:54

frequent occurrence in this case it's

play15:56

and although we could pick s and t or T

play15:59

and the end of word character they all

play16:00

have nine and then we take that and we

play16:03

add that to our

play16:05

vocabulary we then reprocess our corpus

play16:09

with our new vocabulary we take all

play16:11

occurrences of e paired with s we

play16:14

transform them into a new es character

play16:18

so now es together are going to be

play16:20

treated as if they were single character

play16:22

even though to us they're considered

play16:24

separate

play16:26

characters then we repeat the process

play16:29

so now the most frequently occurring

play16:31

pair is this new es with the T character

play16:35

and that occurs nine times we then add

play16:38

this to our vocabulary below the newly

play16:41

added vocabulary entry from the previous

play16:43

pass so now we have a vocab that's e

play16:46

paired with s and then es paired with

play16:49

t and again we reprocess our Corpus such

play16:54

that es paired with T is now treated as

play16:57

a single subword unit

play17:01

EST and this is what we have after two

play17:04

passes we have a vocabulary of two new

play17:08

subword units es and then est and

play17:12

then we have our Corpus that's been

play17:14

tokenized into our

play17:17

vocabulary after multiple passes of this

play17:19

process in this case after 10 passes we

play17:23

have something like shown here we have a

play17:25

vocabulary which has 10 tokens the

play17:27

number of tokens

play17:29

equal to the number of passes we did of

play17:30

the algorithm and then on the left our

play17:33

Corpus that's been segmented and

play17:34

tokenized according to this new

play17:37

vocabulary so note the most frequent

play17:40

words in our Corpus low and newest were

play17:43

actually tokenized into their own

play17:45

individual

play17:46

words meanwhile lower and widest were

play17:49

broken into separate tokens but you'll

play17:51

notice that BPE has already started to

play17:53

learn some of those morphemes we talked

play17:55

about earlier in the case of lower and

play17:59

low it's understood that they both use

play18:01

the same subword unit low and in the

play18:04

case of EST in newest and widest it's

play18:07

understood that morpheme and recognized

play18:09

that as well as its own independent

play18:10

token while the bpe learning algorithm

play18:13

is actually fairly simple it may take

play18:15

rewatching this a few times to get an

play18:17

intuition for how it works or you can

play18:19

run the Python program I showed earlier

play18:22

and get an even better feel for how it

play18:24

works now let's see a tokenization

play18:27

process, phase two of the algorithm, works

play18:29

in detail inside the

play18:32

spreadsheet before we get to the

play18:33

spreadsheet I want to show you this file

play18:36

this is the vocab BP file you can get

play18:38

this from OpenAI when you download the

play18:41

full weights of gpt2 you'll notice

play18:43

there's 50,000 tokens and these are the

play18:46

merges that we saw earlier where there's

play18:48

a space character between the left and

play18:50

the right you're seeing a DOT right

play18:53

there but that's because of my text

play18:54

editor in the actual text it's really

play18:56

just a space and then as result to

play18:59

represent the space character itself you

play19:00

see this special kind of G with an

play19:02

accent on it that's actually the real

play19:04

space character we have to substitute that

play19:07

out I've gone ahead and actually done

play19:10

that inside the spreadsheet so that

play19:12

special G character has been replaced

play19:13

with a true space we have here is the

play19:16

left half of a pair and in column two

play19:19

the right half of a pair I've added two

play19:21

new columns one is called the rank so we

play19:23

can understand which pair or merge has

play19:25

the highest priority in the parsing

play19:28

and score which is the inverse of rank

play19:31

just to make the math and calculations

play19:33

and formulas a little easier it's easier

play19:35

to find the Max and use zero to

play19:37

represent when there is no

play19:39

match now let's go to where the parsing

play19:42

and tokenization takes

play19:43

place that's again in this sheet

play19:46

prompted

play19:47

tokens now this Sheet's a little complex

play19:49

so what I'm going to do is I'm going to

play19:51

create a new sheet that builds up to the

play19:53

same process that you see here let's

play19:56

insert a new sheet and let's use an

play19:59

input word so let's use input word let's

play20:02

use uh chillify for example actually

play20:05

let's use

play20:08

flavorize and note that I've used space

play20:11

then the word flavorize because in GPT

play20:15

2's algorithm the tokens start with the

play20:17

space character rather than having that

play20:19

end to word character that I showed in

play20:20

the previous

play20:21

example now let's actually break this

play20:24

into

play20:26

characters input whoops input

play20:31

characters and I'm going to use a number

play20:35

to indicate which paths we're at and

play20:37

then I'm going to take this and I'm

play20:39

going to split it into characters now

play20:41

you notice I'm using the split into

play20:42

characters function that's not a normal

play20:45

Excel function if you go to the name

play20:46

manager you'll see this is really just a

play20:49

named function and you can actually see

play20:51

the implementation of it right here

play20:53

split into

play20:55

characters now we're going to create all

play20:57

the possible

play20:59

pairs all the possible adjacent pairs so

play21:02

that would be this which is the space

play21:04

character concatenated with that

play21:08

s let's move that one

play21:12

over and then if I hit calculate sheet

play21:15

you can see the result here so here we

play21:17

have space with f as a possible pair we

play21:19

have f with L as a possible pair L with

play21:21

a and so forth I'm going to actually put

play21:25

this and Mark this as our

play21:27

end so let's do that let's format these

play21:30

cells and let's put a border here so we

play21:33

can see where the end is we want to be

play21:35

able to handle blanks

play21:37

properly

play21:39

so next up we want to understand what

play21:42

the score of each of these possible

play21:44

pairs are inside a vocabulary so let's

play21:47

get the

play21:49

score and I will put question mark there

play21:53

I'll save that for

play21:55

booleans so for this I have again a

play21:58

named function that really is just

play22:00

implementing a standard filter function

play22:03

with a little protection around it for

play22:04

if something doesn't match and we'll

play22:07

input the left character and then the

play22:09

right character in a pair to get its

play22:12

score so this space with an F has a

play22:15

score of 49,979

play22:20

we can then do that across the

play22:20

entire sequence let's say calculate

play22:22

sheet again and then we can see the

play22:25

scores of each of the possible pairs now

play22:28

we need to figure out which of these

play22:30

pairs has the highest score so let's

play22:33

extract the max score out of this range

play22:36

so this is going to be

play22:38

Max and we're going to take the previous

play22:41

row and we'll go from column two all the

play22:45

way to previous

play22:48

row and column

play22:55

12 and hit calculate sheet and they

play22:58

should all match up there we go now we

play23:01

want to see which one has the max score

play23:03

so is Max score and we'll make this a

play23:06

Boolean and that's simply saying is this

play23:09

Max score equal to the score for this

play23:12

pair so this one is obviously false and

play23:16

then let's carry this formula

play23:18

through H calculate sheet and here we

play23:21

can see the max score is

play23:24

49,983 which is the pair o r what we want

play23:27

to do for our

play23:30

output is say

play23:34

if it's the max score then take the pair

play23:38

otherwise just carry down the original

play23:40

character that was in the input let's

play23:42

carry that

play23:46

through paste that and then calculate

play23:50

sheet okay here we see of these columns

play23:55

only the one with the maximum one o r

play23:57

pair which had the highest score gets

play23:59

carried through into our next version of

play24:02

flavorize now there's a little problem

play24:04

actually there's two problems the first

play24:06

is you'll notice that we carried through

play24:08

this blank character so one way you can

play24:10

fix that is look for blanks and then

play24:13

make sure they don't get propagated down

play24:15

the second problem you'll notice is a

play24:17

little more subtle and it's that

play24:19

because r was used in the o r pair we now need to

play24:24

make sure that this character doesn't

play24:26

get propagated down either and leave an

play24:28

empty space right here we're going to

play24:31

add two

play24:33

new

play24:43

rows the first one is going to be

play24:47

is input

play24:51

blank and then the second is going to be

play24:54

is previous

play24:56

sibling

play24:58

maxed so let's solve this blank one

play25:02

first so all we have to do here is just

play25:04

check if the input character itself is

play25:08

empty and we'll propagate that

play25:14

over and here you can see in the last

play25:17

column it's true indicating that that

play25:20

one is blank and then for solving this R

play25:24

getting propagated down even though it

play25:26

was part of o

play25:28

R all we have to do is check if the

play25:32

previous sibling was the max column and

play25:35

that all I have to do is just check Max

play25:38

score which is this row right here just

play25:41

propagate that value

play25:46

over and here at the very first column

play25:50

there's no previous siblings so it's

play25:51

always

play25:52

false now let's calculate the

play25:56

sheet

play25:58

and here we can see the column with r

play26:01

now has a true here which is going to

play26:03

let us know that this field right here

play26:05

or cell needs to be

play26:07

blank so we're going to change our

play26:09

output

play26:10

formula to now check if the previous

play26:14

sibling is at Max score or the input

play26:18

character itself was blank in that case

play26:21

we just return an empty

play26:23

result

play26:25

otherwise if this column

play26:28

is the max score for this pass then we

play26:32

take the pair otherwise we take the

play26:35

original input character and propagate

play26:37

that

play26:46

through and propagate this

play26:52

through and calculate

play26:54

sheet okay so here we can see in this

play26:58

case where there was no input we get a

play27:01

blank here as expected and in this case

play27:04

where the r has been moved over into

play27:06

this pair o r which got merged down that

play27:10

column is now blank as well so now let's

play27:14

show what the next pass is going to

play27:17

do and we'll use a formula so we can

play27:19

keep track of which pass we're on this

play27:22

is our second

play27:23

pass and the input for this pass is

play27:25

going to be the output right here of the

play27:28

previous pass but we're going to remove

play27:32

any blanks that are sitting in between

play27:34

these characters and to do that I'm

play27:36

going to use another special

play27:39

function get non blanks in

play27:44

range and this

play27:46

function you can find again by going

play27:50

into the name manager you'll see it in

play27:52

here it's really just a uh simple

play27:55

version of filter with some error

play27:57

checking on on

play27:58

it and removing any blank characters or

play28:01

rather blank spaces and cells in the

play28:03

range here we've got flavorize with

play28:07

this blank removed and now we're just

play28:09

going to repeat the same process we did

play28:12

before so

play28:14

take this

play28:16

row copy that and paste it

play28:19

down and now we rerun the

play28:24

algorithm and here we can see in the

play28:27

second

play28:28

pass the highest scoring pair was the

play28:32

first one that's the space character

play28:35

with the letter F and that got

play28:37

propagated down and left a empty cell

play28:40

here and then the rest of them came down

play28:42

as normal now at this

play28:45

point we can just copy all these rows

play28:49

and just continue repeating through the

play28:52

spreadsheet so that's another pass and

play28:55

then we'll go down some more

play29:00

and that'll be another

play29:03

pass down one more there we

play29:06

go

play29:08

then right

play29:10

here and we keep

play29:23

going now because we have 10 characters

play29:25

we're basically gonna need

play29:27

about nine passes one more or sorry one

play29:31

less than the number of characters in

play29:34

the word itself that should be enough

play29:36

let's calculate the

play29:39

sheet okay here you can see flavorize

play29:43

has finally been broken down into the

play29:45

word flavor with the space in front of

play29:47

it and the suffix

play29:49

ize let me point out a few things here

play29:53

you'll notice

play29:55

that basically around the eighth

play29:58

pass or so it already figured out what

play30:00

the right tokenization was and it was

play30:03

just propagating it through the next few

play30:05

phases you'll also notice that some of

play30:08

these possible pairs like this one right

play30:10

here which is a with

play30:13

I those two put together has no score

play30:17

that's because that token simply does

play30:19

not exist in the vocabulary for GPT-2's

play30:23

byte pair encoding so it ends up as a

play30:25

blank the other thing that's worth

play30:27

pointing out is that how this would be

play30:29

different from say naive string matching

play30:32

and just looking for a particular set of

play30:34

characters so here let's take I here if

play30:39

I change this word to say

play30:45

Eisenberg and run the sheet

play30:49

again you'll notice that

play30:52

the final tokenization is a lot

play30:55

different we don't break out I together

play31:00

in fact it's I Zen and Berg and that's

play31:03

because of how the byte pair encoding

play31:05

tokenization algorithm is working by

play31:08

prioritizing certain sets of morphemes

play31:11

or rather subword units over others when

play31:14

it tokenizes the word into its different

play31:17

pieces okay so that's in essence the

play31:19

algorithm for how byte pair encoding works

play31:22

now you'll notice that prompt to tokens

play31:24

is laid out a little bit differently

play31:26

it's it's got a bit different formulas

play31:29

that help it work across multiple words

play31:33

so you'll notice there's multiple words

play31:35

here and it's got formulas that try to

play31:37

take into account the position of the

play31:40

cell using modulo arithmetic to make

play31:43

copy and pasting across the columns work

play31:45

a lot easier but in essence the

play31:48

algorithm implementation works roughly

play31:50

the same as I've outlined in this sheet

play31:55

here okay now that you've seen the

play31:58

learning and the tokenization algorithms

play32:01

for byte pair encoding I just want to

play32:03

wrap up with a few caveats and the first

play32:05

is that byte pair encoding is not a

play32:07

universal good it has its own trade-offs

play32:10

and this post from Andrej Karpathy highlights

play32:14

some of those problems one of those is

play32:17

this Effect called solid gold Magikarp

play32:20

and the effect is if you ask one of

play32:22

these large language models to repeat

play32:24

back certain tokens in this case

play32:27

streamer bot it responds back with

play32:29

you're a jerk or if you ask it to repeat

play32:31

back this string it responds with you

play32:33

are not a robot or you are a banana my

play32:37

favorite example of this is if you ask

play32:40

how many letters are in this username it

play32:42

can't repeat that string back to you now

play32:45

part of the problem here really isn't

play32:47

byte pair encoding itself it's the fact

play32:50

that there's a learning algorithm for

play32:52

byte pair encoding and then there's a

play32:53

separate learning algorithm for the rest

play32:56

of the large language model and what has

play32:58

happened is in this case this David JL

play33:02

is a very prolific user name on Reddit

play33:07

and that particular user occurs so often

play33:10

that it got learned by the tokenization

play33:13

algorithm but it occurs so infrequently

play33:16

in the rest of the training data across

play33:18

the rest of the internet that it has a

play33:20

very low probability of ever being

play33:23

output as an output token and so you end

play33:26

up with this disconnect between it's in the

play33:28

token space but it's not in the

play33:31

probabilistic output space and that

play33:33

creates these kinds of effects where the

play33:36

model is going to give you back some

play33:38

weirdness when you ever use these

play33:42

tokens another problem with byte pair

play33:44

encoding is it's very English Centric

play33:47

and there are languages where the

play33:49

fundamental word separation and

play33:51

principles involved in byte pair encoding

play33:53

don't work there are other encoding

play33:56

schemes SentencePiece for example is one of

play33:58

them and this is being used for English

play34:00

to Japanese translation and there are

play34:02

plenty of other tokenization algorithms

play34:05

the website Hugging Face has a great

play34:06

list and libraries for tokenization that

play34:09

you should take a look

play34:10

at and then I did say that both

play34:13

character and word-based encoding don't

play34:15

work there are certainly examples of

play34:18

other Transformer architectures that do

play34:20

use these types of tokenization here

play34:23

actually is one called Charformer that

play34:26

uses character-based

play34:28

tokenization and the tokenization

play34:30

learning algorithm actually takes place

play34:32

at the same time as the rest of the

play34:34

machine learning model which hopefully

play34:36

solves some of the other problems we've

play34:37

just talked

play34:39

about and then finally it's worth noting

play34:42

that tokens do not have to be limited to

play34:44

just characters or text anything that

play34:48

you can translate to numbers you can

play34:50

then put through the rest of the machine

play34:51

learning model whether that's audio or

play34:53

in this case images so this paper they

play34:58

used patches of an image and turned

play35:00

those into tokens that they then put

play35:02

through the rest of the Transformer

play35:06

model okay so that's how tokenization or

play35:10

byte pair encoding works inside the GPT-2

play35:15

architecture in future videos we'll

play35:17

cover the rest of the model with the

play35:20

text and position embeddings up

play35:23

next thank you


Related Tags
Tokenization, Byte Pair Encoding, AI Language Models, Natural Language Processing, Machine Learning, Text Embeddings, GPT-2 Architecture, Corpus Analysis, Spreadsheets, Subword Units, Linguistic Morphemes