But what is a GPT? Visual intro to Transformers | Deep learning, chapter 5

3Blue1Brown
1 Apr 2024 · 27:14

Summary

TL;DR: The video script offers a visual explanation of how a Generative Pretrained Transformer (GPT) works, a neural network that is key to recent advances in artificial intelligence. It discusses the process of 'pre-training' on a massive amount of data and the ability to fine-tune on specific tasks. The transformer model, introduced by Google in 2017, was designed to translate text and can generate new text sequences from an initial snippet. The video explores how the input is broken into 'tokens', converted into vectors, and processed through attention blocks and multi-layer perceptrons before producing a probability distribution for the next token. It also covers the use of the Softmax function to normalize values into a probability distribution and how 'temperature' affects the creativity of the generated text. The script prepares the viewer to understand the attention mechanism, a centerpiece of the success of modern language models.

Takeaways

  • 🧠 GPT stands for Generative Pretrained Transformer, the model behind bots that generate new text after large-scale training and fine-tuning on specific tasks.
  • 🤖 'Pretrained' means the model learned from a massive amount of data and can be fine-tuned with additional training for specific tasks.
  • 🔑 The word 'Transformer' refers to a specific kind of neural network, the core of modern artificial intelligence.
  • 🎨 Transformers can be used to build different kinds of models, from audio-to-transcript and text-to-synthetic-speech systems to image generation from text descriptions.
  • 🌐 The original 'transformer' was created by Google in 2017 for the specific purpose of translating text from one language to another.
  • 📚 The model behind ChatGPT is trained to take a piece of text and predict what follows, in the form of a probability distribution over possible next chunks of text.
  • 🛠️ Repeated prediction and sampling is the basic process happening when you interact with large language models like ChatGPT (see the sketch after this list).
  • 🔍 The input to a transformer is broken into 'tokens', which can be words, pieces of words, or other common character combinations.
  • 🔄 Tokens are associated with vectors that encode their meaning, and these vectors pass through attention blocks and multi-layer perceptron operations that update that information.
  • 📉 The Softmax function turns a list of numbers into a valid probability distribution, ensuring the values are positive and sum to 1.
  • 🌡️ The 'temperature' of the output distribution affects the originality and coherence of the generated text; a higher value allows more variety, while a lower value reinforces the most likely words.
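
To make the repeated prediction-and-sampling loop above concrete, here is a minimal Python sketch. The `model.predict_next_token_distribution` call and the `tokenizer` object are hypothetical placeholders standing in for whatever real model and tokenizer you have; they are not part of any specific library.

```python
import random

def generate(model, tokenizer, seed_text, num_tokens=50):
    """Autoregressive generation: predict a distribution, sample, append, repeat."""
    tokens = tokenizer.encode(seed_text)  # hypothetical tokenizer: text -> token ids
    for _ in range(num_tokens):
        # Hypothetical model call: returns {token_id: probability} for the next token.
        distribution = model.predict_next_token_distribution(tokens)
        next_token = random.choices(
            population=list(distribution.keys()),
            weights=list(distribution.values()),
        )[0]
        tokens.append(next_token)  # feed the sample back in and repeat
    return tokenizer.decode(tokens)
```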

Q & A

  • What do the initials GPT stand for, and what does each word represent?

    -GPT stands for Generative Pretrained Transformer. 'Generative' refers to the bots' ability to generate new text. 'Pretrained' indicates that the model has learned from a massive amount of data and is ready to be fine-tuned on specific tasks. 'Transformer' is a specific kind of neural network that underlies the current revolution in Artificial Intelligence.

  • What is a transformer model, and why is it key to the advance of AI?

    -A transformer model is a deep learning neural network that lets models process and generate text, as well as perform other natural language tasks. It is fundamental to the advance of AI because it allows models to process large amounts of data and carry out complex tasks more efficiently and effectively.

  • How does a transformer model relate to text generation or language translation?

    -A transformer model is trained to take a piece of text and predict what the next piece in the sequence should be. This predictive ability can be used to generate new text or to translate text from one language to another, since the model can learn linguistic patterns and context at scale.

  • What role do tokens play in a transformer model?

    -Tokens are the small pieces into which the input is divided. For text, these can be words, pieces of words, or other common character combinations. Each token is associated with a vector that encodes its meaning, which lets the model process and understand language.

  • What is an attention block and how does it work inside a transformer model?

    -An attention block is an operation in a transformer model that lets the vectors talk to each other and update their values based on contextual relevance. This helps the model understand how the words in a context affect the meaning of other words.

  • What is a multi-layer perceptron and how does it relate to a transformer model?

    -A multi-layer perceptron, or feed-forward layer, is an operation in a transformer model where the vectors do not talk to each other; instead, they all go through the same operation in parallel. It helps interpret and update the vectors, a bit like asking a long list of questions about each one and updating it based on the answers.

  • How does the concept of 'embedding' relate to the representation of words in a transformer model?

    -Embedding is the process of converting words into vectors in a high-dimensional space. The resulting vectors, known as embeddings, capture the meaning of words and their context, which is fundamental for predicting and generating text.

  • What is the Softmax function and how is it used in a transformer model?

    -The Softmax function is used to convert a list of numbers into a valid probability distribution. It is essential in transformer models for normalizing output values, ensuring that each value is between 0 and 1 and that they all sum to 1.

  • How does the size of GPT-3 relate to its ability to generate text?

    -GPT-3 has 175 billion parameters, which gives it a large capacity for generating text. The larger the model, the more data it can absorb and the more complex the relationships and contexts it can learn, resulting in more coherent and varied text generation.

  • What is the 'Unembedding' matrix and how is it used to predict the next token?

    -The 'Unembedding' matrix is another matrix in a transformer model that maps the last vector of the context to a list of values, one for each token in the vocabulary. It is similar to the embedding matrix but with the order reversed, and it is used to predict the next token in the sequence.

  • How is temperature used in the probability distribution to influence text generation?

    -Temperature is a parameter of the output probability distribution that controls the variability of word choices. A high temperature makes the distribution more uniform, allowing less likely words to be chosen, while a low temperature lets the most likely words dominate the choice (see the sketch following this list).
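
As a rough numerical illustration of the softmax and temperature behavior described in the answers above, here is a minimal NumPy sketch; the logit values are invented for the example.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Turn arbitrary scores into a probability distribution.

    Higher temperature -> flatter distribution; lower -> the largest score dominates.
    """
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()          # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.1, -1.0]      # made-up scores for four candidate tokens
for t in (0.5, 1.0, 2.0):
    print(f"t={t}: {np.round(softmax_with_temperature(logits, t), 3)}")
```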

Outlines

00:00

🤖 Introduction to the Generative Pretrained Transformer (GPT)

The first section introduces the concept of the Generative Pretrained Transformer (GPT), explaining that these are bots capable of generating new text. The model is trained on a massive amount of data and can be fine-tuned with additional training on specific tasks. The core of GPT is the 'transformer', a specific kind of neural network that has driven the recent boom in artificial intelligence. The goal of the video is to explain visually how a transformer works, following the flow of data and analyzing it step by step. It also mentions different transformer-based models, from audio transcription to image generation from text descriptions. The original transformer was created by Google in 2017 to translate text from one language to another, but the video focuses on the model underlying tools like ChatGPT, which is trained to predict what comes next in a given passage of text.

05:02

🔍 How a transformer-based chatbot works

The second section focuses on how a chatbot built on transformers works. It begins with a high-level explanation of the data flow through a transformer, from splitting the text into 'tokens' to associating each token with a vector that encodes its meaning. It then describes the 'attention blocks' that let the vectors interact and update their values based on context. Next comes the 'multi-layer perceptron', or 'feed-forward layer', operation, which processes the vectors in parallel. The section concludes by describing how these steps repeat, culminating in a probability distribution over possible next tokens, which is fundamental to how the chatbot works.
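
The data flow just described can be summarized in a short structural sketch. Everything here (the placeholder attention and MLP blocks, the parameter names `W_E` and `W_U`, the block count, the toy sizes) is a simplified assumption for illustration, not an actual implementation.

```python
import numpy as np

def attention_block(x, params, block):
    # Placeholder: real attention lets the vectors exchange information (next chapter's topic).
    return x

def mlp_block(x, params, block):
    # Placeholder: real MLP blocks update each vector independently, in parallel.
    return x

def transformer_forward(token_ids, params, num_blocks=12):
    """Simplified flow: embed -> [attention, MLP] x N -> unembed -> softmax."""
    x = params["W_E"][:, token_ids].T          # one embedding vector per token
    for block in range(num_blocks):
        x = attention_block(x, params, block)  # vectors talk to each other
        x = mlp_block(x, params, block)        # same operation applied to each vector
    logits = params["W_U"] @ x[-1]             # one raw score per vocabulary token
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                     # probability distribution over the next token

rng = np.random.default_rng(0)
params = {"W_E": rng.normal(size=(8, 100)),    # toy: 8-dim embeddings, 100-token vocabulary
          "W_U": rng.normal(size=(100, 8))}
print(transformer_forward([3, 14, 15], params).sum())   # ~1.0: a valid distribution
```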

10:04

🧠 Technical details of neural networks and deep learning

The third section provides technical detail on the structure and training of neural networks and deep learning. It explains that deep learning models are flexible structures with tunable parameters that are trained on examples to mimic a behavior. It mentions training algorithms such as backpropagation and the importance of models following a specific format so that these algorithms work at scale. It also describes how the input data is transformed into layers of arrays of real numbers, known as tensors, before being processed by the neural network.
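
A tiny NumPy sketch of the point made here and in the transcript: the tunable parameters only touch the data through weighted sums, and those weighted sums are usually packaged as matrix-vector products. The numbers are arbitrary.

```python
import numpy as np

W = np.array([[0.2, -1.0, 0.5],     # tunable parameters (the "weights")
              [1.5,  0.3, -0.7]])
x = np.array([4.0, 1.0, 2.0])       # data being processed

# Each output component is a weighted sum of the input components...
manual = np.array([sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W])

# ...which is exactly what a matrix-vector product computes.
assert np.allclose(manual, W @ x)
print(W @ x)                         # [0.8 4.9]
```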

15:08

📚 Text processing and word embeddings in GPT-3

The fourth section focuses on text processing in GPT-3, starting with converting words into vectors through an embedding matrix. It discusses how these vectors, known as 'word embeddings', live in a high-dimensional space and relate to the semantic meaning of words. It illustrates how directions in this space can carry semantic meaning, such as the difference between 'woman' and 'man' resembling that between 'queen' and 'king'. It also mentions that the way words are embedded is learned from data, and explores specific examples of how directions in the embedding space can represent semantic relationships between words.
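
Here is a minimal sketch of the "king + (woman - man) is close to queen" idea, using a tiny made-up embedding table and cosine similarity; real word embeddings are learned from data and have thousands of dimensions.

```python
import numpy as np

# Made-up 3-dimensional "embeddings", just to show the arithmetic.
emb = {
    "man":   np.array([1.0, 0.0, 0.2]),
    "woman": np.array([1.0, 1.0, 0.2]),
    "king":  np.array([0.2, 0.0, 1.0]),
    "queen": np.array([0.2, 1.0, 1.0]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = emb["king"] + (emb["woman"] - emb["man"])   # move in the "gender" direction
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)   # 'queen' for this toy table
```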

20:09

🔄 The attention operation and its importance in transformers

The fifth section introduces the attention operation and its crucial role in transformers. It mentions that attention lets the vectors in the network shift and adapt their meaning based on the broader context, beyond simply representing individual words. It discusses how the network can only process a fixed number of vectors at a time, known as the 'context size', and how this limits how much text a transformer can take into account when making predictions. It also describes how the final probability distribution over possible next tokens is produced, using an additional matrix called the 'Unembedding matrix'.
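
A minimal sketch of the final step described here: a hypothetical unembedding matrix `W_U` maps the last context vector to one raw score (logit) per vocabulary token, and softmax turns those scores into probabilities. The sizes are tiny placeholders; GPT-3 uses a 50,257-token vocabulary and 12,288-dimensional vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "Snape"]           # toy vocabulary
d_model = 8                                      # toy embedding dimension

W_U = rng.normal(size=(len(vocab), d_model))     # unembedding matrix: one row per token
last_vector = rng.normal(size=d_model)           # final, context-rich vector

logits = W_U @ last_vector                       # one raw score per vocabulary token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                             # valid probability distribution

for token, p in zip(vocab, probs):
    print(f"{token:6s} {p:.3f}")
```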

25:11

🌡️ The Softmax function and temperature in text generation

The sixth and final section explains the Softmax function and how it normalizes a list of numbers into a valid probability distribution. It describes how Softmax makes all the values positive and then normalizes them so the total sums to one, pushing the largest values toward one and the smallest toward zero. It also introduces the concept of 'temperature' in text generation, which controls how likely less common words are to appear. It mentions how temperature affects the quality and coherence of the generated text and how it is used in tools like GPT-3 to influence the creativity and variety of the language produced.


Keywords

💡GPT

GPT stands for Generative Pretrained Transformer, an artificial intelligence model designed to generate new text. In the video, GPT is at the core of the explanation of how text-generating 'bots' work and how, after training on a massive amount of data, they can be fine-tuned for specific tasks. For example, the video mentions that GPT-2 and GPT-3 are models that can generate stories and dialogue from an initial piece of text.

💡Transformer

A transformer is a specific class of neural network and machine learning model that is fundamental to the current revolution in artificial intelligence. The video highlights it as the key piece in how text-generating models work, allowing the model to capture relationships between words and context in order to predict and generate text coherently.

💡Language model

A language model is a type of model that processes text and can predict the next word or phrase in a given passage. The video focuses on how these models, like ChatGPT, take a piece of text and produce a probabilistic prediction of what comes next, based on a probability distribution over possible chunks of text.

💡Natural language processing

Natural language processing (NLP) is an area of artificial intelligence concerned with the interaction between computers and human language. The video describes how transformer models process language, turning text into vectors so that it can be understood and used by the machine for tasks such as machine translation or text generation.

💡Vectors

In the context of the video, vectors are lists of numbers that represent words or 'tokens' in a high-dimensional space. These vectors encode the meaning of the words and relate to one another in ways that capture context and semantic relationships. For example, the video notes that words with similar meanings tend to have vectors that are close together in the vector space.

💡Attention

Attention is a key concept in transformers that lets the vectors talk to each other and update their values based on context. The video describes how the attention operation lets the model determine which words are relevant for updating the meaning of other words within a specific context.

💡Multi-layer perceptron

A multi-layer perceptron, also known as a feed-forward layer, is the part of a transformer model where the vectors do not talk to each other; instead, they all go through the same operation in parallel. In the video, this operation is described as being a bit like asking a long list of questions about each vector and updating it based on the answers.
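
As a rough sketch of what such a block can look like, the snippet below assumes the common two-layer form: a linear map, a nonlinearity, another linear map, applied to each vector independently, plus a residual add. The sizes and the ReLU nonlinearity here are placeholder choices, not the exact details of any particular GPT model.

```python
import numpy as np

def feed_forward_block(x, W1, b1, W2, b2):
    """Apply the same two-layer MLP to every vector (row of x) independently."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # linear map, then ReLU-style nonlinearity
    return x + hidden @ W2 + b2             # project back and add to the original vectors

rng = np.random.default_rng(0)
d_model, d_hidden, context = 8, 32, 5
x = rng.normal(size=(context, d_model))                 # one vector per token
W1 = rng.normal(size=(d_model, d_hidden)); b1 = np.zeros(d_hidden)
W2 = rng.normal(size=(d_hidden, d_model)); b2 = np.zeros(d_model)

print(feed_forward_block(x, W1, b1, W2, b2).shape)      # (5, 8): same shape, updated vectors
```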

💡Training

Training is the process by which machine learning models learn from data. The video mentions that the model's weights or matrices, which start out random, are adjusted or 'learned' during training based on data, improving the accuracy of its predictions and the quality of the generated text.

💡Embedding

Embedding is the process of converting words into vectors in a high-dimensional space, where each direction in the space can carry semantic meaning. The video illustrates how the model learns to assign vectors to words so that words with similar meanings end up close together in this space, allowing the model to capture richer relationships and context.
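
A minimal sketch of the embedding lookup described above: an embedding matrix with one column per vocabulary token, where a token's vector is simply its column. The tiny vocabulary and dimension are placeholders; GPT-3's embedding matrix has 50,257 columns of 12,288 dimensions each.

```python
import numpy as np

vocab = ["the", "king", "lived", "in", "Scotland"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

d_embed = 6                                              # toy embedding dimension
rng = np.random.default_rng(0)
W_E = rng.normal(size=(d_embed, len(vocab)))             # starts random, learned in training

def embed(tokens):
    """Pluck out one column of W_E per token."""
    ids = [token_to_id[t] for t in tokens]
    return W_E[:, ids].T                                 # shape: (num_tokens, d_embed)

print(embed(["the", "king"]).shape)                      # (2, 6)
```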

💡Softmax

The Softmax function is a mathematical function used to normalize a set of numbers into a probability distribution. The video describes how Softmax converts a list of values, such as the logits produced by the model, into a valid probability distribution in which the largest values end up close to 1 and the smallest close to 0, enabling probability-based text generation.

Highlights

The initials GPT stand for Generative Pretrained Transformer.

Pretrained refers to the model's learning from a massive amount of data, with room for fine-tuning on specific tasks.

A transformer is a specific kind of neural network and is the core invention underlying the current boom in AI.

Transformers can be used for various tasks, such as generating synthetic speech from text and creating images from text descriptions.

The original transformer introduced by Google in 2017 was invented for translating text from one language to another.

Tools like ChatGPT use a variant of transformers trained to take in text and produce a prediction for what comes next in the passage.

The process of predicting and sampling repeatedly is what allows models like GPT-3 to generate coherent stories.

Transformers break input into tokens, which are then associated with vectors that encode the meaning of each piece.

Attention blocks allow tokens to interact and update their values based on context, enhancing the meaning encoded in vectors.

A multi-layer perceptron block, or feed-forward layer, processes each vector independently and updates them based on various criteria.

The network's goal is to transform input vectors progressively, incorporating more context with each layer.

The final layer produces a probability distribution over all possible next tokens, determining the most likely next word.

Word embeddings encode semantic meanings in high-dimensional space, where similar words have vectors close to each other.

The training process adjusts weights in matrices through backpropagation, enabling the model to learn from data.

Softmax is used to convert logits into a probability distribution, ensuring the sum of probabilities equals one.

Temperature adjustment in softmax can control the randomness of the model's output, affecting the creativity and coherence of generated text.

The embedding matrix in a model like GPT-3 alone contains roughly 617 million parameters, highlighting the complexity and scale of modern transformers.

Training transformers involves tuning billions of parameters to optimize performance and generalize across various tasks.

Understanding foundational concepts like dot products, matrix multiplications, and word embeddings is crucial for grasping the attention mechanism (a small dot-product sketch follows these highlights).

Future chapters will delve into the attention blocks, multi-layer perceptron blocks, and the overall training process of transformers.
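
A minimal sketch of the dot-product intuition from the highlights (the "plurality direction" experiment from the transcript): dot products measure how well two vectors align, so plural words should score higher against a "plural minus singular" direction. The toy vectors are invented for illustration.

```python
import numpy as np

# Toy embeddings: the second coordinate plays the role of a "plurality" direction.
emb = {
    "cat":  np.array([1.0, 0.1, 0.3]),
    "cats": np.array([1.0, 0.9, 0.3]),
    "dog":  np.array([0.8, 0.1, 0.5]),
    "dogs": np.array([0.8, 0.9, 0.5]),
}

plurality = emb["cats"] - emb["cat"]        # hypothesized "plural minus singular" direction

for word in ("cat", "cats", "dog", "dogs"):
    # A larger dot product means the word aligns more with the plurality direction.
    print(f"{word:5s} {emb[word] @ plurality:+.2f}")
```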

Transcripts

play00:00

The initials GPT stand for Generative Pretrained Transformer.

play00:05

So that first word is straightforward enough, these are bots that generate new text.

play00:09

Pretrained refers to how the model went through a process of learning

play00:13

from a massive amount of data, and the prefix insinuates that there's

play00:16

more room to fine-tune it on specific tasks with additional training.

play00:20

But the last word, that's the real key piece.

play00:23

A transformer is a specific kind of neural network, a machine learning model,

play00:27

and it's the core invention underlying the current boom in AI.

play00:31

What I want to do with this video and the following chapters is go through

play00:35

a visually-driven explanation for what actually happens inside a transformer.

play00:39

We're going to follow the data that flows through it and go step by step.

play00:43

There are many different kinds of models that you can build using transformers.

play00:47

Some models take in audio and produce a transcript.

play00:51

This sentence comes from a model going the other way around,

play00:54

producing synthetic speech just from text.

play00:56

All those tools that took the world by storm in 2022 like DALL·E and Midjourney

play01:01

that take in a text description and produce an image are based on transformers.

play01:06

Even if I can't quite get it to understand what a pie creature is supposed to be,

play01:09

I'm still blown away that this kind of thing is even remotely possible.

play01:13

And the original transformer introduced in 2017 by Google was invented for

play01:18

the specific use case of translating text from one language into another.

play01:22

But the variant that you and I will focus on, which is the type that

play01:26

underlies tools like ChatGPT, will be a model that's trained to take in a piece of text,

play01:31

maybe even with some surrounding images or sound accompanying it,

play01:34

and produce a prediction for what comes next in the passage.

play01:38

That prediction takes the form of a probability distribution

play01:41

over many different chunks of text that might follow.

play01:45

At first glance, you might think that predicting the next

play01:47

word feels like a very different goal from generating new text.

play01:50

But once you have a prediction model like this,

play01:52

a simple way to generate a longer piece of text is to give it an initial

play01:56

snippet to work with, have it take a random sample from the distribution

play02:00

it just generated, append that sample to the text,

play02:03

and then run the whole process again to make a new prediction based on all the new text,

play02:08

including what it just added.

play02:10

I don't know about you, but it really doesn't feel like this should actually work.

play02:13

In this animation, for example, I'm running GPT-2 on my laptop and having it repeatedly

play02:17

predict and sample the next chunk of text to generate a story based on the seed text.

play02:22

The story just doesn't really make that much sense.

play02:26

But if I swap it out for API calls to GPT-3 instead, which is the same basic model,

play02:31

just much bigger, suddenly almost magically we do get a sensible story,

play02:35

one that even seems to infer that a pi creature would live in a land of math and

play02:40

computation.

play02:41

This process here of repeated prediction and sampling is essentially

play02:45

what's happening when you interact with ChatGPT or any of these other

play02:48

large language models and you see them producing one word at a time.

play02:52

In fact, one feature that I would very much enjoy is the ability to

play02:55

see the underlying distribution for each new word that it chooses.

play03:03

Let's kick things off with a very high level preview

play03:06

of how data flows through a transformer.

play03:08

We will spend much more time motivating and interpreting and expanding

play03:12

on the details of each step, but in broad strokes,

play03:14

when one of these chatbots generates a given word, here's what's going on under the hood.

play03:19

First, the input is broken up into a bunch of little pieces.

play03:22

These pieces are called tokens, and in the case of text these tend to be

play03:26

words or little pieces of words or other common character combinations.

play03:30

If images or sound are involved, then tokens could be

play03:33

little patches of that image or little chunks of that sound.

play03:37

Each one of these tokens is then associated with a vector,

play03:40

meaning some list of numbers, which is meant to somehow encode the meaning of that piece.

play03:45

If you think of these vectors as giving coordinates in some very high dimensional space,

play03:50

words with similar meanings tend to land on vectors that are

play03:53

close to each other in that space.

play03:55

This sequence of vectors then passes through an operation that's

play03:58

known as an attention block, and this allows the vectors to talk to

play04:01

each other and pass information back and forth to update their values.

play04:04

For example, the meaning of the word model in the phrase a machine

play04:08

learning model is different from its meaning in the phrase a fashion model.

play04:12

The attention block is what's responsible for figuring out which

play04:15

words in context are relevant to updating the meanings of which other words,

play04:19

and how exactly those meanings should be updated.

play04:22

And again, whenever I use the word meaning, this is

play04:25

somehow entirely encoded in the entries of those vectors.

play04:29

After that, these vectors pass through a different kind of operation,

play04:32

and depending on the source that you're reading this will be referred

play04:35

to as a multi-layer perceptron or maybe a feed-forward layer.

play04:38

And here the vectors don't talk to each other,

play04:40

they all go through the same operation in parallel.

play04:43

And while this block is a little bit harder to interpret,

play04:45

later on we'll talk about how the step is a little bit like asking a long list

play04:49

of questions about each vector, and then updating them based on the answers

play04:53

to those questions.

play04:54

All of the operations in both of these blocks look like a

play04:58

giant pile of matrix multiplications, and our primary job is

play05:01

going to be to understand how to read the underlying matrices.

play05:06

I'm glossing over some details about some normalization steps that happen in between,

play05:10

but this is after all a high-level preview.

play05:13

After that, the process essentially repeats, you go back and forth

play05:17

between attention blocks and multi-layer perceptron blocks,

play05:20

until at the very end the hope is that all of the essential meaning

play05:24

of the passage has somehow been baked into the very last vector in the sequence.

play05:28

We then perform a certain operation on that last vector that produces a probability

play05:33

distribution over all possible tokens, all possible little chunks of text that might come

play05:38

next.

play05:38

And like I said, once you have a tool that predicts what comes next

play05:42

given a snippet of text, you can feed it a little bit of seed text and

play05:45

have it repeatedly play this game of predicting what comes next,

play05:49

sampling from the distribution, appending it, and then repeating over and over.

play05:53

Some of you in the know may remember how long before ChatGPT came into the scene,

play05:57

this is what early demos of GPT-3 looked like,

play06:00

you would have it autocomplete stories and essays based on an initial snippet.

play06:05

To make a tool like this into a chatbot, the easiest starting point is to have

play06:09

a little bit of text that establishes the setting of a user interacting with a

play06:13

helpful AI assistant, what you would call the system prompt,

play06:17

and then you would use the user's initial question or prompt as the first bit of

play06:21

dialogue, and then you have it start predicting what such a helpful AI assistant

play06:25

would say in response.

play06:27

There is more to say about another step of training that's required to make this work well,

play06:32

but at a high level this is the idea.

play06:35

In this chapter, you and I are going to expand on the details of what happens at the very

play06:40

beginning of the network, at the very end of the network,

play06:42

and I also want to spend a lot of time reviewing some important bits of background

play06:46

knowledge, things that would have been second nature to any machine learning engineer by

play06:50

the time transformers came around.

play06:53

If you're comfortable with that background knowledge and a little impatient,

play06:56

you could feel free to skip to the next chapter,

play06:58

which is going to focus on the attention blocks,

play07:00

generally considered the heart of the transformer.

play07:03

After that I want to talk more about these multi-layer perceptron blocks,

play07:07

how training works, and a number of other details that will have been skipped up to

play07:11

that point.

play07:12

For broader context, these videos are additions to a mini-series about deep learning,

play07:16

and it's okay if you haven't watched the previous ones,

play07:18

I think you can do it out of order, but before diving into transformers specifically,

play07:22

I do think it's worth making sure that we're on the same page about the basic premise

play07:27

and structure of deep learning.

play07:29

At the risk of stating the obvious, this is one approach to machine learning,

play07:33

which describes any model where you're using data to somehow

play07:35

determine how a model behaves.

play07:39

What I mean by that is, let's say you want a function that takes in

play07:42

an image and it produces a label describing it,

play07:44

or our example of predicting the next word given a passage of text,

play07:48

or any other task that seems to require some element of intuition and pattern recognition.

play07:53

We almost take this for granted these days, but the idea with machine learning is

play07:57

that rather than trying to explicitly define a procedure for how to do that task in code,

play08:02

which is what people would have done in the earliest days of AI,

play08:05

instead you set up a very flexible structure with tunable parameters,

play08:09

like a bunch of knobs and dials, and then somehow you use many examples of what the

play08:13

output should look like for a given input to tweak and tune the values of those

play08:17

parameters to mimic this behavior.

play08:19

For example, maybe the simplest form of machine learning is linear regression,

play08:24

where your inputs and outputs are each single numbers,

play08:27

something like the square footage of a house and its price,

play08:30

and what you want is to find a line of best fit through this data, you know,

play08:35

to predict future house prices.

play08:37

That line is described by two continuous parameters,

play08:40

say the slope and the y-intercept, and the goal of linear

play08:44

regression is to determine those parameters to closely match the data.

play08:48

Needless to say, deep learning models get much more complicated.

play08:52

GPT-3, for example, has not two, but 175 billion parameters.

play08:58

But here's the thing, it's not a given that you can create some giant

play09:02

model with a huge number of parameters without it either grossly

play09:05

overfitting the training data or being completely intractable to train.

play09:10

Deep learning describes a class of models that in the

play09:13

last couple decades have proven to scale remarkably well.

play09:16

What unifies them is the same training algorithm, called backpropagation,

play09:20

and the context I want you to have as we go in is that in order for this training

play09:25

algorithm to work well at scale, these models have to follow a certain specific format.

play09:31

If you know this format going in, it helps to explain many of the choices for how

play09:36

a transformer processes language, which otherwise run the risk of feeling arbitrary.

play09:41

First, whatever model you're making, the input

play09:44

has to be formatted as an array of real numbers.

play09:46

This could mean a list of numbers, it could be a two-dimensional array,

play09:50

or very often you deal with higher dimensional arrays,

play09:53

where the general term used is tensor.

play09:56

You often think of that input data as being progressively transformed into many

play10:00

distinct layers, where again, each layer is always structured as some kind of

play10:04

array of real numbers, until you get to a final layer which you consider the output.

play10:09

For example, the final layer in our text processing model is a list of

play10:12

numbers representing the probability distribution for all possible next tokens.

play10:17

In deep learning, these model parameters are almost always referred to as weights,

play10:22

and this is because a key feature of these models is that the only way these

play10:26

parameters interact with the data being processed is through weighted sums.

play10:30

You also sprinkle some non-linear functions throughout,

play10:32

but they won't depend on parameters.

play10:35

Typically though, instead of seeing the weighted sums all naked

play10:38

and written out explicitly like this, you'll instead find them

play10:42

packaged together as various components in a matrix vector product.

play10:46

It amounts to saying the same thing, if you think back to how matrix vector

play10:50

multiplication works, each component in the output looks like a weighted sum.

play10:54

It's just often conceptually cleaner for you and me to think

play10:58

about matrices that are filled with tunable parameters that

play11:01

transform vectors that are drawn from the data being processed.

play11:06

For example, those 175 billion weights in GPT-3 are

play11:10

organized into just under 28,000 distinct matrices.

play11:14

Those matrices in turn fall into eight different categories,

play11:17

and what you and I are going to do is step through each one of those categories to

play11:21

understand what that type does.

play11:23

As we go through, I think it's kind of fun to reference the specific

play11:27

numbers from GPT-3 to count up exactly where those 175 billion come from.

play11:31

Even if nowadays there are bigger and better models,

play11:34

this one has a certain charm as the large-language model to really capture the world's

play11:38

attention outside of ML communities.

play11:41

Also, practically speaking, companies tend to keep much tighter

play11:44

lips around the specific numbers for more modern networks.

play11:47

I just want to set the scene going in, that as you peek under the

play11:50

hood to see what happens inside a tool like ChatGPT,

play11:53

almost all of the actual computation looks like matrix vector multiplication.

play11:57

There's a little bit of a risk getting lost in the sea of billions of numbers,

play12:01

but you should draw a very sharp distinction in your mind between

play12:05

the weights of the model, which I'll always color in blue or red,

play12:08

and the data being processed, which I'll always color in gray.

play12:12

The weights are the actual brains, they are the things learned during training,

play12:16

and they determine how it behaves.

play12:18

The data being processed simply encodes whatever specific input is

play12:22

fed into the model for a given run, like an example snippet of text.

play12:27

With all of that as foundation, let's dig into the first step of this text processing

play12:31

example, which is to break up the input into little chunks and turn those chunks into

play12:36

vectors.

play12:37

I mentioned how those chunks are called tokens,

play12:39

which might be pieces of words or punctuation,

play12:41

but every now and then in this chapter and especially in the next one,

play12:44

I'd like to just pretend that it's broken more cleanly into words.

play12:48

Because we humans think in words, this will just make it much

play12:51

easier to reference little examples and clarify each step.

play12:55

The model has a predefined vocabulary, some list of all possible words,

play12:59

say 50,000 of them, and the first matrix that we'll encounter,

play13:03

known as the embedding matrix, has a single column for each one of these words.

play13:08

These columns are what determines what vector each word turns into in that first step.

play13:15

We label it We, and like all the matrices we see,

play13:18

its values begin random, but they're going to be learned based on data.

play13:23

Turning words into vectors was common practice in machine learning long before

play13:27

transformers, but it's a little weird if you've never seen it before,

play13:30

and it sets the foundation for everything that follows,

play13:33

so let's take a moment to get familiar with it.

play13:36

We often call this embedding a word, which invites you to think of

play13:39

these vectors very geometrically as points in some high dimensional space.

play13:44

Visualizing a list of three numbers as coordinates for points in 3D space

play13:47

would be no problem, but word embeddings tend to be much much higher dimensional.

play13:52

In GPT-3 they have 12,288 dimensions, and as you'll see,

play13:56

it matters to work in a space that has a lot of distinct directions.

play14:01

In the same way that you could take a two-dimensional slice through a 3D space

play14:05

and project all the points onto that slice, for the sake of animating word

play14:08

embeddings that a simple model is giving me, I'm going to do an analogous

play14:12

thing by choosing a three-dimensional slice through this very high dimensional space,

play14:16

and projecting the word vectors down onto that and displaying the results.

play14:21

The big idea here is that as a model tweaks and tunes its weights to determine

play14:25

how exactly words get embedded as vectors during training,

play14:28

it tends to settle on a set of embeddings where directions in the space have a

play14:33

kind of semantic meaning.

play14:34

For the simple word-to-vector model I'm running here,

play14:37

if I run a search for all the words whose embeddings are closest to that of tower,

play14:42

you'll notice how they all seem to give very similar tower-ish vibes.

play14:46

And if you want to pull up some Python and play along at home,

play14:48

this is the specific model that I'm using to make the animations.

play14:51

It's not a transformer, but it's enough to illustrate the

play14:54

idea that directions in the space can carry semantic meaning.

play14:58

A very classic example of this is how if you take the difference between the vectors

play15:03

for woman and man, something you would visualize as a little vector connecting the tip

play15:08

of one to the tip of the other, it's very similar to the difference between king and

play15:12

queen.

play15:15

So let's say you didn't know the word for a female monarch,

play15:18

you could find it by taking king, adding this woman-man direction,

play15:22

and searching for the embeddings closest to that point.

play15:27

At least, kind of.

play15:28

Despite this being a classic example for the model I'm playing with,

play15:31

the true embedding of queen is actually a little farther off than this would suggest,

play15:35

presumably because the way queen is used in training data is not merely a feminine

play15:40

version of king.

play15:41

When I played around, family relations seemed to illustrate the idea much better.

play15:46

The point is, it looks like during training the model found it advantageous to

play15:50

choose embeddings such that one direction in this space encodes gender information.

play15:56

Another example is that if you take the embedding of Italy,

play16:00

and you subtract the embedding of Germany, and add that to the embedding of Hitler,

play16:04

you get something very close to the embedding of Mussolini.

play16:08

It's as if the model learned to associate some directions with Italian-ness,

play16:13

and others with WWII axis leaders.

play16:16

Maybe my favorite example in this vein is how in some models,

play16:19

if you take the difference between Germany and Japan, and add it to sushi,

play16:24

you end up very close to bratwurst.

play16:27

Also in playing this game of finding nearest neighbors,

play16:30

I was pleased to see how close cat was to both beast and monster.

play16:34

One bit of mathematical intuition that's helpful to have in mind,

play16:37

especially for the next chapter, is how the dot product of two

play16:40

vectors can be thought of as a way to measure how well they align.

play16:44

Computationally, dot products involve multiplying all the

play16:47

corresponding components and then adding the results, which is good,

play16:51

since so much of our computation has to look like weighted sums.

play16:55

Geometrically, the dot product is positive when vectors point in similar directions,

play17:00

it's zero if they're perpendicular, and it's negative whenever

play17:03

they point in opposite directions.

play17:06

For example, let's say you were playing with this model,

play17:09

and you hypothesize that the embedding of cats minus cat might represent a sort of

play17:14

plurality direction in this space.

play17:17

To test this, I'm going to take this vector and compute its dot

play17:20

product against the embeddings of certain singular nouns,

play17:23

and compare it to the dot products with the corresponding plural nouns.

play17:27

If you play around with this, you'll notice that the plural ones

play17:30

do indeed seem to consistently give higher values than the singular ones,

play17:33

indicating that they align more with this direction.

play17:37

It's also fun how if you take this dot product with the embeddings of the words 1,

play17:41

2, 3, and so on, they give increasing values, so it's as if we can

play17:45

quantitatively measure how plural the model finds a given word.

play17:50

Again, the specifics for how words get embedded is learned using data.

play17:54

This embedding matrix, whose columns tell us what happens to each word,

play17:57

is the first pile of weights in our model.

play18:00

Using the GPT-3 numbers, the vocabulary size specifically is 50,257,

play18:04

and again, technically this consists not of words per se, but of tokens.

play18:10

The embedding dimension is 12,288, and multiplying

play18:13

those tells us this consists of about 617 million weights.

play18:18

Let's go ahead and add this to a running tally,

play18:20

remembering that by the end we should count up to 175 billion.

play18:25

In the case of transformers, you really want to think of the vectors

play18:28

in this embedding space as not merely representing individual words.

play18:32

For one thing, they also encode information about the position of that word,

play18:36

which we'll talk about later, but more importantly,

play18:39

you should think of them as having the capacity to soak in context.

play18:43

A vector that started its life as the embedding of the word king, for example,

play18:47

might progressively get tugged and pulled by various blocks in this network,

play18:51

so that by the end it points in a much more specific and nuanced direction that

play18:55

somehow encodes that it was a king who lived in Scotland,

play18:58

and who had achieved his post after murdering the previous king,

play19:02

and who's being described in Shakespearean language.

play19:05

Think about your own understanding of a given word.

play19:08

The meaning of that word is clearly informed by the surroundings,

play19:11

and sometimes this includes context from a long distance away,

play19:15

so in putting together a model that has the ability to predict what word comes next,

play19:19

the goal is to somehow empower it to incorporate context efficiently.

play19:24

To be clear, in that very first step, when you create the array of

play19:27

vectors based on the input text, each one of those is simply plucked

play19:30

out of the embedding matrix, so initially each one can only encode

play19:33

the meaning of a single word without any input from its surroundings.

play19:37

But you should think of the primary goal of this network that it flows through

play19:41

as being to enable each one of those vectors to soak up a meaning that's much

play19:45

more rich and specific than what mere individual words could represent.

play19:49

The network can only process a fixed number of vectors at a time,

play19:52

known as its context size.

play19:54

For GPT-3 it was trained with a context size of 2048,

play19:57

so the data flowing through the network always looks like this array of 2048 columns,

play20:02

each of which has 12,000 dimensions.

play20:05

This context size limits how much text the transformer can

play20:08

incorporate when it's making a prediction of the next word.

play20:12

This is why long conversations with certain chatbots,

play20:15

like the early versions of ChatGPT, often gave the feeling of

play20:18

the bot kind of losing the thread of conversation as you continued too long.

play20:23

We'll go into the details of attention in due time,

play20:25

but skipping ahead I want to talk for a minute about what happens at the very end.

play20:29

Remember, the desired output is a probability

play20:32

distribution over all tokens that might come next.

play20:35

For example, if the very last word is Professor,

play20:37

and the context includes words like Harry Potter,

play20:40

and immediately preceding we see least favorite teacher,

play20:43

and also if you give me some leeway by letting me pretend that tokens simply

play20:47

look like full words, then a well-trained network that had built up knowledge

play20:51

of Harry Potter would presumably assign a high number to the word Snape.

play20:56

This involves two different steps.

play20:58

The first one is to use another matrix that maps the very last vector in

play21:02

that context to a list of 50,000 values, one for each token in the vocabulary.

play21:08

Then there's a function that normalizes this into a probability distribution,

play21:12

it's called Softmax and we'll talk more about it in just a second,

play21:15

but before that it might seem a little bit weird to only use this last embedding

play21:19

to make a prediction, when after all in that last step there are thousands of

play21:23

other vectors in the layer just sitting there with their own context-rich meanings.

play21:28

This has to do with the fact that in the training process it turns out to be

play21:32

much more efficient if you use each one of those vectors in the final layer

play21:36

to simultaneously make a prediction for what would come immediately after it.

play21:40

There's a lot more to be said about training later on,

play21:43

but I just want to call that out right now.

play21:45

This matrix is called the Unembedding matrix and we give it the label WU.

play21:50

Again, like all the weight matrices we see, its entries begin at random,

play21:53

but they are learned during the training process.

play21:56

Keeping score on our total parameter count, this Unembedding

play21:59

matrix has one row for each word in the vocabulary,

play22:02

and each row has the same number of elements as the embedding dimension.

play22:06

It's very similar to the embedding matrix, just with the order swapped,

play22:10

so it adds another 617 million parameters to the network,

play22:13

meaning our count so far is a little over a billion,

play22:16

a small but not wholly insignificant fraction of the 175 billion

play22:20

we'll end up with in total.

play22:22

As the last mini-lesson for this chapter, I want to talk more about this softmax

play22:26

function, since it makes another appearance for us once we dive into the attention blocks.

play22:31

The idea is that if you want a sequence of numbers to act as a probability distribution,

play22:36

say a distribution over all possible next words,

play22:39

then each value has to be between 0 and 1, and you also need all of them to add up to 1.

play22:45

However, if you're playing the learning game where everything you do looks like

play22:49

matrix-vector multiplication, the outputs you get by default don't abide by this at all.

play22:55

The values are often negative, or much bigger than 1,

play22:57

and they almost certainly don't add up to 1.

play23:00

Softmax is the standard way to turn an arbitrary list of numbers

play23:04

into a valid distribution in such a way that the largest values end up closest to 1,

play23:08

and the smaller values end up very close to 0.

play23:11

That's all you really need to know.

play23:13

But if you're curious, the way it works is to first raise e to the power

play23:17

of each of the numbers, which means you now have a list of positive values,

play23:21

and then you can take the sum of all those positive values and divide each

play23:25

term by that sum, which normalizes it into a list that adds up to 1.

play23:30

You'll notice that if one of the numbers in the input is meaningfully bigger than

play23:34

the rest, then in the output the corresponding term dominates the distribution,

play23:38

so if you were sampling from it you'd almost certainly just be picking the maximizing

play23:42

input.

play23:42

But it's softer than just picking the max in the sense that when other

play23:46

values are similarly large, they also get meaningful weight in the distribution,

play23:50

and everything changes continuously as you continuously vary the inputs.

play23:55

In some situations, like when ChatGPT is using this distribution to create a next word,

play24:00

there's room for a little bit of extra fun by adding a little extra spice into this

play24:04

function, with a constant t thrown into the denominator of those exponents.

play24:09

We call it the temperature, since it vaguely resembles the role of temperature in

play24:14

certain thermodynamics equations, and the effect is that when t is larger,

play24:18

you give more weight to the lower values, meaning the distribution is a little bit

play24:22

more uniform, and if t is smaller, then the bigger values will dominate more

play24:26

aggressively, where in the extreme, setting t equal to zero means all of the weight

play24:31

goes to the maximum value.

play24:33

For example, I'll have GPT-3 generate a story with the seed text,

play24:37

once upon a time there was A, but I'll use different temperatures in each case.

play24:43

Temperature zero means that it always goes with the most predictable word,

play24:48

and what you get ends up being a trite derivative of Goldilocks.

play24:53

A higher temperature gives it a chance to choose less likely words,

play24:56

but it comes with a risk.

play24:58

In this case, the story starts out more originally,

play25:01

about a young web artist from South Korea, but it quickly degenerates into nonsense.

play25:06

Technically speaking, the API doesn't actually let you pick a temperature bigger than 2.

play25:11

There's no mathematical reason for this, it's just an arbitrary constraint imposed

play25:15

to keep their tool from being seen generating things that are too nonsensical.

play25:19

So if you're curious, the way this animation is actually working is I'm taking the

play25:24

20 most probable next tokens that GPT-3 generates,

play25:27

which seems to be the maximum they'll give me,

play25:29

and then I tweak the probabilities based on an exponent of 1/5th.

play25:33

As another bit of jargon, in the same way that you might call the components of

play25:37

the output of this function probabilities, people often refer to the inputs as logits,

play25:42

though people pronounce it a couple of different ways, I'm gonna say logits.

play25:46

So for instance, when you feed in some text, you have all these word embeddings

play25:50

flow through the network, and you do this final multiplication with the

play25:54

unembedding matrix, machine learning people would refer to the components in that raw,

play25:58

unnormalized output as the logits for the next word prediction.

play26:03

A lot of the goal with this chapter was to lay the foundations for

play26:06

understanding the attention mechanism, Karate Kid wax-on-wax-off style.

play26:10

You see, if you have a strong intuition for word embeddings, for softmax,

play26:14

for how dot products measure similarity, and also the underlying premise that

play26:19

most of the calculations have to look like matrix multiplication with matrices

play26:23

full of tunable parameters, then understanding the attention mechanism,

play26:27

this cornerstone piece in the whole modern boom in AI, should be relatively smooth.

play26:32

For that, come join me in the next chapter.

play26:36

As I'm publishing this, a draft of that next chapter

play26:38

is available for review by Patreon supporters.

play26:41

A final version should be up in public in a week or two,

play26:44

it usually depends on how much I end up changing based on that review.

play26:47

In the meantime, if you want to dive into attention,

play26:49

and if you want to help the channel out a little bit, it's there waiting.


Related Tags
Artificial Intelligence · Transformers · Deep Learning · Language Processing · Network Models · GPT-3 · Chatbots · Attention · Softmax · Semantic Vectors