AI can't cross this line and we don't know why.

Welch Labs
13 Sept 2024 · 24:07

Summary

TL;DR: The video delves into the intriguing world of AI model scaling, highlighting the 'compute optimal frontier' that no model has crossed. It discusses the neural scaling laws observed in AI, which relate model performance to dataset size, model size, and compute. It explores the potential of driving error rates to zero with larger models and more data, and the limits imposed by the entropy of natural language. It also touches on the release of GPT-3 and GPT-4, showcasing the predictive power of scaling laws and the quest for a unified theory of AI.

Takeaways

  • 🧠 AI model loss curves are bounded by the 'compute optimal frontier': on logarithmic axes, no model's error curve crosses this line, no matter how long it trains.
  • 📈 The error rate of AI models decreases with increased model size and compute, following a power law relationship that is consistent across different model architectures.
  • 🔍 OpenAI's research in 2020 demonstrated clear performance trends across various scales for language models, fitting power law equations to predict how performance scales with compute, data set size, and model size.
  • 💡 The introduction of GPT-3 by OpenAI, trained on a massive scale with 175 billion parameters, followed the predicted performance trends and showed that larger models continue to improve.
  • 📊 Loss values in AI models are crucial for guiding the optimization process during training, with cross-entropy loss being particularly effective for models like GPT-3.
  • 🌐 The 'manifold hypothesis' suggests that deep learning models map high-dimensional data to lower-dimensional manifolds where the position encodes meaningful information.
  • 📉 Neural scaling laws indicate that the performance of AI models scales with the size of the training dataset and model size, with a relationship that can be described by power laws.
  • 🔮 Theoretical work supports the idea that model performance scales with the resolution at which the model can fit the data manifold, which is influenced by the amount of training data.
  • 🔬 Empirical results from OpenAI and DeepMind have shown that neural scaling laws hold across a vast range of scales, providing a predictive framework for AI model performance.
  • 🚀 The pursuit of a unified theory of AI scaling continues, with the potential to guide future advancements in AI capabilities and the understanding of intelligent systems.

Q & A

  • What is the compute optimal frontier in AI models?

    -The compute optimal frontier is an empirical boundary that no trained AI model has crossed. On a log-log plot of error rate versus compute, it is the line below which no model achieves a lower error rate for a given compute budget, indicating the limits of performance improvement as more compute is applied.

  • How do neural scaling laws relate to the performance of AI models?

    -Neural scaling laws describe the relationship between the performance of AI models, the size of the model, the amount of data used to train the model, and the compute power applied. These laws have been observed to hold across a wide range of scales and are used to predict how performance will scale with increases in these factors.

  • What is the significance of the 2020 paper by OpenAI in the context of AI scaling?

    -The 2020 paper by OpenAI was significant because it demonstrated clear performance trends across various scales for language models. It introduced the concept of neural scaling laws and provided a method to predict how performance scales with compute, data set size, and model size using power law equations.

  • What is the role of the parameter count in the training of large AI models like GPT-3?

    -The parameter count in AI models like GPT-3 is crucial as it determines the model's capacity to learn and represent complex patterns in data. Larger models with more parameters can achieve lower error rates but require more compute to train effectively.

  • How does the concept of entropy relate to the performance limits of AI models?

    -Entropy, in the context of AI models, refers to the inherent uncertainty or randomness in natural language data. It represents the irreducible error term that even the most powerful models cannot overcome, suggesting that there is a fundamental limit to how low error rates can go, even with infinite compute and data.

  • What is the manifold hypothesis in machine learning, and how does it connect to neural scaling laws?

    -The manifold hypothesis posits that high-dimensional data, like images or text, lie on a lower-dimensional manifold within the high-dimensional space. Neural networks are thought to learn the shape of this manifold, and the scaling laws relate to how well the model can resolve the details of this manifold based on the amount of data and model size.

  • What is the difference between L1 loss and cross entropy loss in training AI models?

    -L1 loss measures the absolute difference between the predicted probability and the true value, while cross-entropy loss takes the negative natural log of the probability the model assigns to the correct answer. Cross-entropy loss is more commonly used in practice because it penalizes the model heavily when it assigns very low probability to the correct answer.
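
    In symbols, writing p for the probability the model assigns to the correct word, the two losses described in the video are:

    ```latex
    \mathcal{L}_{\mathrm{L1}} = 1 - p,
    \qquad
    \mathcal{L}_{\mathrm{CE}} = -\ln p
    ```

    Both are zero when p = 1 and nearly agree for p close to 1, but cross-entropy diverges as p approaches 0, which is what penalizes confident mistakes so heavily.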

  • How did OpenAI estimate the entropy of natural language in their scaling studies?

    -OpenAI fit power-law models with a constant irreducible error term to their loss curves, and estimated the entropy of data sources like low-resolution images and video in two ways: from where the model-size scaling curve levels off, and from where the compute curve levels off; the two estimates agreed well. For natural language itself they could not obtain a meaningful estimate; Google DeepMind's later experiments fit an irreducible term of about 1.69.

  • What does the term 'resolution limited scaling' refer to in the context of AI model performance?

    -Resolution-limited scaling refers to the theoretical prediction that cross-entropy loss scales with training dataset size D as D^(-4/d), where d is the intrinsic dimension of the data manifold, indicating that more data allows the model to resolve the details of the data manifold at finer resolution.

  • What are the implications of neural scaling laws for the future development of AI?

    -Neural scaling laws provide a predictive framework for how AI model performance will improve with increased data, model size, and compute. These laws suggest that AI performance can continue to improve along a predictable trajectory, but also hint at fundamental limits imposed by the nature of the data and the architecture of the models.

Outlines

00:00

🤖 AI's Intrinsic Limits and Scaling Laws

This paragraph introduces the concept of an invisible boundary that AI models cannot surpass, known as the compute optimal or compute efficient frontier. It explains how AI models' error rates decrease with increased training but eventually plateau; larger models achieve lower error rates but require more computational power. It introduces the three observed neural scaling laws, which relate error rate to compute, model size, and dataset size, and which are consistent across different model architectures. It raises the question of whether these laws are fundamental to building intelligent systems or specific to current neural network approaches, asks whether error rates can be driven to zero given unlimited data, model size, and compute, and asks why these factors relate to model performance in such a simple way.

05:02

📈 Performance Trends and Loss Functions in AI

The second paragraph delves into the performance trends observed across various scales for language models, as demonstrated by a paper released by OpenAI in 2020. It discusses how power law equations appear as straight lines on logarithmic plots, with larger exponents indicating steeper performance improvements. The paragraph also covers the training of the massive GPT-3 model, which followed the predicted trend lines well without flattening, suggesting that even larger models could improve performance further. It introduces the loss functions used to measure model accuracy, specifically L1 loss and cross-entropy loss. The discussion closes with the reason error rates cannot be driven to zero: the inherent uncertainty of natural language, known as its entropy.

10:04

🧠 Deep Dive into Neural Scaling and Entropy

Paragraph three explores how the performance of AI models scales with data, model size, and compute. It discusses the release of GPT-4 and how its performance was predicted in advance using simple power laws. It also introduces the entropy of natural language and why it prevents cross-entropy loss from reaching zero, explaining how power-law fits with a constant term are used to estimate this irreducible error for different data sources. The discussion includes empirical results from Google DeepMind's massive neural scaling experiments, which observed curvature in the compute efficient frontier and produced a model with an irreducible term representing the entropy of natural text.

15:04

🔍 Theoretical Insights into Neural Scaling

In this paragraph, the focus shifts to the theoretical underpinnings of neural scaling laws. It discusses the manifold hypothesis, which suggests that deep learning models map high-dimensional data to lower-dimensional manifolds where the position of data carries meaning. The paragraph explains how the density of training points on the manifold affects model performance and how this relates to the cross-entropy loss. It introduces the concept of 'resolution-limited scaling' and how it provides an upper bound on model performance. The theoretical predictions are compared with empirical observations from OpenAI and Google DeepMind, highlighting the agreement and discrepancies in scaling values and the intrinsic dimensions of data.

20:06

🌌 The Future of AI and Scaling Laws

The final paragraph reflects on the progress made in AI over the past five years, particularly in understanding and applying neural scaling laws. It acknowledges the predictive power of these laws in forecasting model performance and the challenges in predicting specific model behaviors. The paragraph also speculates on the future of AI, the potential for a unified theory, and the ongoing search for principles that govern intelligent systems. It concludes with a call to action for further exploration in the field and a tease of upcoming publications and resources related to AI and neural networks.

Keywords

💡Compute Optimal Frontier

The 'Compute Optimal Frontier' refers to a theoretical boundary beyond which AI models cannot improve their performance, despite increasing computational resources. In the video, this concept is used to illustrate the limitations of AI models as they scale up in size and computational power. The frontier is visualized as a line on a logarithmic plot, showing that no model can cross this line, suggesting a fundamental limit to AI performance improvements.

💡Neural Scaling Laws

Neural Scaling Laws are empirical observations that describe how the performance of AI models scales with changes in compute, model size, and dataset size. The video discusses these laws as broad trends observed across different AI models, suggesting that error rates, model size, and computational requirements follow predictable patterns. These laws help in understanding the efficiency and limitations of AI systems.

💡Error Rate

The 'Error Rate' in the context of the video is a measure of the performance of AI models, specifically how often they make mistakes in predictions or classifications. It is mentioned that as AI models are trained, their error rate generally decreases and then levels off, indicating a limit to their performance. The video uses error rate as a key metric to discuss the performance trends of AI models.

💡Model Architecture

Model architecture refers to the design and structure of AI models, including the arrangement of layers and connections within neural networks. The video suggests that while the scaling laws are not heavily dependent on model architecture, reasonable choices in architecture are necessary for optimal performance. It implies that certain architectural decisions can influence how well a model can scale and perform.

💡GPT-3

GPT-3, or Generative Pre-trained Transformer 3, is a large-scale language model developed by OpenAI. The video highlights GPT-3 as an example of a model that follows the predicted performance trends based on neural scaling laws. It required a significant amount of computational power to train and demonstrated that larger models can achieve lower error rates, aligning with the scaling laws discussed.

💡Parameter

In AI, a 'Parameter' refers to a value within a model that is learned from data and used to make predictions. The video discusses the relationship between the number of parameters in a model and its performance, suggesting that larger models with more parameters can achieve lower error rates but at a higher computational cost.

💡Cross Entropy Loss

Cross Entropy Loss is a loss function used in machine learning to measure the performance of a model during training. It is particularly used for models that predict probabilities, like language models. The video explains that the cross entropy loss is used to guide the optimization of model parameters, aiming to minimize the loss and improve model performance.

💡Intrinsic Dimension

The 'Intrinsic Dimension' of a dataset refers to the number of underlying dimensions needed to describe the data effectively. In the video, the intrinsic dimension is discussed in relation to the manifold hypothesis, suggesting that AI models map high-dimensional data to lower-dimensional manifolds where the position has meaning. The video uses the concept to explain the relationship between data set size, model performance, and the geometry of the learned manifold.

💡Manifold Hypothesis

The 'Manifold Hypothesis' posits that high-dimensional data, like images or text, can be represented on a lower-dimensional manifold where the geometry encodes information about the data. The video explains how deep learning models may work by learning the shape of this manifold, and how this hypothesis relates to the scaling laws observed in AI model performance.

💡Resolution Limited Scaling

Resolution Limited Scaling is a theoretical concept discussed in the video that suggests model performance is limited by the resolution at which the model can learn the data manifold. More data allows the model to resolve the manifold more accurately, leading to better performance. The video connects this concept to the observed scaling laws and the theoretical understanding of how AI models learn from data.

Highlights

AI models' error rates decrease with training but eventually level off, suggesting a boundary they cannot cross.

Larger AI models achieve lower error rates but require more computational power.

The 'compute optimal' frontier is a theoretical limit beyond which AI models do not improve, regardless of size or data set.

Three neural scaling laws govern the relationship between error rate, compute, model size, and data set size.

OpenAI's 2020 paper showed clear performance trends across different scales for language models.

GPT-3, a massive 175 billion parameter model, followed the predicted performance trend with remarkable accuracy.

Error rates in AI models may never reach zero due to the inherent uncertainty in natural language.

The entropy of natural language is a fundamental limit to the predictability of language models.

GPT-4, released in 2023, followed the predicted performance scaling trends despite a lack of technical details in its report.

Neural scaling laws have been observed to hold across an incredible range of scales, roughly 13 orders of magnitude: from 10⁻⁸ to over 200,000 petaflop-days.

The manifold hypothesis suggests that deep learning models map high-dimensional data to lower-dimensional manifolds.

The geometry of the learned manifold often encodes information about the data, which is critical for model performance.

Theoretical work suggests that model performance scales following a power law due to the resolution of high-dimensional data manifolds.

Empirical results from OpenAI and Google DeepMind support the idea of resolution-limited scaling.

The observed scaling exponent implies an intrinsic dimension of natural language of around 42, though direct estimates from learned manifolds are closer to 100; the gap matters for understanding model performance.

Despite the predictive power of neural scaling laws, specific model behaviors like word unscrambling and reasoning abilities remain elusive.

The pursuit of a unified theory of AI is ongoing, with neural scaling laws providing a foundation for further exploration.

Transcripts

00:00
AI models can't cross this boundary, and we don't know why. As we train an AI model, its error rate generally drops off quickly and then levels off. If we train a larger model, it will achieve a lower error rate, but requires more compute. Scaling to larger and larger models, we end up with a family of curves like this. Switching our axes to logarithmic scales, a clear trend emerges where no model can cross this line, known as the compute optimal or compute efficient frontier. This trend is one of three neural scaling laws that have been broadly observed: error rate scales in a very similar way with compute, model size, and dataset size, and remarkably doesn't depend much on model architecture or other algorithmic details, as long as reasonably good choices are made. The interesting question from here is: have we discovered some fundamental law of nature, like an ideal gas law for building intelligent systems, or is this a transient result of the specific neural-network-driven approach to AI that we're taking right now? How powerful can these models become if we continue increasing the amount of data, model size, and compute? Can we drive errors to zero, or will performance level off? Why are data, model size, and compute the fundamental limits of the systems we're building, and why are they connected to model performance in such a simple way? 2020 was a watershed year for OpenAI.

01:22
In January, the team released this paper, where they showed very clear performance trends across a broad range of scales for language models. The team fit a power law equation to each set of results, giving a precise estimate for how performance scales with compute, dataset size, and model size. On logarithmic plots, these power law equations show up as straight lines, and the slope of each line is equal to the exponent of the fit equation; larger exponents make for steeper lines and more rapid performance improvements. The team observed no signs of deviation from these trends on the upper end, foreshadowing OpenAI's strategy for the year. The largest model the team tested at the time had 1.5 billion learnable parameters and required around 10 petaflop-days of compute to train. A petaflop-day is the number of computations a system capable of one quadrillion floating-point operations per second can perform in a day. The top-of-the-line GPU at the time, the Nvidia V100, is capable of around 30 teraflops, so a system with 33 of these $10,000 GPUs would deliver around a petaflop of compute.
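
As a quick sanity check, here is that petaflop-day arithmetic as a minimal Python sketch (the 30 TFLOP/s V100 figure and the 33-GPU cluster are the round numbers quoted above):

```python
# A petaflop-day: 1e15 floating-point operations per second, sustained for a day.
PETAFLOP_DAY = 1e15 * 60 * 60 * 24     # = 8.64e19 operations

v100_flops = 30e12                     # ~30 TFLOP/s per Nvidia V100
n_gpus = 33
cluster_flops = n_gpus * v100_flops    # ~0.99e15 FLOP/s, i.e. about one petaflop

# At that rate, a 10 petaflop-day training run takes roughly 10 days:
training_ops = 10 * PETAFLOP_DAY
days = training_ops / (cluster_flops * 60 * 60 * 24)
print(f"{days:.1f} days")              # prints 10.1 days
```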

02:28
That summer, the team's empirically predicted gains would be realized with the release of GPT-3. The OpenAI team had placed a massive bet on scale, partnering with Microsoft on a huge supercomputer equipped with not 33 but 10,000 V100 GPUs, and training the absolutely massive 175 billion parameter GPT-3 model using 3,640 petaflop-days of compute. GPT-3's performance followed the trend line predicted in January remarkably well, but also didn't flatten out, indicating that even larger models would further improve performance. If the massive GPT-3 hadn't reached the limits of neural scaling, where were they? Is it possible to drive error rates to zero, given sufficient compute, data, and model size? In an October publication, the OpenAI team took a deeper look at scaling. The team found the same clear scaling laws across a range of problems, including image and video modeling. They also found that on a number of these other problems, the scaling trends did eventually flatten out before reaching zero error. This makes sense if we consider exactly what these error rates are measuring.

play03:32

are measuring large language models like

play03:35

gpt3 are Auto regressive they are

play03:37

trained to predict the next word or word

play03:39

fragment in sequences of text as a

play03:41

function of the words that come before

play03:44

these predictions generally take the

play03:45

form of vectors of probabilities so for

play03:48

a given sequence of input words a

play03:50

language model will output a vector of

play03:51

values between 0o and one where each

play03:54

entry corresponds to the probability of

play03:56

a specific word in its

play03:58

vocabulary these vectors are typically

play04:00

normalized using a soft Max operation

play04:03

which ensures that all the probabilities

play04:04

add up to one gpt3 has vocabulary size

play04:08

at

play04:09

50257 so if we input a sequence of text

play04:12

like Einstein's first name is the model

play04:15

will return a vector of length

play04:17

50257 and we expect this Vector to be

play04:19

close to zero everywhere except at the

play04:22

index that corresponds to the word

play04:23

Albert this is index

play04:25

42590 in case you're wondering during
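
A minimal sketch of that output step in Python, using random numbers as a stand-in for the model's raw logits (the vocabulary size and the "Albert" index are the ones quoted above):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())  # subtract max for numerical stability
    return z / z.sum()

vocab_size = 50257                     # GPT-3's vocabulary size
rng = np.random.default_rng(0)
logits = rng.normal(size=vocab_size)   # stand-in for the model's raw outputs
probs = softmax(logits)

print(probs.sum())    # sums to 1 (up to float precision): a valid distribution
print(probs[42590])   # the probability assigned to "Albert" in the example
```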

04:28
During training, we know what the next word is in the text that we're training on, so we can compute an error, or loss value, that measures how well our model is doing relative to what we know the word should be. This loss value is incredibly important, because it guides the optimization, or learning, of the model's parameters: all those petaflops of training are performed to bring this loss number down. There are a bunch of different ways we could measure the loss. In our Einstein example, we know that the correct output vector should have a one at index 42,590, so we could define our loss value as 1 minus the probability returned by the model at this index. If our model was 100% confident the answer was "Albert" and returned a one, our loss would be zero, which makes sense. If our model returned a value of 0.9, our loss would be 0.1; if the model returned a value of 0.8, our loss would be 0.2; and so on. This formulation is equivalent to what's called an L1 loss, which works well in a number of machine learning problems. However, in practice we've found that models often perform better when using a different loss function formulation, called the cross-entropy.

05:36
The theoretical motivation of cross-entropy is a bit complicated, but the implementation is simple: all we have to do is take the negative natural logarithm of the probability output by the model at the index of the correct answer. So to compute our loss in the Einstein example, we just take the negative log of the probability output by the model at index 42,590. If our model is 100% confident, then our cross-entropy loss equals the negative natural logarithm of one, or zero, which makes sense and matches our L1 loss. If our model is 90% confident of the correct answer, our cross-entropy loss equals the negative natural log of 0.9, or about 0.1, again close to our L1 loss. Plotting our cross-entropy loss as a function of the model's output probability, we see that loss grows slowly and then shoots up as the model's probability of the correct word approaches zero. This means that if the model's confidence in the correct answer is very low, the cross-entropy loss will be very high. The model performance shown on the y-axis in all the scaling figures we've looked at so far is this cross-entropy loss, averaged over the examples in the model's test set. The more confident the model is about the correct next word in the test set, the closer to zero the average cross-entropy becomes.
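
A small numerical sketch of how the two losses compare, reproducing the numbers above and showing cross-entropy blowing up as the probability approaches zero:

```python
import numpy as np

# Model's probability for the correct word, from confident to badly wrong.
p = np.array([1.0, 0.9, 0.8, 0.5, 0.1, 0.01])

l1 = 1.0 - p      # the "1 minus probability" (L1) formulation
ce = -np.log(p)   # cross-entropy: negative natural log of the probability

for pi, a, b in zip(p, l1, ce):
    print(f"p={pi:4.2f}  L1={a:4.2f}  cross-entropy={b:5.2f}")
# p=0.90 -> L1=0.10, cross-entropy=0.11   (the two roughly agree near p=1)
# p=0.01 -> L1=0.99, cross-entropy=4.61   (cross-entropy shoots up as p -> 0)
```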

06:50
Now, the reason it makes sense that the OpenAI team saw some of their loss curves level off instead of reaching zero is that predicting the next element in sequences like this generally does not have a single correct answer. The sequence "Einstein's first name is" has a very unambiguous next word, but this is not the case for most text. A large part of GPT-3's training data comes from text scraped from the internet. If we search for a phrase like "a neural network is a", we'll find many different next words from various sources. None of these words are wrong; there are just many different ways to explain what a neural network is. This fundamental uncertainty is called the entropy of natural language. The best we can hope for from our language models is that they give high probabilities to a realistic set of next-word choices, and remarkably, this is what large language models do. For example, here are the top five choices for Meta's Llama model.

07:46
So we can never drive the cross-entropy loss to zero, but how close can we get? Can we compute or estimate the value of the entropy of natural language? By fitting power law models that include a constant irreducible error term to their loss curves, the OpenAI team was able to estimate the natural entropy of low-resolution images, videos, and other data sources. For each problem, they estimated the natural entropy of the data in two ways: once by looking at where the model size scaling curve levels off, and again by looking at where the compute curve levels off, and they found that these separate estimates agreed very well. Note that the scaling power laws still work in these cases, but by adding this constant term, our trend line, or frontier, on a log-log plot is no longer a straight line. Interestingly, the team was not able to detect any flattening out of performance on language data, noting that "unfortunately, even with data from the largest language models, we cannot yet obtain a meaningful estimate for the entropy of natural language." Eighteen months later, the Google DeepMind team published a set of massive neural scaling experiments, where they did observe some curvature in the compute efficient frontier on natural language data. They used their results to fit a neural scaling law that broke the overall loss into three terms: one that scales with model size, one with dataset size, and finally an irreducible term that represents the entropy of natural text. These empirical results imply that even an infinitely large model trained on infinite data cannot have an average cross-entropy loss on the MassiveText dataset of less than 1.69.
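
For concreteness, here is that three-term law as a short Python sketch. The constants are the fitted values reported in DeepMind's Chinchilla paper (Hoffmann et al., 2022), quoted from memory, so treat them as approximate:

```python
# Three-term neural scaling law: loss as a function of model size N (parameters)
# and dataset size D (tokens).
E = 1.69                 # irreducible term: estimated entropy of natural text
A, alpha = 406.4, 0.34   # model-size term (approximate fitted values)
B, beta  = 410.7, 0.28   # dataset-size term (approximate fitted values)

def loss(N, D):
    return E + A / N**alpha + B / D**beta

# Even as N and D go to infinity, loss can never drop below E = 1.69.
print(loss(70e9, 1.4e12))   # ~1.94 at roughly Chinchilla's training scale
```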

09:22
A year later, on Pi Day 2023, the OpenAI team released GPT-4. Despite running for a hundred pages, the GPT-4 technical report contains almost no technical information about the model itself; the OpenAI team did not share this information, citing the competitive landscape and safety implications. However, the paper does include two scaling plots. The cost of training GPT-4 is enormous, reportedly well over $100 million. Before making this massive investment, the team predicted how performance would scale using the same simple power laws, fitting this curve to the results of much smaller experiments. Note that this uses a linear, not logarithmic, y-axis scale, exaggerating the curvature of the scaling; if we map this curve to a logarithmic scale, we see some curvature, but overall a close match to the other scaling plots we've seen. What's incredible here is how accurately the OpenAI team was able to predict the performance of GPT-4, even at this massive scale. While GPT-3's training required an already enormous 3,640 petaflop-days, some leaked information on GPT-4's training puts the training compute at over 200,000 petaflop-days, reportedly requiring 25,000 Nvidia A100 GPUs running for over 3 months. All of this means that neural scaling laws appear to hold across an incredible range of scales, something like 13 orders of magnitude: from the 10⁻⁸ petaflop-days reported in OpenAI's first 2020 publication to the leaked value of over 200,000 petaflop-days for training GPT-4.
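
The leaked figure is easy to sanity-check. A back-of-the-envelope sketch, assuming an A100's roughly 312 TFLOP/s peak and a guessed utilization fraction (both are assumptions, not from the video):

```python
# Rough check of the leaked GPT-4 number: 25,000 A100s for ~3 months.
a100_flops = 312e12      # ~peak bf16 throughput of one A100 (assumption)
n_gpus = 25_000
days = 90
utilization = 0.30       # assumed fraction of peak actually sustained

sustained = a100_flops * n_gpus * utilization   # cluster-wide FLOP/s
petaflop_days = sustained / 1e15 * days         # sustained petaflops times days
print(f"{petaflop_days:,.0f} petaflop-days")    # ~210,600: the leak's ballpark
```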

10:58
This brings us back to the question: why does AI model performance follow such simple laws in the first place? Why are data, model size, and compute the fundamental limits of the systems we're building, and why are they connected to model performance in such a simple way? The deep learning theory we need to answer questions like this is generally far behind deep learning practice, but some recent work does make a compelling case for why model performance scales following a power law, by arguing that deep learning models effectively use data to resolve a high-dimensional data manifold.

really getting your head around these

play11:31

theories can be tricky it's often best

play11:33

to build up intuition step by step to

play11:36

build up your intuition on llms and a

play11:38

huge range of other topics check out

play11:40

this video sponsor brilliant when trying

play11:42

to get my own head around theories like

play11:44

neural scaling I start with the papers

play11:46

but this only gets me so far I almost

play11:49

always code something up so I can

play11:51

experiment and see what's really going

play11:52

on brilliant does this for you in an

play11:55

amazing way allowing you to jump right

play11:57

to the powerful learning by doing part

play12:00

they have thousands of interactive

play12:01

lessons covering math programming data

play12:03

analysis and AI brilliant helps you

play12:05

build up your intuition through solving

play12:07

real problems this is such a critical

play12:10

piece of learning for me a few minutes

play12:12

from now you'll see an animation of a

play12:13

neural network learning a

play12:14

low-dimensional representation of the

play12:16

Imus data set solving small versions of

play12:19

big problems like this is an amazing

play12:21

intuition builder for me brilliant

play12:23

packages up this style of learning into

play12:25

a format you can make progress on in

play12:26

just minutes a day you'll be amazed at

play12:28

the progress you can stack up with

play12:30

consistent effort brilliant has an

play12:32

entire course on large language models

play12:34

including lessons that take you deeper

play12:36

into topics we covered earlier

play12:38

predicting the next word and calculating

play12:39

word probabilities to try the brilliant

play12:42

llm course and everything else they have

play12:44

to offer for free for 30 days visit

play12:46

brilliant.org Welch laabs or click the

play12:49

link in this video's description using

play12:51

this link you'll also get 20% off an

play12:53

annual premium subscription to brilliant

play12:56

big thank you to brilliant for

play12:57

sponsoring this video now back to neural

play12:59

scaling there's this idea in machine

play13:01

learning that the data sets our models

play13:03

learn from exist on manifolds in

play13:06

high-dimensional space we can think of

play13:08

natural data like images or text as

play13:11

points in this High dimensional space in

play13:13

the Imus data set of hand written images

play13:15

for example each image is composed of a

play13:18

grid of 28x 28 pixels and the intensity

play13:21

of each pixel is stored as a number

play13:22

between zero and one if we imagine that

play13:25

our images only have two pixels for a

play13:27

moment we can visualize these two pixel

play13:29

images as points in 2D space where the

play13:32

intensity value of the first pixel is

play13:33

the x coordinate and the intensity value

play13:35

of the second pixel is the y coordinate

play13:38

an image made of two white pixels would

play13:40

fall at 0 0 in our 2D space an image

play13:43

with a black pixel in the first position

play13:45

and a white pixel in the second position

play13:47

would fall at one Z and an image with a

play13:49

gray value of 0.4 for both pixels would

play13:52

fall at 0.4 comma 0.4 and so on if our

play13:55

images had three pixels instead of two

play13:58

the same approach still works just in

play14:00

three dimensions scaling up to our 28x

play14:03

28 mnist images our images become points

play14:06

in 784 dimensional space the vast
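
A minimal sketch of this images-as-points idea in Python (the arrays here are stand-ins, not real MNIST data):

```python
import numpy as np

# Two-pixel "images" as points in 2D: (first pixel, second pixel).
two_white = np.array([0.0, 0.0])          # falls at (0, 0)
black_then_white = np.array([1.0, 0.0])   # falls at (1, 0)
both_gray = np.array([0.4, 0.4])          # falls at (0.4, 0.4)

# A 28x28 MNIST image is the same idea in 784 dimensions:
image = np.zeros((28, 28))     # stand-in for a real handwritten digit
point = image.reshape(-1)      # flatten: one point in R^784
print(point.shape)             # (784,)
```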

14:09
The vast majority of points in this high-dimensional space are not handwritten digits. We can see this by randomly choosing points in the space and displaying them as images: these almost always just look like random noise. You would have to get really, really, really lucky to randomly sample a handwritten digit. This sparsity suggests that there may be some lower-dimensional shape embedded in this 784-dimensional space, where every point in or on this shape is a valid handwritten digit. Going back to our toy three-pixel images for a moment: if we learned that our third pixel intensity value, let's call it x3, was always just equal to 1 plus the cosine of our second pixel value x2, then all of our three-pixel images would lie on the curved surface in our 3D space defined by x3 = 1 + cos(x2). This surface is two-dimensional: we can capture the location of our images in 3D space using just x1 and x2; we no longer need x3.
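
The same toy manifold as a short sketch: three-dimensional points that are fully located by two coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 1, size=1000)
x2 = rng.uniform(0, 1, size=1000)
x3 = 1 + np.cos(x2)              # the third pixel is determined by the second

images = np.stack([x1, x2, x3], axis=1)  # 1000 points in 3D space...
coords = images[:, :2]                   # ...that live on a 2D surface: (x1, x2) suffice
```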

15:06
We can think of a neural network that learns to classify MNIST as working in a similar way. In this network architecture, for example, our second-to-last layer has 16 neurons, meaning that the network has mapped the 784-dimensional input space to a much lower 16-dimensional space, very much like our 1-plus-cosine function mapped our three-dimensional space to a lower two-dimensional space. Where the manifold hypothesis gets really interesting is that the manifold is not just a lower-dimensional representation of the data: the geometry of the manifold often encodes information about the data. If we take the 16-dimensional representation of the MNIST dataset learned by our neural network, we can get a sense of its geometry by projecting from 16 dimensions down to two using a technique like UMAP, which attempts to preserve the structure of the higher-dimensional space. Coloring each point using the number that the image corresponds to, we can see that as the network trains, effectively learning the shape of the manifold, instances of the same digit are grouped together into little neighborhoods on the manifold.
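
A sketch of that projection step, using random stand-ins for the (N, 16) array of second-to-last-layer activations and the digit labels (both hypothetical here):

```python
import numpy as np
import umap                      # pip install umap-learn
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 16))      # stand-in for learned activations
labels = rng.integers(0, 10, size=1000)     # stand-in for digit labels

embedding = umap.UMAP(n_components=2).fit_transform(features)  # (1000, 2)

plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10", s=2)
plt.colorbar(label="digit")
plt.show()
```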

16:08
This is a common phenomenon across many machine learning problems: images showing similar objects, or text referring to similar concepts, end up close to each other on the learned manifold. One way to make sense of what deep learning models are doing is that they map high-dimensional input spaces to lower-dimensional manifolds, where the position of data on the manifold is meaningful. Now, what does the manifold hypothesis have to do with neural scaling laws? Let's consider the neural scaling law that links the size of the training dataset with the performance of the model, measured as the cross-entropy loss on the test set. If the manifold hypothesis is true, then our training data are points on some manifold in higher-dimensional space, and our model attempts to learn the shape of this manifold. The density of our training points on our manifold depends on how much data we have, but also on the dimension of the manifold.

also on the dimension of the manifold in

play16:59

onedimensional space if we have D

play17:01

training data points and the overall

play17:03

length of our manifold is L we can

play17:05

compute the average distance between our

play17:07

training points s by dividing L by D

play17:10

note that instead of thinking about the

play17:12

distance between our training points

play17:13

directly it's easier when we get to

play17:15

higher Dimensions to think about a

play17:16

little neighborhood around each point of

play17:18

size as and since these little

play17:19

neighborhoods bump up against each other

play17:22

the distance between our data points is

play17:23

still just s moving to two Dimensions

play17:25

we're now effectively filling up an L by

play17:27

L square with small squares of side

play17:30

length s centered around each training

play17:31

point the total area of our large Square

play17:34

l^ s must equal our number of data

play17:36

points D * the area of each little

play17:39

square so D * s^ 2 rearranging and

play17:42

solving we can show that s is equal to l

play17:45

* D Theus 12 moving to three dimensions

play17:49

we're now packing an L by L by L cube

play17:51

with d cubes of side length s equating

play17:54

the volumes of our D small cubes and our

play17:56

large Cube we can show that s is equ Al

play17:59

to L * D Theus 1/3 so as we move to

play18:02

higher Dimensions the average distance

play18:04

between points scales as the amount of

play18:06

data we have to the power of minus1 over

play18:09

the dimension of the

18:11
Now, the reason we care about the density of the training points on our manifold is that when a testing point comes along, its error will be bounded by a function of its distance to the nearest training point. If we assume that our model is powerful enough to perfectly fit the training data, then our learned manifold will match the true data manifold exactly at our training points. A deep neural network using ReLU activation functions is able to linearly interpolate between these training points to make predictions. If we assume that our manifolds are smooth, then we can use a Taylor expansion to show that our error will scale as the distance between our nearest training and testing points, squared. We established that our average distance between training points scales as the size of our dataset D to the power of minus 1 over the dimension of our manifold, so we can square this term to get an estimate for how our error scales with dataset size: D to the power of minus 2 over the manifold dimension. Finally, remember that our models are using a cross-entropy loss function, but thus far in our manifold analysis we've only considered the distance between the predicted and true value; this is equivalent to the L1 loss we considered earlier. Applying a similar Taylor expansion to the cross-entropy function, we can show that the cross-entropy loss will scale as the distance between the predicted and true value, squared. So, for our final theoretical result, we expect the cross-entropy loss to scale as the dataset size D to the power of minus 2 over the manifold dimension, squared: so D to the power of minus 4 over d.

19:41
This represents the worst-case error, making it an upper bound: we expect cross-entropy loss to scale proportionally to, or better than, this term. The team that developed this theory calls this "resolution-limited scaling," because more data is allowing the model to better resolve the data manifold. Interestingly, when considering the relationship between model size and loss, the theory predicts the same fourth-power relationship; in this case, the idea is that the additional model parameters are allowing the model to fit the data manifold at higher resolution.

resolution so how does this theoretical

play20:16

result stack up against observation both

play20:19

the open aai and Google deepmind teams

play20:21

published their fit scaling values do

play20:24

these match what theory predicts in the

play20:27

January 2020 open AI paper the team

play20:30

observed the cross entropy loss scaling

play20:32

as the size of the data set to the power

play20:34

of minus

play20:36

0.095 they refer to this value as Alpha

play20:39

subd if the theory is correct then Alpha

play20:42

subd should be greater than or equal to

play20:43

4 over the intrinsic dimension of the

play20:46

data this final step is tricky since it

play20:49

requires estimating the dimension of the

play20:51

data manifold also known as the

play20:53

intrinsic dimension of natural language

play20:56

the team started with smaller problems

play20:58

where the intrinsic Dimension is known

play21:00

or can be estimated well they found

play21:02

quite good agreement between theoretical

play21:04

and experimental scaling parameters in

play21:06

cases where synthetic training data of

play21:07

known intrinsic Dimension is created by

play21:09

a teacher model and learned by a student

play21:11

model they were also able to show that

play21:14

the minus 4 overd prediction holds up

play21:15

well with smaller scale image data sets

play21:18

including

play21:19

imist finally turning to language if we

play21:22

plug in the observed scaling exponent of

play21:24

minus

play21:25

0.095 we can compute that the intrinsic

play21:27

dimension of natural language should be

play21:29

something like 42 the team tested this
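
That last step is a one-line inversion of the bound:

```python
alpha_d = 0.095      # OpenAI's fitted dataset-size exponent
d = 4 / alpha_d      # invert the bound alpha_d >= 4/d, taken at equality
print(round(d, 1))   # 42.1: the implied intrinsic dimension of language
```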

21:32
The team tested this result by estimating the intrinsic dimension of the manifolds learned by a language model, and found the intrinsic dimension to be significantly higher, on the order of 100. Note that the inequality from theory still holds, but we don't see nearly the same agreement that was observed on synthetic and smaller datasets. What we're left with, then, is a compelling theory with some real predictive power, but definitely no unified theory of AI just yet.

21:58
We've seen some astounding AI progress in the last five years, from OpenAI's first scaling paper in early 2020 to the release of GPT-4 in 2023; neural scaling laws showed us a path to better and better performance. It's important to note here that while scaling laws have been incredibly predictive of next-word prediction performance, predicting the presence of specific model behaviors has remained more elusive: abilities on tasks like word unscrambling, arithmetic, and multi-step reasoning seem to just pop into existence at various scales. It's incredible to see how far our neural-network-powered approach has taken us, and we of course don't know how far it can go. Many of the authors of the papers we've covered here have backgrounds in physics, and you can feel in their approaches and language that they're on the hunt for unifying principles; it's exciting to see this mindset applied to AI. Neural scaling laws are a powerful example of unification in AI, delivering astoundingly accurate and useful empirical results, and tantalizing clues to a unified theory of scaling for intelligent systems. It will be fascinating to see where scaling laws and other theories can take us in the next five years, and to see if we can figure out whether AI really can't cross this line.

line if you enjoy Welch lab's videos I

play23:18

really think you'll like my book on

play23:19

imaginary numbers it's coming out later

play23:22

this year way back in 2016 I made a

play23:24

massive 13-part YouTube series on

play23:26

imaginary numbers it's such an

play23:28

incredible topic I released an early

play23:30

version of this book back then and I'm

play23:32

now in the process of revising

play23:34

correcting and significantly expanding

play23:35

it my goal is to create the best book

play23:38

out there on imaginary numbers

play23:40

highquality hardcover printed books will

play23:42

start shipping later this year you can

play23:44

pre-order a copy today at the link in

play23:46

the description below and your order

play23:47

includes a free PDF copy of the 2016

play23:50

version that you can download today I've

play23:52

also been working on some new poster

play23:54

designs I now have a dark mode version

play23:56

of my activation Atlas poster

play23:59

these are an incredible way to visualize

play24:01

the data manifolds learned by Vision

play24:03

models you'll find all of this and more

play24:05

at the Welch Labs store
