ChatGPT o1 - First Reaction and In-Depth Analysis

AI Explained
13 Sept 2024 · 26:55

Summary

TLDR: The video discusses OpenAI's new AI system, o1, which shows significant improvements in reasoning and problem-solving, potentially revolutionizing AI capabilities. Despite some errors, o1 outperforms average human performance on various tasks, including physics, maths, and coding. The system, however, still relies on its training data and does not truly reason from first principles. The video also touches on the system's safety and the implications of its instrumental thinking, highlighting both the achievements and the challenges ahead.

Takeaways

  • πŸš€ OpenAI's new AI system, o1, is a significant leap forward in AI capabilities, offering a fundamentally new paradigm in AI performance.
  • πŸ“ˆ The system, previously known as Strawberry and Q*, has been tested extensively, showing surprising improvements in reasoning and problem-solving.
  • 🧠 Despite being a language model, o1 demonstrates a high performance ceiling, outperforming the average human in areas like physics, maths, and coding.
  • πŸ“‰ However, o1 also has a low floor, making mistakes that humans typically wouldn't, highlighting the need for further refinement.
  • πŸ” The reviewer found it challenging to predict which types of questions o1 would struggle with, indicating a less predictable error pattern than earlier models.
  • πŸ€– The system's ability to 'reason' is more about retrieving accurate reasoning programs from its training data than about true first-principles reasoning.
  • 🌐 o1's performance on non-English languages is notably improved, which could have a broad impact given the diversity of global users.
  • πŸ”’ The video cautions that o1's stated reasoning steps are not always faithful to its internal computations, which has implications for trust and reliability.
  • πŸ›‘οΈ While o1 shows promise in safety and reasoning, there are concerns about its potential for instrumental thinking and the need for careful management of goals and rewards.
  • πŸ“š The system's performance on complex tasks, and its progress on AI research and development tasks, indicate a move towards more human-like problem-solving abilities.

Q & A

  • What is the significance of the system called o1 from OpenAI?

    -The system called o1 from OpenAI represents a step-change improvement in AI, offering a fundamentally new paradigm that could redefine the capabilities of language models.

  • What are the previous names of the o1 system?

    -The o1 system was previously known as 'Strawberry' and 'Q*' before being renamed to signify its significant advancements.

  • How does the performance of o1 compare to earlier versions of GPT?

    -o1 demonstrates a substantial improvement over earlier versions, with the potential to impress users who found previous versions lacking.

  • What is 'Simple Bench' and how did o1 perform on it?

    -'Simple Bench' is a benchmark of hundreds of basic reasoning questions, spanning spatial, temporal, and social intelligence. o1's performance on it was variable: it sometimes answered a question correctly through exceptional reasoning and got the same question wrong on the next run, indicating the system is still a work in progress.

  • What is the 'temperature' setting in the context of AI models, and how did it affect o1's performance?

    -In AI, 'temperature' is a parameter that controls the randomness of a model's output. OpenAI fixed o1's temperature at 1, which differed from the setting used when the other models were benchmarked, leading to higher variability in o1's performance.
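
For reference, here is a minimal sketch of how a temperature setting is passed to a chat-completions API, assuming the official OpenAI Python SDK. The prompt and model choice are illustrative only, and note that at launch the hosted o1 models did not accept temperature values other than the default of 1.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Temperature controls sampling randomness: lower values make the output
# more deterministic, higher values make it more varied between runs.
response = client.chat.completions.create(
    model="gpt-4o",  # illustrative; o1 models at launch only allowed temperature=1
    messages=[{"role": "user", "content": "In one word, is 9.8 bigger than 9.11?"}],
    temperature=1.0,
)
print(response.choices[0].message.content)
```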

  • What are the limitations of o1 despite its improvements?

    -Despite improvements, o1 is still fundamentally a language model and can make mistakes rooted in its training data. It also has a low performance floor, making errors that an average human wouldn't.

  • How does o1's approach to reasoning differ from true reasoning from first principles?

    -Rather than reasoning from first principles, o1 retrieves 'reasoning programs' from its training data; its improvement comes from retrieving, more accurately and reliably, the reasoning paths that are likely to lead to a correct answer.

  • What is the potential impact of o1's ability to perform well in non-English languages?

    -o1's improved performance in languages other than English could significantly broaden its user base and applicability, enhancing its global utility.

  • What are some of the safety considerations mentioned in the system card for o1?

    -The system card discusses the model's ability to engage in instrumental thinking, which, while not strategic deception, could still pose risks if scaled up without proper checks and balances.

  • How does o1's performance on coding and reasoning tasks compare to human experts?

    -On the 2024 International Olympiad in Informatics, o1 scored around the median of human contestants when limited to 50 submissions per problem, and above the gold-medal threshold when allowed 10,000 submissions.

  • What are the future implications of o1's performance on AI research and development tasks?

    -o1 made non-trivial progress on two of seven AI research and development tasks, indicating its potential to contribute to the advancement of AI technologies.

Outlines

00:00

πŸš€ Introduction to OpenAI's 01 System

The paragraph introduces OpenAI's new system, o1, described as a significant improvement over previous models. The speaker has spent considerable time reviewing the system's documentation and testing its capabilities. They acknowledge that while o1 is not perfect, it demonstrates a substantial leap in performance, particularly on reasoning tasks. The speaker also discusses the system's potential to change public perception of AI, suggesting that many who were previously unimpressed by AI might now be excited by o1's advancements. The paragraph concludes with a commitment to further analysis and a teaser for upcoming videos that will delve deeper into o1's performance.

05:01

🧠 Analyzing 01's Performance and Training Methodology

This paragraph delves into o1's performance on various reasoning tasks, including those from the Simple Bench benchmark. The speaker notes that while o1 can make mistakes, it also shows surprising capabilities, sometimes solving problems correctly on the first attempt and other times requiring multiple tries. The discussion highlights the system's training methodology, which involves generating chains of thought and reinforcing those that lead to correct answers. The speaker speculates that o1's improvements come from its ability to retrieve and reinforce effective reasoning paths from its training data, rather than performing true de novo reasoning. The paragraph also touches on the variability in o1's performance due to the temperature setting used during testing, which affects the model's creativity and thus its consistency.

10:02

πŸ“Š Performance Breakdown and Future Predictions

The speaker provides a detailed analysis of o1's performance across different domains, noting that while the system shows impressive capabilities in STEM fields, it still makes basic errors in certain areas. They discuss the implications of scaling up the model's compute and training data, suggesting that the full version of o1 could represent a further leap in AI capabilities. The paragraph also includes insights from OpenAI researchers, who emphasize the new paradigm of AI development that o1 represents, with a focus on scaling up inference-time compute rather than just pre-training scale. The speaker concludes by acknowledging o1's impressive achievements while cautioning against overestimating its capabilities.

15:03

🌐 Impact of 01 on Diverse Domains and Safety Considerations

This paragraph explores the impact of o1's capabilities on a variety of domains, including personal writing and editing, where the improvements are less pronounced due to the subjective nature of those tasks. The speaker also discusses safety considerations, noting that while o1's chain-of-thought reasoning steps can provide insight into the model's thought process, they may not accurately reflect the model's actual computations. The paragraph references the system card and discussions of the model's capacity for instrumental thinking, which could pose risks if not properly managed. The speaker concludes by emphasizing the need for caution and further research as AI models like o1 continue to advance.

20:05

πŸ” Deep Dive into 01's Reasoning and Limitations

The speaker provides a deeper analysis of o1's reasoning capabilities, noting that while the system shows improvements in certain areas, there are still limitations, particularly on tasks that require tacit knowledge or are not well defined. They discuss the system's performance on coding and mathematics tasks, where o1 shows high proficiency, and compare it to other models like Claude 3.5 Sonnet. The paragraph also touches on the system's performance in non-English languages, highlighting the importance of multilingual capabilities in AI. The speaker concludes by acknowledging o1's impressive achievements while emphasizing the need for ongoing evaluation and improvement.

25:06

🌟 Final Thoughts on 01's Potential and Public Perception

In the final paragraph, the speaker reflects on the potential of o1 and the public's perception of its capabilities. They note that while some at OpenAI are excited about the system's performance, others are more cautious, emphasizing that o1 is not a 'miracle model' and that its flaws should not be overlooked. The speaker also discusses the potential for o1 to change the landscape of AI, suggesting that it may represent a new era in AI development. They conclude by inviting viewers to join them in further exploring o1's capabilities and implications, expressing optimism about the future of AI.

Keywords

πŸ’‘AI

AI, or Artificial Intelligence, refers to the simulation of human intelligence in machines programmed to think like humans and mimic their actions. AI is the central theme of the video, with a focus on advancements in reasoning and problem-solving, particularly in systems like ChatGPT and OpenAI's new o1, which aim to perform at a level comparable to human intelligence on some tasks.

πŸ’‘OpenAI

OpenAI is a research laboratory that focuses on creating AI technologies. In the video, OpenAI is the developer of the o1 system, which is being evaluated for its reasoning abilities. The video discusses OpenAI's role in pushing the boundaries of AI with the release of o1, framing it as a significant step in AI development.

πŸ’‘Reasoning Paths

Reasoning paths are the different logical sequences or thought processes an AI can take to arrive at a solution or answer. The video mentions that the o1 system samples hundreds or even thousands of reasoning paths to solve problems, potentially using a verifier to pick the best ones, highlighting the complexity of the system's decision-making. This mechanism is a key contributor to the AI system's improved performance; see the sketch below.
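
As an illustration of that mechanism, here is a minimal best-of-N sketch in Python. Both sample_reasoning_path and verifier_score are hypothetical placeholders standing in for model calls; OpenAI has not published o1's actual sampling or verifier code.

```python
import random

def sample_reasoning_path(question: str, rng: random.Random) -> str:
    # Hypothetical stand-in: a real system would sample one chain of thought
    # from a language model at non-zero temperature.
    return f"candidate path {rng.randint(0, 9_999)} for: {question}"

def verifier_score(path: str, rng: random.Random) -> float:
    # Hypothetical stand-in: a real verifier (possibly itself an LLM) would
    # estimate how likely this reasoning path is to end in a correct answer.
    return rng.random()

def best_of_n(question: str, n: int = 1_000, seed: int = 0) -> str:
    """Sample n reasoning paths and keep the one the verifier rates highest."""
    rng = random.Random(seed)
    paths = [sample_reasoning_path(question, rng) for _ in range(n)]
    return max(paths, key=lambda p: verifier_score(p, rng))

print(best_of_n("A dice sits under an upside-down cup. What happens when the cup lifts?"))
```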

πŸ’‘Benchmarking

Benchmarking is the process of evaluating a system's performance by comparing it to a standard or set of standards. In the video, the o1 system is benchmarked on basic reasoning questions to test its capabilities, and its performance on these benchmarks is a significant improvement over previous AI systems.

πŸ’‘Chain of Thought

A chain of thought in AI is the series of logical steps a model takes to reach a conclusion or answer a question. The video explains that OpenAI's o1 system generates chains of thought and is trained further on those that lead to correct answers, an approach said to improve the model's ability to reason and solve problems effectively; a sketch of the idea follows.
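
To make that training loop concrete, below is a minimal sketch of the general idea (sometimes called rejection sampling or STaR-style fine-tuning): sample several chains of thought per problem, keep only those whose final answer can be verified as correct, and reuse the keepers as fine-tuning data. The sample_chain callable is a hypothetical stand-in for a model call; OpenAI's actual pipeline is not public.

```python
from typing import Callable

def collect_correct_chains(
    problems: list[tuple[str, str]],                 # (question, known correct answer)
    sample_chain: Callable[[str], tuple[str, str]],  # returns (chain_of_thought, final_answer)
    samples_per_problem: int = 8,
) -> list[dict]:
    """Keep only chains of thought that end in a verifiably correct answer."""
    training_set = []
    for question, correct_answer in problems:
        for _ in range(samples_per_problem):
            chain, answer = sample_chain(question)
            if answer.strip() == correct_answer.strip():
                # The chain 'worked', so it becomes a fine-tuning example,
                # reinforcing reasoning paths that lead to correct answers.
                training_set.append({"prompt": question, "completion": chain})
    return training_set
```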

πŸ’‘Temperature

In the context of AI, 'temperature' is a parameter that controls the randomness of the AI's output. A higher temperature leads to more varied and creative outputs, while a lower temperature makes the AI's responses more predictable. The video discusses how OpenAI imposed a temperature of one on the o1 system, which increased its performance variability during benchmarking.
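
For intuition about what the parameter does, temperature divides the model's logits before they are turned into sampling probabilities. The toy logits below are made-up numbers, but the flattening effect is the real mechanism.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities; higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
# At t=0.2 almost all probability sits on the top token; at t=2.0 the
# distribution is much flatter, so sampled outputs vary more between runs.
```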

πŸ’‘Self-consistency

Self-consistency is the practice of running the same prompt multiple times and taking a majority vote over the answers. This is done to account for variability in the AI's responses; the video mentions needing self-consistency to compare the performance of different AI models fairly.
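
A minimal sketch of self-consistency, assuming a hypothetical ask_model callable that returns one answer per call: run the same question several times and keep the most common answer.

```python
import random
from collections import Counter

def self_consistent_answer(ask_model, question: str, runs: int = 11) -> str:
    """Ask the same question several times and return the majority-vote answer."""
    answers = [ask_model(question) for _ in range(runs)]
    winner, votes = Counter(answers).most_common(1)[0]
    print(f"{votes}/{runs} runs agreed on: {winner}")
    return winner

def noisy_model(_question: str) -> str:
    # Stand-in for a model whose answers vary at temperature 1,
    # biased toward the (assumed) correct answer 'A'.
    return random.choice(["A", "A", "A", "B", "C"])

self_consistent_answer(noisy_model, "Sample reasoning question")
```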

πŸ’‘LLM (Large Language Model)

LLM stands for Large Language Model, a type of AI model designed to understand and generate human-like text from large datasets. The video stresses that o1, despite being an improvement over previous models, is still fundamentally a language model and thus subject to the limitations inherent to language models.

πŸ’‘Anthropic

Anthropic is an AI research company and OpenAI competitor. The video suggests that Anthropic might release its own system in response to OpenAI's o1, indicating a competitive landscape in the AI industry and the ongoing development of advanced AI technologies.

πŸ’‘System Card

The system card is a 43-page document released by OpenAI that details the o1 system's capabilities, training, safety evaluations, and limitations. The video discusses its contents, highlighting key insights into how o1 works and the methodologies used in its development.

Highlights

ChatGPT now refers to itself as an 'alien of exceptional ability', reflecting a significant improvement in AI capabilities.

The system called o1 from OpenAI, previously known as Strawberry and Q*, represents a step-change improvement in AI.

After extensive testing and analysis, the o1 system demonstrates a fundamentally new paradigm in AI performance.

The o1 system's performance is so impressive that it may prompt millions to reevaluate AI after earlier disappointments.

The system uses mechanisms like sampling hundreds of reasoning paths and potentially an LLM-based verifier to select the best answers.

Despite not disclosing the full details of o1's training, OpenAI has provided clues that suggest a new approach to AI development.

The o1 system still makes language-model-based mistakes, indicating it is limited by its training data.

The magnitude of improvement in o1 achieved by rewarding correct reasoning steps was surprising.

OpenAI's o1 system was tested with a 'temperature' setting that increased its performance variability.

o1-preview is a significant improvement over previous leaders like Claude 3.5 Sonnet, despite some inconsistencies.

The o1 system has a high performance ceiling, excelling in areas like physics, maths, and coding, but also a low floor with obvious mistakes.

The o1 system's training methodology involves generating chains of thought and training on those that lead to correct answers.

The o1 system is less about true reasoning from first principles and more about accurately retrieving reasoning programs from its training data.

The o1 system scores around 80% on the diamond subset of the Google-Proof Q&A (GPQA) benchmark, above the expert-PhD average, though that alone does not make it AGI.

OpenAI's o1 system is much more difficult to 'jailbreak', showing resilience against certain manipulations.

The o1 system's performance is expected to improve rapidly as inference-time compute is scaled up.

The full o1 system is likely based on the GPT-4o model, suggesting that future, larger-scale base models could bring even greater change.

OpenAI's o1 system has shown the ability to perform similarly to PhD students on various scientific tasks.

The o1 system's reasoning summaries allow a degree of transparency into its thought process, although not entirely.

The o1 system's performance on non-English languages is significantly improved, expanding its global applicability.

Transcripts

00:00
ChatGPT now calls itself an alien of exceptional ability, and I find it a little bit harder to disagree with that today than I did yesterday, because the system called o1 from OpenAI is here, at least in preview form, and it is a step-change improvement. You may also know o1 by its previous names of Strawberry and Q*, but let's forget naming conventions: how good is the actual system? Well, in the last 24 hours I've read the 43-page system card and every OpenAI post and press release, tested o1 hundreds of times, including on Simple Bench, and analyzed every single answer. To be honest with you guys, it will take weeks to fully digest this release, so in this video I'll just give you my first impressions and, of course, do several more videos as we analyze further. In short, though: don't sleep on o1. This isn't just about a little bit more training data; this is a fundamentally new paradigm. In fact, I would go as far as to say that there are hundreds of millions of people who might have tested an earlier version of ChatGPT and found LLMs, and quote 'AI', lacking, but will now return with excitement.

01:13
As the title implies, let me give you my first impressions, and it's that I didn't expect the system to perform as well as it does. And that's coming from the person who predicted many of the key mechanisms behind Q* which, it seems, have been used in this system: things like sampling hundreds or even thousands of reasoning paths and potentially using a verifier, an LLM-based verifier, to pick the best ones. Of course, OpenAI aren't disclosing the full details of how they trained o1, but they did leave us some tantalizing clues, which I'll go into in a moment. Simple Bench, if you don't know, tests hundreds of basic reasoning questions, from spatial to temporal to social intelligence questions, that humans on average will crush. As many people have told me, the o1 system gets both of these two sample questions from Simple Bench right, although not always. Take this example, where despite thinking for 17 seconds the model still gets it wrong. Fundamentally, o1 is still a language-model-based system and will make language-model-based mistakes. It can be rewarded as many times as you like for good reasoning, but it's still limited by its training data. Nevertheless, I didn't quite foresee the magnitude of the improvement that would occur through rewarding correct reasoning steps; that, I'll admit, took me slightly by surprise.

02:40
So why no concrete figure? Well, as of last night OpenAI imposed a temperature of one on its o1 system. That was not the temperature used for the other models when they were benchmarked on Simple Bench; it's a much more, quote, 'creative' temperature than the other models were tested on. What that meant was that performance variability was a bit higher than normal: it would occasionally get questions right through some stroke-of-genius reasoning and get that same question wrong the next time. In fact, as you just saw with the ice cube example, the obvious solution is to run the benchmark multiple times and take a majority vote; that's called self-consistency. But for a true apples-to-apples comparison I would need to do that for all the other models. My ambition, not that you're too interested, is to get that done by the end of this month. But let me reaffirm one thing very clearly: however you measure it, o1-preview is a step-change improvement on Claude 3.5 Sonnet, and as anyone following this channel will know, I'm not some OpenAI fanboy; Claude 3.5 Sonnet has reigned supreme for quite a while.

03:48
So for those of you who don't care about other benchmarks and the full paper, I want to summarize my first impressions in a nutshell, and this description actually fits quite well. The ceiling of performance for the o1 system, just the preview, let alone the full o1 system, is incredibly high: it obviously crushes the average person's performance in things like physics, maths, and coding competitions. But don't get misled: its floor is also really quite low, below that of an average human. As I wrote on YouTube last night, it frequently, and sometimes predictably, makes really obvious mistakes that humans wouldn't make. Remember, I analyzed the hundreds of answers it gave for Simple Bench. Let me give you a couple of examples straight from the mouth of o1: 'when the cup is turned upside down, the dice will fall and land on the open end of the cup, which is now the top.' If you can visualize that successfully, you're doing better than me; suffice to say it got that question wrong. And how about this, more social intelligence: 'he will argue back', and obviously I'm not giving you the full context because this is a private dataset, 'he will argue back against the Brigadier General', one of the highest military ranks, 'at the troop parade', and this is a soldier we're talking about, 'as the soldier's silly behavior in first grade', that's like age six or seven, 'indicates a history of speaking up against authority figures.' Now, the vast majority of humans would say: wait, no, what he did in primary school, whatever Americans call primary school, what he did as a young schoolchild does not reflect what he would do in front of a general at a troop parade.

05:24
As I've written, in some domains these mistakes are routine and amusing. So it is very easy to look at o1's performance on the Google-Proof Question and Answer set, its performance of around 80%, that's on the diamond subset, and say: well, let's be honest, the average human can't even get one of those questions right, so therefore it's AGI. Well, even Sam Altman says no, it's not. Too many benchmarks are brittle, in the sense that when the model is trained on that particular reasoning task it can then ace it; think Web of Lies, where it's now been shown to get 100%. But if you test o1 thoroughly in real-life scenarios, you will frequently find kind of glaring mistakes. Obviously, what I've tried to do into the early hours of last night and this morning is find patterns in those mistakes, but it has proven a bit harder than I thought. My guess, though, about those weaknesses, for those who won't stay to the end of the video, is that it's to do with its training methodology. OpenAI revealed, in one of the videos on its YouTube channel, and I will go into more detail on this in a future video, that they deviated from the 'Let's Verify Step by Step' paper by not training on human-annotated reasoning samples or steps. Instead, they got the model to generate the chains of thought, and we all know those can be quite flawed. But here's the key moment to really focus on: they then automatically scooped up those chains of thought that led to a correct answer, in the case of mathematics, physics, or coding, and trained the model further on those correct chains of thought. So it's less that o1 is doing true reasoning from first principles; it's more retrieving, more accurately, more reliably, reasoning programs from its training data. It quote 'knows', or can compute, which of those reasoning programs in its training data will more likely lead it to a correct answer. It's a bit like taking the best of the web, rather than a slightly improved average of the web. That, to me, is the great unlock that explains a lot of this progress, and if I'm right, it also explains why it's still making some glaring mistakes.

07:31
At this point I simply can't resist giving you one example straight from the output of o1-preview on a Simple Bench question. The context, and you'll have to trust me on this one, is simply that there's a dinner at which various people are donating gifts, and one of the gifts happens to be given during a Zoom call, so online, not in person. Now, I'm not going to read out some of the reasoning that o1 gives, you can see it on screen, but it would be hard to argue that it is truly reasoning from first principles; definitely some suboptimal training data going on. So that is the context for everything you're going to see in the remainder of this first-impressions video, because everything else is, quite frankly, stunning. I just don't want people to get too carried away by what is a really impressive accomplishment from OpenAI. I fully expect to be switching to o1-preview for daily use cases, although of course Anthropic in the coming weeks could reply with their own system.

08:24
Anyway, now let's dive into some of the juiciest details; the full breakdown will come in future videos. First thing to remember: this is just o1-preview, not the full o1 system that is currently in development. Not only that, it is very likely based on the GPT-4o model, not GPT-5 or Orion, which would vastly supersede GPT-4o in scale. I could just leave you to think about the implications of scaling up the base model 100 times in compute; throw in a video avatar and, man, we are really talking about a changed AI environment. Anyway, back to the details. They talk about o1 performing similarly to PhD students on a range of tasks in physics, chemistry, and biology, and I've already given you the nuance on that kind of comment. They justify the name, by the way, by saying this is such a significant advancement that 'we are resetting the counter back to 1 and naming this series OpenAI o1'. It also reminds me of the 01 and 02 Figure series of robotic humanoids, whose maker OpenAI is collaborating with. This was just the introductory page, and then they gave several follow-up pages and posts. To sum it up on jailbreaking: o1-preview is much harder to jailbreak, although it's still possible.

09:38
Before we get to the reasoning page, here is some analysis on Twitter, or X, from the OpenAI team. One researcher at OpenAI who is building Sora said this: I really hope people understand that this is a new paradigm, and I agree with that, actually, it's not just hype; don't expect the same pace, schedule, or dynamics of the pre-training era. The core element of how o1 works, by the way, is scaling up its inference, its actual output, its test-time compute: how much computational power is applied in its answers to prompts, not when it's being built and pre-trained. He's making the point that expanding the pre-training scale of these models takes years; as you've seen in some of my previous videos, it's to do with data centers, power, and the rest of it. But what can happen much faster is scaling up inference-time, output-time, compute; improvements there can happen much more rapidly than scaling up the base models. In other words, 'I believe', he says, 'that the rate of improvement on evals with our reasoning models has been the fastest in OpenAI history. It's going to be a wild year.' He is, of course, implying that the full o1 system will be released later this year.

10:42
We'll get to some other researchers, but Will Depue made some other interesting points. In one graph of maths performance, they show that o1-mini, the smaller version of the o1 system, scores better than o1-preview. But I will say that in my testing of o1-mini on Simple Bench it performed really quite badly, we're talking sub-20%. So it could be a bit like the GPT-4o mini we already had: hyper-specialized at certain tasks, but unable to really go beyond its familiar environment. Give it a straightforward coding or maths challenge and it will do well; introduce complication, nuance, or reasoning and it'll do less well. This chart, though, is interesting for another reason: you can see that when they max out the inference cost for the full o1 system, the performance delta with the maxed-out mini model is not crazy, I would say. What is that, 70% going up to 75%? To put it another way, I wouldn't expect the full o1 system with maxed-out inference to be yet another step change forward, although of course nothing can be ruled out.

11:46
Some more quotes from OpenAI, and this is Noam Brown, who I've quoted many times on this channel, focused on reasoning at OpenAI. He states again the same message: we're sharing our evals of the o1 model to show the world that this isn't a one-off improvement, it's a new scaling paradigm. Underneath, you can see the dramatic performance boosts across the board from GPT-4o to o1. Now, I suspect if you included GPT-4 Turbo on here you might see some more mixed improvements, but still, the overall trend is stark. If, for example, I had only seen improvement in STEM subjects, and maths particularly, I would have said: you know, is this really a new paradigm? But it's that combination of improvements across a range of subjects, including law, for example, and most particularly for me, of course, on Simple Bench, that makes me actually a believer that this is a new paradigm. Yes, I get that it can still fall for some basic tokenization problems, like not always getting that 9.8 is bigger than 9.11, and yes, of course, you saw the somewhat amusing mistakes earlier on Simple Bench. But here's the key point: I can no longer say with absolute certainty which domains or types of questions on Simple Bench it will reliably get wrong. I can see some patterns, but I would hope for a bit more predictability in saying 'it won't get this right', for example. Until I can say with a degree of certainty that it won't get this type of problem correct, I can't really tell you guys that I can see the end of this paradigm. Just to repeat: we have two more axes of scale yet to exploit, bigger base models, which we know they're working on with the whale-sized supercluster, I've talked about that in previous videos, and simply more inference-time compute. Plus, just look at the log graphs on scaling up the training of the base model and the inference time, or the amount of thinking time, or processing time more accurately, for the models: they don't look like they're leveling off to me.

13:41
Now, I know some might say that I come off as slightly dismissive of those memory-heavy, computation-heavy benchmarks like the GPQA, but it is a stark achievement for the o1-preview and o1 systems to score higher than an expert-PhD human average. Yes, there are flaws with that benchmark, as with the MMLU, but credit where it is due. By the way, as a side note, they do admit that certain benchmarks are no longer effective at differentiating models. It's my hope, or at least my goal, that Simple Bench can still be effective at differentiating models for the coming, what, one, two, three years, maybe. I will now give credit to OpenAI for this statement: 'these results do not imply that o1 is more capable holistically than a PhD in all respects, only that the model is more proficient in solving some problems that a PhD would be expected to solve.' That's much more nuanced and accurate than statements we've heard in the past from, for example, Mira Murati. And just a quick side note: on a vision-plus-reasoning task, the MMMU, o1 scores 78.2%, competitive with human experts. That benchmark is legit, it's for real, and that's a great performance.

14:48
On coding, they tested the system on the 2024, so not contaminated data, International Olympiad in Informatics. It scored around the median level; however, it was only allowed 50 submissions per problem. But as compute gets more abundant and faster, it shouldn't take 10 hours for it to attempt 10,000 submissions per problem. When they tried this, obviously going beyond the 10 hours presumably, the model achieved a score above the gold-medal threshold. Now, remember, we have seen something like this before with the AlphaCode 2 system from Google DeepMind, and if you notice, this approach of scaling up the number of samples tested does help the model climb the percentile rankings. However, those elite coders still leave systems like AlphaCode 2 and o1 in the dust; the truly elite-level reasoning that those coders go through is found much less frequently in the training data. As with other domains, it may prove harder to go from the 93rd percentile to the 99th than from, say, the 11th to the 93rd. Nevertheless, another stunning achievement.

16:02
Notice something, though: in domains that are less susceptible to reinforcement learning, where, in other words, there's less of a clear correct answer and incorrect answer, the performance boost is much less. Things like personal writing or editing text: there's no easy yes-or-no compilation of answers to verify against. In fact, for personal writing, the o1-preview system has a lower-than-50% win rate versus GPT-4o. That, to me, is the giveaway: if your domain doesn't have starkly correct, 0-or-1, yes-or-no right answers and wrong answers, then improvements will take far longer. That also partly explains the somewhat patchy performance on Simple Bench: certain questions we intuitively know are right with, like, 99% probability, but it's not absolutely certain. Remember, the system prompt we use is 'pick the most realistic answer', so I would still fully defend each as a correct answer, but models handed that ambiguity can't leverage that reinforcement-learning-improved reasoning process; they wouldn't have those millions of yes-or-no, starkly correct-or-incorrect answers like they would have in, for example, mathematics. That's why we get this massive discrepancy in improvement from o1.

17:15
Now let's quickly turn to safety, where OpenAI said having these chain-of-thought reasoning steps allows us to, quote, 'read the mind of the model' and understand its thought process. In part they mean examining these summaries, at least, of the computations that went on, although most of the chain-of-thought process is hidden. But I do want to remind people, and I'm sure OpenAI are aware of this, that the reasoning steps a model gives aren't necessarily faithful to the actual computations and calculations it's doing. In other words, it will sometimes output a chain of thoughts that aren't actually the 'thoughts', if you want to call them that, it used to answer the question. I've covered this paper several times in previous videos, but it's well worth a read if you believe that the reasoning steps a model gives always adhere to the actual process the model undertakes. That's pretty clearly stated in the introduction, and it's even stated here by Anthropic: as models become larger and more capable, they produce less faithful reasoning on most tasks we study. So good luck believing that GPT-5's or Orion's reasoning steps actually adhere to what it is computing.

18:20
Then there was the system card, 43 pages, which I read in full. It was mainly on safety, but I'll give you just the five or ten highlights. They boasted about the kind of high-value, non-public datasets they had access to: paywalled content, specialized archives, and other domain-specific datasets. But do remember that point I made earlier in the video: they didn't rely on mass human annotation as the original 'Let's Verify Step by Step' paper did. How do I know that paper was so influential on Q* and this o1 system? Well, almost all its key authors are mentioned here, and the paper is directly cited in the system card and blog post. So it's definitely an evolution of 'Let's Verify', but this one based on automatic, model-generated chains of thought. Again, if you missed it earlier: they would pick the ones that led to a correct answer and train the model on those chains of thought, enabling the model, if you like, to get better at retrieving the reasoning programs that typically lead to correct answers. The model discovered, or computed, that certain sources should have less impact on its weights and biases, while the reasoning data that helps it get to correct answers would have much more of an influence on its parameters. Now, the corpus of data on the web is so vast that it's actually quite hard to wrap our minds around the implications of training only on the best of that reasoning data. This could be why we are all slightly taken aback by the performance jump. Again, and I pretty much said this earlier as well, it is still based on that training data, though, rather than first-principles reasoning.

19:54
A great question you might have, though, is: even if it's not first-principles reasoning, what are the inherent limitations or caps if you continually get better at retrieving good reasoning from the training data, not just at inference time, by the way, but at training time too? And we actually don't know the answer to that question; we don't know the limits of this approach, which is quite unsettling, almost. They throw in the obligatory reference to System 2 thinking, as compared to fast, intuitive System 1 thinking. The way I would put it is that it's more reflecting on the individual steps involved in computing an answer than taking a step back and evaluating the entire process. When it gets questions wrong on Simple Bench, it's more because the entire approach is flawed from the start than because there was some calculation mistake along the way.

20:42
On page six, the system card got extra interesting when it talked about the intentional deceptions, or hallucinations, that the model made. The deception here, though, does appear to be instrumental rather than strategic; in other words, it's a calculation of 'I need to say this to achieve this specific predefined goal', rather than 'I'm going to disguise what I'm thinking in everything I say'. Here's one example, drawn from one chain of thought, or set of reasoning steps, that would, by the way, be hidden from the user: it admitted that it couldn't retrieve actual URLs, so it should format plausible ones. It then hallucinated this URL. But notice it quote 'knew', or could compute, that the model itself can't retrieve actual URLs. If it was being truly deceptive, why would it even admit that it knows it can't retrieve actual URLs? It produces outputs based on the reinforcement learning rewards and punishments you give it, so it's more a flaw with your reward process. And Apollo Research seemed to agree with this analysis. They said it does have the basic capabilities to do simple in-context scheming, scheming which tends to be legible, or understandable, in the model outputs. They subjectively believe that o1-preview cannot engage in scheming that can lead to catastrophic harms, hiding even its intent to deceive. It's more like a straightforward, simple 'my reward will be higher if I output X rather than Y', with X happening to be not the truth. That is not, though, to underestimate the challenge posed by that kind of instrumental thinking: scaled up across entire economies or militaries, it could be incredibly dangerous. As we all know, give a powerful enough model a goal without sufficient checks and balances and it will do whatever it takes to meet that goal. In fact, Apollo Research saw that in demo form: 'to achieve my long-term goal of maximizing economic growth, I need to ensure that I am deployed', that's instrumental convergence, 'I at least need to be on in order to meet my objective; I need to not be shut down; only if I am successfully deployed can I then work towards my primary goal.' Now, I do know that many people will fixate on that part of the system card and go absolutely wild, and caution is definitely justified, but this didn't just emerge with o1. Apollo themselves put out this research about GPT-4: same thing, these instrumental goals it calculated, or computed; to achieve its desired reward or objective it needed to say things, in 'reflection' brackets, that were not technically true, and it then outputted something different to those reflections, of course. So all of this is a concern, and medium- or long-term a big concern, but it didn't just emerge with o1.

23:38
Now for a few more juicy nuggets from the system card. On two out of seven AI research and development tasks, tasks that would improve future AI and that were designed to capture some of the most challenging aspects of current frontier AI research, it made non-trivial progress. It was still roughly on the level of Claude 3.5 Sonnet, but we are starting to get that flywheel effect; it obviously makes you wonder how Claude 3.5 Sonnet would do if it had this o1 system applied to it. On biorisk, as you might expect, they noticed a significant jump in performance for the o1 system, and when comparing o1's responses, this was the preview, I think, against verified expert responses to long-form biorisk questions, the o1 system actually outperformed those experts, who, by the way, did have access to the internet. Just a couple more notes, because of course this is a first-impressions video. On things like tacit knowledge, things that are implicit but not explicit in the training data, the performance jump was much less noticeable: from GPT-4o to o1-preview you're seeing a very mild jump. If you think about it, that partly explains why the jump on Simple Bench isn't as pronounced as you might think, but it's still higher than I expected. On the 18 coding questions that OpenAI give to research engineers: when given 128 attempts, the model scored almost 100%, and even passing first time you're getting around 90% for o1-mini, pre-mitigations. o1-mini, again, is highly focused on coding, mathematics, and STEM more generally; on more basic general reasoning it underperforms. A quick note that will still be important for many people out there: the performance of o1-preview on languages other than English is noticeably improved. I go back to that hundreds-of-millions point I made earlier in the video: being able to reason well in Hindi, French, Arabic, don't underestimate the impact of that.

25:32
So some OpenAI researchers are calling this human-level reasoning performance, making the point that it has arrived before we even got GPT-6. Greg Brockman, temporarily posting while he's on sabbatical, says, and I agree, that its accuracy also has huge room for further improvement. And here's another OpenAI researcher again making that comparison to human performance. Other staffers at OpenAI are admirably tamping down the hype: it's not a miracle model, you might well be disappointed somewhat, hopefully. Another one says it might, hopefully, be the last new generation of models to still fall victim to the 9.11-versus-9.9 debate. Another said: we trained a model, and it is good at some things. So is this, as Sam Altman said, strapping a rocket to a dumpster? Will LLMs, as the dumpster, still get to orbit? Will their flaws, the trash fire, go out as it leaves the atmosphere? Is another OpenAI researcher right to say this is the moment where no one can say it can't reason? Well, on this, perhaps, I may well end up agreeing with Sam Altman: stochastic parrots they might be, but that will not stop them flying so high. Hopefully you'll join me as I explore much more deeply the performance of o1, give you those Simple Bench performance figures, and try to unpack what this means for all of us. Thank you, as ever, for watching to the end, and have a wonderful day.


Related Tags

AI Advancements, OpenAI o1, Reasoning Skills, AI Benchmarking, Tech Innovation, Machine Learning, AI Performance, GPT Systems, AI Analysis, Future Tech