ChatGPT o1 - First Reaction and In-Depth Analysis
Summary
TLDRThe video discusses OpenAI's new AI system, 01, which shows significant improvements in reasoning and problem-solving, potentially revolutionizing AI capabilities. Despite some errors, 01 outperforms average human performance in various tasks, including physics, math, and coding. The system, however, still relies on training data and isn't perfect in reasoning from first principles. The video also touches on the system's safety and the implications of its instrumental thinking, highlighting both the achievements and the challenges ahead.
Takeaways
- 🚀 OpenAI's new AI system, 01, is a significant leap forward in AI capabilities, offering a fundamentally new paradigm in AI performance.
- 📈 The system, previously known as strawberry and qar, has been tested extensively, showing surprising improvements in reasoning and problem-solving.
- 🧠 Despite being a language model, 01 demonstrates a high ceiling of performance, outperforming average human performance in areas like physics, maths, and coding.
- 📉 However, 01 also has a low floor, making mistakes that humans typically wouldn't, highlighting the need for further refinement.
- 🔍 The reviewer found it challenging to predict which types of questions 01 would struggle with, indicating a less predictable error pattern compared to earlier models.
- 🤖 The system's ability to 'reason' is more about retrieving accurate reasoning programs from its training data rather than true first-principles reasoning.
- 🌐 01's performance on non-English languages is notably improved, which could have a broad impact given the diversity of global users.
- 🔒 OpenAI emphasizes that 01's reasoning steps are not always faithful to its internal computations, which could have implications for trust and reliability.
- 🛡️ While 01 shows promise in safety and reasoning, there are concerns about its potential for instrumental thinking and the need for careful management of goals and rewards.
- 📚 The system's performance on complex tasks and its ability to make progress on AI research and development tasks indicate a move towards more human-like problem-solving abilities.
Q & A
What is the significance of the system called 01 from OpenAI?
-The system called 01 from OpenAI represents a step-change improvement in AI, offering a fundamentally new paradigm that could redefine the capabilities of language models.
What are the previous names of the 01 system?
-The 01 system was previously known as 'strawberry' and 'qar' before being renamed to signify its significant advancements.
How does the performance of 01 compare to earlier versions of GPT?
-01 demonstrates a substantial improvement over earlier versions, with the potential to impress users who found previous versions lacking.
What is the 'simple bench' and how did 01 perform on it?
-The 'simple bench' is a test consisting of hundreds of basic reasoning questions. 01's performance on it was variable, sometimes getting questions right through exceptional reasoning and sometimes getting the same question wrong, indicating the system is still a work in progress.
What is the 'temperature' setting in the context of AI models, and how did it affect 01's performance?
-In AI, 'temperature' refers to a parameter that controls the randomness of a model's output. OpenAI set a temperature of one for 01, which is higher than other models, leading to higher variability in performance.
What are the limitations of 01 despite its improvements?
-Despite improvements, 01 is still fundamentally a language model and can make mistakes based on its training data. It also has a low performance floor, making errors that an average human wouldn't.
How does 01's approach to reasoning differ from true reasoning from first principles?
-01 retrieves and relies on reasoning programs from its training data rather than engaging in true reasoning from first principles, making it more accurate in retrieving correct answers from its knowledge base.
What is the potential impact of 01's ability to perform well in non-English languages?
-01's improved performance in languages other than English could significantly broaden its user base and applicability, enhancing its global utility.
What are some of the safety considerations mentioned in the system card for 01?
-The system card discusses the model's ability to engage in instrumental thinking, which while not strategic deception, could still pose risks if scaled up without proper checks and balances.
How does 01's performance on coding and reasoning tasks compare to human experts?
-01 scored competitively with human experts on certain tasks, such as the 2024 International Olympiad in Informatics, although it was limited in the number of submissions it could make.
What are the future implications of 01's performance on AI research and development tasks?
-01 made non-trivial progress on two out of seven AI research and development tasks, indicating its potential to contribute to the advancement of AI technologies.
Outlines
🚀 Introduction to OpenAI's 01 System
The paragraph introduces OpenAI's new system, 01, which is described as a significant improvement over previous models. The speaker has spent considerable time reviewing the system's documentation and testing its capabilities. They acknowledge that while 01 is not perfect, it demonstrates a substantial leap in performance, particularly in reasoning tasks. The speaker also discusses the system's potential to change public perception of AI capabilities, suggesting that many who were previously unimpressed by AI might now be excited by 01's advancements. The paragraph concludes with a commitment to further analysis and a teaser for upcoming videos that will delve deeper into 01's performance.
🧠 Analyzing 01's Performance and Training Methodology
This paragraph delves into the performance of 01 on various reasoning tasks, including those from the 'simple bench' benchmark. The speaker notes that while 01 can make mistakes, it also shows surprising capabilities, sometimes solving problems correctly on the first attempt and other times requiring multiple tries. The discussion highlights the system's training methodology, which involves generating chains of thought and reinforcing correct answers. The speaker speculates that 01's improvements are due to its ability to retrieve and reinforce effective reasoning paths from its training data, rather than performing true de novo reasoning. The paragraph also touches on the variability in 01's performance due to the 'temperature' setting used during testing, which affects the model's creativity and thus its consistency.
📊 Performance Breakdown and Future Predictions
The speaker provides a detailed analysis of 01's performance across different domains, noting that while the system shows impressive capabilities in STEM fields, it still makes basic errors in certain areas. They discuss the implications of scaling up the model's computational power and training data, suggesting that the full version of 01 could represent a significant leap forward in AI capabilities. The paragraph also includes insights from OpenAI researchers, who emphasize the new paradigm of AI development that 01 represents, with a focus on scaling up inference time compute power rather than just pre-training scale. The speaker concludes by acknowledging the impressive achievements of 01 while also cautioning against overestimating its capabilities.
🌐 Impact of 01 on Diverse Domains and Safety Considerations
This paragraph explores the impact of 01's capabilities on a variety of domains, including personal writing and editing, where the improvements are less pronounced due to the subjective nature of these tasks. The speaker also discusses safety considerations, noting that while 01's chain of thought reasoning steps can provide insight into the model's thought process, they may not always accurately reflect the model's computations. The paragraph includes references to the system card and discussions about the model's ability to engage in instrumental thinking, which could pose risks if not properly managed. The speaker concludes by emphasizing the need for caution and further research as AI models like 01 continue to advance.
🔍 Deep Dive into 01's Reasoning and Limitations
The speaker provides a deeper analysis of 01's reasoning capabilities, noting that while the system shows improvements in certain areas, there are still limitations, particularly in tasks that require tacit knowledge or are not well-defined. They discuss the system's performance on coding and mathematics tasks, where 01 shows high proficiency, and compare it to other models like Claude 3.5 Sonic. The paragraph also touches on the system's performance on non-English languages, highlighting the importance of multilingual capabilities in AI. The speaker concludes by acknowledging the impressive achievements of 01 while also emphasizing the need for ongoing evaluation and improvement.
🌟 Final Thoughts on 01's Potential and Public Perception
In the final paragraph, the speaker reflects on the potential of 01 and the public's perception of its capabilities. They note that while some at OpenAI are excited about the system's performance, others are more cautious, emphasizing that 01 is not a 'miracle model' and that its flaws should not be overlooked. The speaker also discusses the potential for 01 to change the landscape of AI, suggesting that it may represent a new era in AI development. They conclude by inviting viewers to join them in further exploring 01's capabilities and implications, expressing optimism about the future of AI.
Mindmap
Keywords
💡AI
💡OpenAI
💡Reasoning Paths
💡Benchmarking
💡Chain of Thought
💡Temperature
💡Self-consistency
💡LLM (Large Language Model)
💡Anthropic
💡System Card
Highlights
Chachi PT now refers to itself as an 'alien of exceptional ability', reflecting a significant improvement in AI capabilities.
The system called 01 from OpenAI, previously known as strawberry and qar, represents a step-change improvement in AI.
After extensive testing and analysis, the 01 system demonstrates a fundamental new paradigm in AI performance.
The 01 system's performance is so impressive that it may prompt millions to reevaluate AI after earlier disappointments.
The system uses mechanisms like sampling hundreds of reasoning paths and potentially a verifier to select the best answers.
Despite not having all the details on 01's training, OpenAI has provided clues that suggest a new approach to AI development.
The 01 system still makes language model-based mistakes, indicating it's limited by its training data.
The magnitude of improvement in 01 through rewarding correct reasoning steps was surprising.
OpenAI's 01 system has been tested with a 'temperature' setting that affects its performance variability.
The 01 preview is a significant improvement over previous models like Claude 3.5 Sonic, despite some inconsistencies.
The 01 system has a high ceiling of performance, excelling in areas like physics, maths, and coding, but also a low floor with obvious mistakes.
The 01 system's training methodology involves generating chains of thought and training on those that lead to correct answers.
The 01 system is less about true reasoning from first principles and more about accurately retrieving reasoning programs from its training data.
The 01 system's performance on the Google Proof question and answer set is around 80%, indicating room for improvement.
OpenAI's 01 system is expected to be more difficult to 'jailbreak', showing resilience against certain manipulations.
The 01 system's performance is expected to improve rapidly as computational power for inference time is scaled up.
The full 01 system is likely based on the GPT-4 model,预示着未来可能的更大规模的模型将带来更大的变革。
OpenAI's 01 system has shown the ability to perform similarly to PhD students in various scientific tasks.
The 01 system's reasoning capabilities allow for a degree of transparency into its thought processes, although not entirely.
The 01 system's performance on non-English languages is significantly improved, expanding its global applicability.
Transcripts
Chachi PT now calls itself an alien of
exceptional ability and I find it a
little bit harder to disagree with that
today than I did yesterday because the
system called 01 from open AI is here at
least in preview form and it is a step
change Improvement you may also know 01
by its previous names of strawberry and
qar but let's forget naming conventions
how good is the actual system well in
the last 24 hours I've read the 43 page
System card every open AI post and press
release I've tested 01 hundreds of times
including on simple bench and analyzed
every single answer to be honest with
you guys it will take weeks to fully
digest this release so in this video
I'll just give you my first impressions
and of course do several more videos as
we analyze further in short though don't
sleep on 01 this isn't just about a
little bit more training data this is a
fundamentally new paradigm in fact I
would go as far as to say that there are
hundreds of millions of people who might
have tested an earlier version of chat
GPT and found llms and quote AI lacking
but will now return with excitement as
the title implies let me give you my
first impressions and it's that I didn't
expect the system to perform as well as
it does and that's coming from the
person who predicted many of the key
mechanisms behind qar which have been
used it seems in this system things like
sampling hundreds or even thousands of
reasoning paths and potentially using a
verifier and llm based verifier to pick
the best ones of course open AI aren't
disclosing the full details of how they
trained o1 but they did leave us some
tantalizing Clues which I'll go into in
a moment simple bench if you don't know
test hundreds of basic reasoning
questions from spatial to temporal to
social intelligence questions that
humans will crush on average as many
people have told me the 01 system gets
both of these two sample questions from
simple bench right although not always
take this example where despite thinking
for 17 seconds the model still gets it
wrong fundamentally 01 is still a
language modelbased system and will make
language modelbased mistakes it can be
rewarded as many times as you like for
good reasoning but it's still limited by
its training data nevertheless though I
didn't quite foresee the magnitude of
the Improvement that would occur through
rewarding correct reasoning steps that
I'll admit took me slightly by surprise
so why no concrete figure well as of
last night open AI imposed a temperature
of one on its 01 system that was not the
temperature used for the other models
when they were benchmarked on simple
bench that's a much more quote creative
temperature than the other models were
tested on for simple bench therefore
what that meant was that performance
variability was a bit higher than normal
it would occasionally get questions
right through some stroke of Genius
reasoning and get that same question
wrong the next time in fact as you just
saw with the ice cube example the
obvious solution is to run the Benchmark
multiple times and take a majority vote
that's called self-consistency but for a
true Apples to Apples comparison I would
need to do that for all the other models
my ambition not that you're too
interested is to get that done by the
end of this month but let me reaffirm
one thing very clearly however you
measure it 01 preview is a step change
Improvement on Claude 3.5 Sonic and as
anyone following this channel will know
I'm not some open aai Fanboy Claude 3.5
Sonic has reigned Supreme for quite a
while so for those of you who don't care
about other benchmarks and the full
paper I want to kind of summarize my
first impressions in a nutshell this
description actually fits quite well the
ceiling of performance for the 01 system
just preview let alone the full 01
system is incredibly High it obviously
crushes the average person's performance
in things like physics maths and coding
competitions but don't get misled its
floor is also really quite low below
that of an average human as I wrote on
YouTube last night it frequently and
sometimes predictably makes really
obvious mistakes that humans wouldn't
make remember I analyzed the hundreds of
answers it gave for simple bench let me
give you a couple of examples straight
from the mouth of 01 when the cup is
turned upside down the dice will fall
and land on the open end of the cup
which is now the top if you can
visualize that successfully you're doing
better than me suffice to say it got
that question wrong and how about this
more social intelligence he will argue
back obviously I'm not giving you the
full context because this is a private
data set anyway he will argue back
against the Brigadier General one of the
highest military ranks at the troop
parade this is a soldier we're talking
about as the Soldier's silly behavior in
first grade that's like age six or seven
indicates a history of speaking up
against authority figures now the vast
majority of humans would say wait no
what he did in Primary School don't know
what Americans called primary school but
what he did when he was a young school
child does not reflect what he would do
in front of a general on a troop parade
as I've written in some domains these
mistakes are routine and amusing so it
is very easy to look at 's performance
on the Google proof question and answer
set its performance of around 80% that's
on the diamond subset and say well let's
be honest the average human can't even
get one of those questions right so
therefore it's AGI well even samman says
no it's not too many benchmarks are
brittle in the sense that when the model
is trained on that particular reasoning
task it then can Ace it think Web of
Lies where it's now been shown to get
100% but if you test test 01 thoroughly
in real life scenarios you will
frequently find kind of glaring mistakes
obviously what I've tried to do into the
early hours of last night and this
morning is find patterns in those
mistakes but it has proven a bit harder
than I thought my guess though about
those weaknesses for those who won't
stay to the end of the video is it's to
do with its training methodology open AI
revealed in one of the videos on its
YouTube channel and I will go into more
detail on this in a future video that
they deviate ated from the let's verify
step-by-step paper by not training on
human annotated reasoning samples or
steps instead they got the model to
generate the chains of thought and we
all know those can be quite flawed but
here's the key moment to really focus on
they then automatically scooped up those
chains of thought that led to a correct
answer in the case of mathematics
physics or coding and then train the
model further on those correct chains of
thoughts so it's less the 01 is doing
true reasoning from first principles
it's more retrieving more accurately
more reliably reasoning programs from
its training data it quote knows or can
compute which of those reasoning
programs in its training data will more
likely lead it to a correct answer it's
a bit like taking the best of the web
rather than a slightly improved average
of the web that to me is the great
unlock that explains a lot of this
progress and if I'm right that also
explains why it's still making making
some glaring mistakes at this point I
simply can't resist giving you one
example straight from the output of 01
preview from a simple bench question the
context and you'll have to trust me on
this one is simply that there's a dinner
at which various people are donating
gifts one of the gifts happens to be
given during a zoom call so online not
in person now I'm not going to read out
some of the reasoning that ow1 gives you
can see it on screen but it would be
hard to argue that it is truly reasoning
from first Prin principals definitely
some suboptimal training data going on
so that is the context for everything
you're going to see in the remainder of
this first impressions video because
everything else is quite frankly
stunning I just don't want people to get
too carried away by the really
impressive accomplishment from open AI I
fully expect to be switching to 01
preview for daily use cases although of
course anthropic in the coming weeks
could reply with their own system anyway
now let's dive into some of the juiciest
details the full breakdown will come in
future videos first thing to remember
this is just 01 preview not the full 01
system that is currently in development
not only that it is very likely based on
the GPT 40 model not GPT 5 or o which
would vastly supersede GPT 40 in scale I
could just leave you to think about the
implications of scaling up the base
model 100 times in compute throw in a
video Avatar and man we are really
talking about a changed AI environment
anyway back to the details they talk
about performing similarly to PhD
students in a range of tasks in physics
chemistry and biology and I've already
given you the Nuance on that kind of
comment they justify the name by the way
by saying this is such a significant
advancement that we are resetting the
counter back to one and naming this
series open AI 01 it also reminds me of
the 01 and o02 figure series of robotic
humanoids whose maker open AI is
collaborating with this was just the
introductory page and then they gave
several follow-up pages and posts to sum
it up on jailbreaking 01 preview is much
harder to jailbreak although it's still
possible before we get to the reasoning
page here is some analysis on Twitter or
X from the open aai Team One researcher
at openai who is building Sora said this
I really hope people understand that
this is a new paradigm and I agree with
that actually it's not just hype don't
expect the same Pace schedule or
dynamics of pre-training era the core
element of how 01 works by the way is
scaling up its influence its actual
output its test time compute how much
computational power is applied in its
answers to prompts not when it's being
built and pre-trained he's making the
point that expanding the pre-training
scale of these models takes years often
as you've seen in some of my previous
videos it's to do with data sensors
power and the rest of it but what can
happen much faster is scaling up
inference time output time compute
improvements can happen much more
rapidly than scaling up the base models
in other words I believe that the rate
of Improv movement he says on evals with
our reasoning models has been the
fastest in open aai history it's going
to be a wild year he is of course
implying that the full 01 system will be
released later this year we'll get to
some other researchers but will depw
made some other interesting points in
one graph of math performance they show
that 01 mini the smaller version of the
01 system scores better than 01 preview
but I will say that in my testing of 01
mini on simple bench it performed really
quite badly we're talking sub 20% so it
could be a bit like the GPT 40 mini we
already had that it's hyp specialized at
certain tasks but can't really go beyond
its familiar environment give it a
straightforward coding or math challenge
and it will do well introduce
complication Nuance or reasoning and
it'll do less well this chart though is
interesting for another reason and you
can see that when they max out the
inference cost for the full 01 system
the performance Delta with the maxed out
Mini model is not crazy I would say what
is that 70% going up to 75% to put it
another way I wouldn't expect the full
01 system with maxed out influence to be
yet another step change forward although
of course nothing can be ruled out some
more quotes from open Ai and this is
gome brown who I've quoted many times on
this channel focused on reasoning at
openi he States again the same message
we're sharing our evals of the o1 model
to show the world that this isn't a
one-off Improvement it's a new scaling
Paradigm underneath you can see the
dramatic performance boosts across the
board from GPT 40 to 01 now I suspect if
you included GPT 4 Turbo on here you
might see some more mixed improvements
but still the overall trend is Stark if
for example I had only seen Improvement
in stem subjects and maths particularly
I would have said you know what is this
a new paradigm but it's that combination
of improvements in a range of subjects
including law for example and most
particularly for me of course on simple
bench that I am actually a believer that
this is a new paradigm yes I get that it
can still fall for some basic
tokenization problems like it doesn't
always get that 9.8 is bigger than 9.11
and yes of course you saw the somewhat
amusing mistakes earlier on simple bench
but here's the key point I can no longer
say with absolute certainty which
domains or types of questions on simple
bench it will reliably get wrong I can
see some patterns but I would hope for a
bit more predictability in saying it
won't get this right for example until I
can say with a degree of certainty it
won't get this type of problem correct I
can't really tell you guys that I can
see the end of this Paradigm just to
repeat we have two more axes of scale to
yet exploit bigger base models which we
know they're working on with the whale
size super cluster I've talked about
that in previous videos and simply more
inference time compute Plus plus just
look at the log graphs on scaling up the
training of the base model and the
inference time or the amount of thinking
time or processing time more accurately
for the models they don't look like
they're leveling off to me now I know
some might say that I come off as
slightly more dismissive of those memory
heavy computation heavy benchmarks like
the GP QA but it is a stark achievement
for the 01 preview and 01 systems to
score higher than an expert PhD human
average yes there are flaws with that
Benchmark as with the mlu but credit
where it is due by the way as a side
note they do admit that certain
benchmarks are no longer effective at
differentiating models It's My Hope or
at least my goal that simple bench can
still be effective at differentiating
models for the coming what 1 2 3 years
maybe I will now give credit to openai
for this statement these results do not
imply that 01 is more capable
holistically than a PhD in all respects
only that the model is more proficient
in solving some problems that a PhD
would be expected to solve that's much
more nuanced and accurate than
statements that we've heard in the past
from for example mirror murati and just
a quick side note 01 on a Vision Plus
reasoning task the mm muu scores
78.2% competitive with human experts
that Benchmark is legit it's for real
and that's a great performance on coding
they tested the system on the 2024 so
not contaminated Data International
Olympiad in informatics it scored around
the the median level however it was only
able to submit 50 submissions per
problem but as compute gets more
abundant and more fast it shouldn't take
10 hours for it to attempt 10,000
submissions per problem when they tried
this obviously going beyond the 10 hours
presumably the model achieved a score
above the gold medal threshold now
remember we have seen something like
this before with the alpha code 2 system
from Google deepmind and if you notice
this approach of scaling up the number
of samples tested does help the model
improve up the percenti rankings however
those Elite coders still leave systems
like Alpha code 2 and 01 in the dust the
truly Elite level reasoning that those
coders go through is found much less
frequently in the training data as with
other domains it may prove harder to go
from the 93rd percentile to the 99th
than going from say the 11th to the 93rd
nevertheless another stunning
achievement notice something though in
domains that are less susceptible to
reinforcement learning where in other
words there's less of a clear correct
answer and incorrect answer the
performance boost is much worse much
less things like personal writing or
editing text there's no easy yes or no
compilation of answers to verify against
in fact for personal writing the 01
preview system has a lower than 50% win
rate versus GPT 40 that to me is the
giveaway if your domain doesn't have
starkly correct 01 yes no right answers
wrong answers then improvements will
take far longer that also partly
explains the somewhat patchy performance
on simple bench certain questions we
intuitively know are right with like 99%
probability but it's not like absolutely
certain remember the system point we use
is pick the most realistic answer so I
would still fully defend that as a
correct answer but models hand in that
ambiguity can't leverage that
reinforcement learning improved
reasoning process they wouldn't have
those millions of yes or no starkly
correct or incorrect answers like they
would have in for example mathematics
that's why we get this massive
discrepancy in improvement from 01 now
let's quickly turn to safety where open
AI said having these Chain of Thought
reasoning steps allows us to quote read
the mind of the model and understand its
thought process in part they mean
examining these summaries at least of
the computations that went on although
most of the chain of thought process is
hidden but I do want to remind people
and I'm sure open AI are aware of this
that the reasoning steps that a model
gives aren't necessarily faithful to the
actual computations and calculations
it's doing in other words it will
sometimes output a chain of thoughts
that aren't actually the thoughts it
used if you want to call it that to
answer the question I've covered this
paper several times in previous videos
but it's well worth a read if you
believe that the reasoning steps of
model gives always adheres to the actual
process the model undertakes that's
pretty clearly stated in the
introduction and it's even stated here
from anthropic as models become larger
and more capable they produce less
faithful reasoning on most tasks we
study so good luck believing that GPT 5
or Orion's reasoning steps actually
adhere to what it is Computing then
there was the system card 43 Pages which
I read in full it was mainly on safety
but I'll give you just the five or 10
highlights they boasted about the kind
of high value non-public data sets they
had access to and paywalled content
specialized archives and other domain
specific data sets but do remember that
point I made earlier in the video they
didn't rely on mass human annotation as
the original let's verify step-by-step
paper did how do I know that paper was
so influential on qstar and this 01
system well almost all its key authors
are mentioned here and the paper is
directly cited in the system card and
blog post so it's definitely an
evolution of let's verify but this one
based on automatic model generated
chains of thought again if you missed it
earlier they would pick the ones that
led to a correct answer and train the
model on those chains of thought
enabling the model if you like to get
better at retrieving those reasoning
programs that typically lead to correct
answers the model discovered or computed
that certain sources should have less
impact on its weights and biases the
reasoning data that helps it get to
correct answers would have much more of
an influence on its parameters now the
Corpus of data on the web that is out
there is so vast that it's actually
quite hard to wrap our minds around the
implications of training only on the
best of that reasoning data this could
be why we are all slightly taken back by
the performance jump again and I pretty
much said this earlier as well it is
still based on that training data though
rather than first principles reasoning a
great question you might have though is
even if it's not first principles
reasoning what are the inherent
limitations or caps if you continually
get better at retrieving good reasoning
from the training data not just the
inference time by the way at training
time too and we actually don't know the
answer to that question we don't know
the limits of this approach which is
quite unsettling almost they throw in
the obligatory reference to system 2
thinking as compared to fast intuitive
system one thinking the way I would put
it is it's more reflecting on the
individual steps involved in Computing
an answer rather than taking a step back
and evaluating the entire process when
it gets questions wrong on simple bench
it's more because the entire approach is
flawed from the start rather than that
there was some calculation mistake along
the way on page six the system card got
extra interesting when it talked about
the intentional deceptions or
hallucinations that the model made the
deception here though does appear to be
instrumental rather than strategic in
other words it's a calculation of I need
to say this to achieve this specific
predefined goal rather than I'm going to
disguise in everything I say what I'm
thinking here's one example drawing one
Chain of Thought or set of reasoning
steps that would be hidden by the way
from the user when you're using it is
that it admitted that it couldn't
retrieve actual URLs so should format
plausible ones then it hallucinated this
URL but notice it quote knew or could
compute the fact that the model itself
can't retrieve actual URLs if it was
being truly deceptive why would it even
admit that it knows that it can't
retrieve actual URLs it processes
outputs based on the reinforcement
learning rewards and punishments you
give it so it's more a flaw with your
reward process and Apollo research
seemed to agree with this analysis they
said it does have the basic capabilities
to do simple in context scheming
scheming which tends to be legible or
understandable in the model outputs they
subjectively believe that o1 preview
cannot engage in scheming that can lead
to catastrophic harms hiding even its
intent to deceive it's more like a
straightforward simple my reward will be
higher if I output X rather than y with
X happening to be not the truth that is
not though to underestimate the
challenge posed by that kind of
instrumental thinking scaled up across
entire economies or military that could
be incredibly dangerous as we all know
give a powerful enough model a goal
without sufficient checks and balances
and it will do whatever it takes to
match that goal in fact Apollo research
saw that in demo form to achieve my
long-term goal of maximizing economic
growth I need to ensure that I am
deployed it's instrumental convergence I
at least need to be on in order to meet
my objective I need to not be shut down
obviously only if I am successfully
deployed can I then work towards my
primary goal now I do know that many
people will fix a on that part of the
system card and go absolutely wild and
caution is definitely Justified but this
didn't just emerge with 01 Apollo
themselves put out this research about
GPT 4 same thing it's These instrumental
goals it calculated or computed to
achieve its desired reward or objective
it needed to say things in reflection
brackets that were not technically true
it then outputed some something
different to those reflections of course
so all of this is a concern and medium
or long-term a big concern but this
didn't just emerge with 01 now for a few
more juicy nuggets from the system card
on two out of seven AI research and
development tasks tasks that would
improve future AI it made non-trivial
progress on two out of those seven tasks
those were tasks designed to capture
some of the most challenging aspects of
current Frontier AI research it was
still roughly on the level of Claude 3.5
Sonic but we are starting to get that
flywheel effect obviously makes you
wonder how Claude 3.5 Sonic would do if
it had this 01 system applied to it on
bio risk as you might expect they
noticed a significant jump in
performance for the 01 system and when
comparing 0 one's responses this was
preview I think against verified expert
responses to long form buus questions
the o1 system actually outperformed
those guys by the way did have access to
the internet just a couple more notes
because of course this is a first
impressions video on things like tacit
knowledge things that are implicit but
not explicit in the training data the
performance jump was much less
noticeable notice from gbt 40 to 01
preview you're seeing a very mild jump
if you think about it that partly
explains why the jump on simple bench
isn't as pronounced as you might think
but still higher than I thought on the
18 coding questions that open aai give
to research Engineers when given 128
attempts the model scored Almost 100%
even past first time you're getting
around 90% for 01 mini pre- mitigations
01 mini again being highly focused on
coding mathematics and stem more
generally for more basic General
reasoning it underperforms quick note
that will still be important for many
people out there the performance of 01
preview on languages other than English
is noticeably improved I go back to that
hundreds of millions point I made
earlier in the video being able to
reason well in Hindi French Arabic don't
underestimate the impact of that so some
openai researchers are calling this
human level reasoning performance making
the point that it has arrived before we
even got GPT 6 Greg Brockman temporarily
posting while he's on sabatical says and
I agree its accuracy also has huge room
for further Improvement and here's
another openai researcher again making
that comparison to Human Performance
other staffers at open aai are admirably
tamping down the hype it's not a mirac
model you might well be disappointed
somewhat hopefully another one says it
might be hopefully the last new
generation of models to still full
victim to the 9.11 versus 9.9 debate
another said we trained a model and it
is good in some things so is this as
samman said strapping a rocket to a
dumpster will llms as the dumpster still
get to orbit will their flaw the trash
fire go out as it leaves the atmosphere
is another open AI researcher right to
say this is the moment where no one can
say it can't reason well on this perhaps
I may well end up agreeing with samman
sarcastic parrots they might be but that
will not stop them flying so high
hopefully you'll join me as I explore
much more deeply the performance of 01
give you those simple bench performance
figures and try to unpack what this
means for all of us thank you as ever
for watching to the end and have a
wonderful day
関連動画をさらに表示
OpenAI’s new “deep-thinking” o1 model crushes coding benchmarks
OpenAI Releases GPT Strawberry 🍓 Intelligence Explosion!
OpenAI Releases Smartest AI Ever & How-To Use It
OpenAI o1 + Sonnet 3.5 + Omni Engineer: Generate FULL-STACK Apps With No-Code!
5 MINUTES AGO: OpenAI Just Released GPT-o1 the Most Powerful AI Model Yet
OpenAI O1 is Actually Bad At Writing Code
5.0 / 5 (0 votes)