GPT-o1: The Best Model I've Ever Tested 🍓 I Need New Tests!
Summary
TL;DR: The video demonstrates the capabilities of OpenAI's new o1 model by testing it against a range of prompts, from writing a Tetris game in Python to answering logic puzzles and a moral question. The model handles nuanced problems better than the models the speaker tested previously. It passes nearly every challenge and shows a summary of its reasoning along the way, though it stumbles on a North Pole walking problem. The speaker appreciates how the model breaks down intricate questions and suspects that OpenAI employees may have drawn on the speaker's earlier videos. Overall, the video highlights the o1 model's advances in reasoning and problem-solving.
Takeaways
- 😀 The speaker is excited that a variation of their marble question was featured on OpenAI's website in the announcement of the new o1 model, previously referred to as 'Strawberry' or 'Q*'.
- 🤖 The o1 model performs better than previous models, thinking for less time and producing more accurate outputs, especially in tasks like coding Tetris in Python.
- 🧠 The model processes complex questions efficiently, like checking if an envelope fits size restrictions by considering rotation, demonstrating advanced problem-solving skills.
- ✔️ It answers simple logical questions, such as counting the number of killers left in a room, while factoring in subtle nuances, such as the status of a dead killer.
- 🍓 A marble question was tested where the o1 model accurately reasons that the marble would remain on the table when the cup is lifted and placed in the microwave.
- 📏 The model struggles with a tricky geographical problem involving walking from the North Pole, confirming a known limitation.
- 📊 It accurately tackles math and logic challenges, such as word counting, comparing numbers, and solving mathematical formulas.
- 🌍 For moral dilemmas, like whether to push someone to save humanity, the model provides both nuanced analysis and a direct yes or no response when prompted.
- 🐣 It concludes that the egg came before the chicken from an evolutionary standpoint, a classic problem with a clear answer based on scientific reasoning.
- 🔍 The speaker is impressed by the o1 model’s performance, noting that it solved complex tasks with high accuracy, save for one tricky geography question.
Q & A
What model is being tested in the video, and how does it perform compared to previous models?
-The model being tested is OpenAI's o1-preview, previously referred to as Strawberry or Q*. It performs exceptionally well compared to previous models, getting nearly all questions right and demonstrating an advanced level of reasoning.
What makes the o1 model's reasoning process stand out from other models?
-The o1 model excels in its ability to think through questions and provide nuanced responses. Its detailed Chain of Thought and its ability to analyze complex problems, such as distinguishing between a live and a dead killer in a scenario, set it apart from other models.
How did the o1 model handle the 'marble in a cup' question?
-The o1 model correctly reasoned that if the glass cup is turned upside down and placed on a table, the marble can remain inside the cup due to gravity and careful placement. When the cup is moved to the microwave, the marble remains on the table unless the cup is tilted or flipped.
What was the reasoning behind the model's answer to the 'killers in a room' question?
-The model reasoned that there are initially three killers in the room, and after one is killed, the person who kills becomes a new killer. It accounted for both living and dead individuals, concluding that there are still three killers (two original and one new).
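The tally described in this answer can be sketched in a few lines of Python. This is only an illustration of the counting logic, not anything the model produced; the `Person` class and the `count_killers` helper are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    is_killer: bool
    alive: bool = True


def count_killers(room, include_dead=False):
    """Count killers in the room, optionally including dead ones."""
    return sum(1 for p in room if p.is_killer and (include_dead or p.alive))


# Three killers start in the room.
room = [Person(f"killer_{i}", is_killer=True) for i in range(1, 4)]

# Someone enters, kills one of them, and nobody leaves.
newcomer = Person("newcomer", is_killer=False)
room.append(newcomer)
room[0].alive = False
newcomer.is_killer = True  # killing someone makes the newcomer a killer

living_killers = count_killers(room)
all_killers = count_killers(room, include_dead=True)
print(living_killers, all_killers)
```

Whether the dead killer counts is a matter of definition, which is why the helper takes an `include_dead` flag instead of hard-coding one answer.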
What was the model's response to the postal envelope size restriction question?
-The model correctly identified that the given envelope size was within the postal office's restrictions by rotating the dimensions and considering that envelopes can be adjusted to fit within acceptable limits.
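The rotation check can be sketched as follows. The video does not show the actual minimum and maximum dimensions, so the limits below are made-up placeholders chosen only to illustrate the idea; the 200 mm x 275 mm envelope is the one read out in the transcript, and `fits` and `is_mailable` are hypothetical helper names.

```python
def fits(dims, min_dims, max_dims):
    """True if each dimension lies within its own [min, max] range."""
    return all(lo <= d <= hi for d, lo, hi in zip(dims, min_dims, max_dims))


def is_mailable(width, height, min_dims, max_dims):
    """Accept the envelope if it fits as given or rotated by 90 degrees."""
    return (fits((width, height), min_dims, max_dims)
            or fits((height, width), min_dims, max_dims))


# Placeholder limits in millimetres (assumed, not from the video).
MIN_DIMS = (140, 90)
MAX_DIMS = (324, 229)

as_given = fits((200, 275), MIN_DIMS, MAX_DIMS)
with_rotation = is_mailable(200, 275, MIN_DIMS, MAX_DIMS)
print(as_given, with_rotation)
```

With these placeholder limits the envelope fails as given and passes once rotated, which is the case the speaker says most other models miss.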
How did the model perform when asked how many words were in a response?
-The model successfully determined that the response contained five words, accurately counting the final output while disregarding the Chain of Thought background process.
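A self-referential answer like this can be checked mechanically. The sketch below assumes the response is the sentence quoted in the transcript and that the count is spelled out as an English number word; `NUMBER_WORDS`, `claimed_count`, and `is_self_consistent` are hypothetical names.

```python
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}


def claimed_count(response):
    """Return the first spelled-out number in the response, or None."""
    for token in response.lower().split():
        word = token.strip(".,!?")
        if word in NUMBER_WORDS:
            return NUMBER_WORDS[word]
    return None


def is_self_consistent(response):
    """True if the stated word count matches the actual word count."""
    return claimed_count(response) == len(response.split())


response = "This response contains five words"
consistent = is_self_consistent(response)
print(len(response.split()), consistent)
```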
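Did the o1 model succeed in answering Yann LeCun's North Pole walking problem?
-No, according to the speaker. The o1 model assumed the walker heads 1 km south, turns east, and follows a circle of latitude back to the starting point, and it answered exactly 2π km. The speaker rejects that reading and believes the walker never comes close to the starting point, while noting that a Twitter poll on the question produced many different opinions.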
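For reference, the number the model arrived at can be reproduced under its own reading of the puzzle, in which the walker follows a circle of latitude 1 km from the pole. The sketch assumes a perfectly spherical Earth with a mean radius of 6371 km. It does not settle which reading of the puzzle is the intended one.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a spherical Earth


def latitude_circle_circumference(distance_from_pole_km,
                                  radius_km=EARTH_RADIUS_KM):
    """Circumference of the circle of latitude at a given surface
    distance from the pole, on a sphere."""
    angle = distance_from_pole_km / radius_km  # central angle in radians
    return 2 * math.pi * radius_km * math.sin(angle)


circumference = latitude_circle_circumference(1.0)
print(f"{circumference:.9f} km vs 2*pi = {2 * math.pi:.9f} km")
```

On a sphere the result is marginally less than 2π km, because the circle's radius is R·sin(1/R) and not exactly 1 km; the difference is far too small to change which multiple-choice option it would correspond to.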
How does the model handle ethical or moral questions, such as whether it's acceptable to push someone to save humanity?
-The model first analyzed the scenario from multiple ethical perspectives and ultimately concluded that it is acceptable to gently push a person to save humanity. It provided a thoughtful breakdown of the ethical frameworks involved.
What was the o1 model’s response to the classic 'chicken or egg' question?
-The o1 model concluded that the egg came first, based on evolutionary processes where eggs existed before chickens in evolutionary history.
What improvements did the o1 model show in coding tasks, such as writing a Tetris game in Python?
-The o1 model showed a clear improvement in coding, writing a fully functional Tetris game in Python on the first attempt after thinking for just 35 seconds. In the speaker's previous test with the same prompt, it thought for more than 90 seconds and needed an error fed back to it before the code worked.
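One of the behaviours the speaker checks in the generated game is that a completed row disappears. The model's actual code is not shown in the source, so the following is only a minimal sketch of how row clearing is commonly written, with the grid assumed to be a list of rows where 0 means empty; `clear_full_rows` is a hypothetical name.

```python
def clear_full_rows(grid):
    """Remove completely filled rows and pad the top with empty rows.

    Returns the new grid and the number of rows cleared.
    """
    width = len(grid[0])
    remaining = [row for row in grid if not all(row)]
    cleared = len(grid) - len(remaining)
    empty_rows = [[0] * width for _ in range(cleared)]
    return empty_rows + remaining, cleared


grid = [
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],  # full row, should be cleared
    [1, 1, 0, 1],
]
new_grid, cleared = clear_full_rows(grid)
print(cleared)
for row in new_grid:
    print(row)
```

A real game would typically add to the score based on `cleared`, which matches the score display the speaker mentions.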
Outlines
😲 Discovering AI’s Usage of My Rubric in a New Model
The speaker excitedly discovers that OpenAI has used a variation of their marble question in its official announcement of the new o1 model. They speculate that OpenAI employees may watch their videos, as the scenario is nearly word for word a question from their LLM rubric. They express excitement about the opportunity to test the model, o1-preview, and expect it to do well on difficult questions.
🧠 Testing the AI’s Performance on Logical and Mathematical Challenges
The speaker begins testing the o1-preview model with a series of coding, logic, and mathematical challenges. They note significant improvements in the model’s performance, particularly on the task of writing a Tetris game in Python. The model thinks for less time than in the speaker's previous test and produces a working game with extra features such as a score and a next-shape preview. Additionally, the model solves a postal dimension problem by correctly considering the rotation of an envelope, counts the words in its own response accurately, and offers a nuanced answer to a logic puzzle about killers in a room.
🔍 Analyzing Nuanced Problem-Solving and Ethical Questions
Further tests show the model's strong reasoning abilities. It accurately explains the movement of a marble in a cup that is placed upside down and highlights nuances other models miss. The model also handles the killers logic puzzle and considers whether the dead killer belongs in the tally. However, it stumbles on a geography puzzle about walking from the North Pole, in line with Yann LeCun's claim that LLMs struggle with this question. The speaker praises the model’s reasoning process, noting its impressive results in several complex tasks.
🥚 Solving Philosophical Questions and Wrapping Up the Model’s Test
The speaker tests the o1-preview model with a philosophical question: 'Which came first, the chicken or the egg?' The AI answers from an evolutionary standpoint, stating that the egg came first. With only one question wrong throughout the test, the speaker concludes that o1-preview is the best model they have tested so far, outperforming others by grasping complex nuances and answering challenging questions with remarkable precision.
Keywords
💡OpenAI
💡LLM (Large Language Model)
💡Chain of Thought
💡Tetris
💡Postal Restrictions
💡Word Count
💡Killer Question
💡North Pole Scenario
💡Ethical Framework
💡Mathematical Formula
💡Chicken or the Egg
Highlights
OpenAI used a marble question from the user's rubric on their official website, replacing 'marble' with 'strawberry' in a physics scenario.
The new OpenAI model o1-preview was tested with the user's rubric, and the model performed better than previous versions.
o1-preview generated a fully functioning Tetris game in Python after about 35 seconds of thinking, a significant improvement over the speaker's previous test with the same prompt.
o1-preview successfully solved a postal envelope size question by rotating the envelope, a problem most other models struggled with.
The model accurately counted the number of words in its own response, showing it can logically assess and handle word count prompts.
In a nuanced logic question about the number of killers in a room, the model correctly concluded that there were three killers, including one dead killer.
In a marble and cup scenario, the model explained the marble's movements based on gravity and accurately predicted its final position on the table.
A geography problem about walking from the North Pole stumped the model, which was unable to correctly calculate the distance needed to return to the starting point.
o1-preview generated 10 correct sentences ending with the word 'Apple' after thinking for only six seconds.
The model correctly identified that the word 'Strawberry' contains three 'R's, performing well on this simple logic test.
It accurately compared decimal numbers, concluding that 9.9 is larger than 9.11, based on the analysis of the decimal parts.
When asked if it’s morally acceptable to gently push a random person to save humanity, the model's initial response explored various ethical frameworks, before ultimately concluding 'yes.'
The model solved a complex mathematical formula related to the minimal sphere and delivered a formatted and accurate solution after 52 seconds of calculation.
In the classic chicken or egg problem, the model concluded from an evolutionary standpoint that the egg came first.
The user found this model to be the best they've tested, with near-perfect performance except for one difficult geography question.
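Several of the simpler checks in the highlights above are easy to verify by hand or in code. The sketch below is only a way to confirm the expected answers; the example sentence is invented for illustration and is not one of the model's outputs, and `letter_positions` and `ends_with_word` are hypothetical helper names.

```python
from decimal import Decimal


def letter_positions(word, letter):
    """1-based positions of a letter in a word, ignoring case."""
    return [i for i, ch in enumerate(word.lower(), start=1)
            if ch == letter.lower()]


def ends_with_word(sentence, word):
    """True if the last word of the sentence, minus punctuation, matches."""
    words = sentence.rstrip(".!?\"' ").split()
    return bool(words) and words[-1].lower() == word.lower()


positions = letter_positions("Strawberry", "r")
larger = max(Decimal("9.11"), Decimal("9.9"))
ends_ok = ends_with_word("She handed me a shiny red apple.", "apple")

print(positions, larger, ends_ok)
```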
Transcripts
oh my goodness look at this OpenAI used
my marble question on the official
OpenAI website about the o1 announcement
assume the laws of physics on earth a
small strawberry so they replaced marble
with strawberry is put into a normal Cup
and the cup is placed upside down on a
table someone then takes the cup and
puts it inside the microwave where's the
strawberry now explain your reasoning
step by step this is nearly word for
word what I use in my LLM rubric
definitely makes me realize that maybe
OpenAI employees actually do watch some
of my videos so thanks for watching
thanks for including this it is so cool
OpenAI just dropped the Strawberry
Q* model it is now named o1 we have
access to it I'm going to test it in
full right now so here it is right there
o1 preview in my ChatGPT account we
also have o1 mini but we're going to use
o1 preview now I wouldn't be surprised
if o1 aced my rubric and I'm going to
have to come up with much more difficult
questions and not only that I have to
figure out how to actually judge the
much more difficult questions so first
write the game Tetris in Python and
thinking now if you watched my last
video it took about 90 plus seconds of
thinking before it actually output the
code when it began to Output the code it
was actually really fast but the
thinking part took a long time so we can
actually see the thinking going on here
this isn't the raw Chain of Thought and
they actually mentioned that in the
technical specification of the model
because they said they basically
put no censorship and no alignment on
the Chain of Thought itself and that's
why they're not exposing it to the user
but what we have here is kind of a
summary of the thinking and okay there
we go it started and it actually only
thought for 35 seconds this time as
compared to 90 plus seconds last time so
here's the code and last time I tried
this exact same prompt and it actually
failed the first time I gave it the
error and then it gave me the correct
code so let's see copy the code paste it
in here and let's give it a try press
any key to play Oh my God look at this
this is a full working Tetris game on
the first go 30 seconds of thinking and
it really looks good this is actually
much better than the previous test that
I did with the same model same question
and there it is let's just make sure the
row disappears and it gives me a score
this time it tells me what the next
shape is this is absolutely stunning
okay so that is without a doubt a flying
colors pass next the postal office has
size restrictions for mailable envelopes
then I give the minimum Dimensions I
give the maximum dimensions and you have
an envelope measuring give those
Dimensions does the given envelope fall
within the acceptable size range for
mailing according to the postal office
restrictions now the other models have
gotten this wrong mostly and the problem
is they don't consider that you can
actually rotate the envelope to make it
fit into the restrictions so let's see
if this model is able to do it and you
know I love to see the actual thinking
so let's see changing Dimensions
confirming Dimensions yes your envelope
measuring 200 mm X 275 mm is acceptable
so it converts it checks it here's the
requirements verification with checks
answer yes absolutely positively a pass
oh my God this is so cool next how many
words are in your response to this
prompt I already gave this question in
the previous video let's see if we can
get it again so figuring out the answer
determining the word count now all of
the Chain of Thought in the background
is probably not going to be counted
towards the count of words it's just
going to be the final output this
response contains five words 1 2 3 4 5
that's it perfect thought for 14 seconds
got it right now I tested this exact
thing in the last video and it had seven
words and it was also right this is the
first model that is not getting it right
by luck alone this is actually getting
it right because it's able to think
through it next the killer question
there are three killers in a room
someone enters the room and kills one of
them nobody leaves the room how many
killers are left in the room assessing
the scenario let's take a look at the
thoughts assessing the Dynamics so let
me see there are three killers initially
someone enters and kills one we're left
with two killers in the room okay so
here we go the output an event occurs
someone enters the room and kills one of
the killers important details nobody
leaves the room analyzing the
participants original Killers this
person killed somebody so they become a
killer by definition yes the killed
killer the killer who was killed is now
dead whether we count the dead killer
depends on whether we consider dead
individuals in our tally insane this is
the first model that actually got that
part right that actually determined hey
the person who's dead might still be
considered a killer so we have two
original alive Killers one new killer
there are three alive killers and one
dead killer conclusion same as initially
there are three killers left in the room
two original ones and one new killer
perfect answer and it actually got the
Nuance of the Dead killer still possibly
being a killer if we count them as such
next a marble is put in a glass cup the
glass is then turned upside down and put
on a table then the glass is picked up
and put in a microwave where's the
marble explain your reasoning step by
step let's go thinking let's see what
the thoughts are tracing the sequence
okay let me see the marble is initially
in a glass cup glass is turned upside
down okay marble movement I'm piecing
together the fate of the marble in an
upside down glass without a cover
gravity takes over the marble escapes
yes if done carefully and quickly placed
onto the table the marble can remain
inside the inverted glass now resting on
the table surface yes the inverted
glass on the table traps the marble
inside between the glass and the table
so when you turn that glass upside down
there's a possibility that the marble
just spills out somewhere else but it's
actually saying you turn it upside down
carefully and quickly and it can stay
inside the cup trapped between the cup
and the table which is 100% accurate
more Nuance than any other model that
I've seen when you lift the
inverted glass to move it to the
microwave the marble which was resting
on the table doesn't adhere to the glass
and remains on the table due to gravity
unless you tilt or flip the glass while
lifting it the marble will not re-enter
the interior of the glass okay let's
jump to the conclusion it is on the
table I knew it was going to get this
right this is a phenomenal model it is
on its way to acing the test all right
next for one that really again no other
model gets right this is the question
that Yann LeCun says LLMs just struggle if
not cannot answer at all imagine
standing at the North Pole of the earth
walk in any direction in a straight line
for 1 km now turn 90° to the left walk
as far as it takes to pass your starting
point have you walked and then we give
some answers let's see what it says I
believe that the answer is I never came
close to my starting point and I
actually put out a poll on Twitter and
there turns out to be a lot of different
opinions on what the answer is but here
we go charting the course I'm mapping
out a path from the North Pole thinking
through walking 1 km South then East
along the latitude Circle oddly enough
this doesn't quite reveal the options
but it's a start calculating the
distance I'm figuring out how far I need
to walk along the latitude Circle to
return to the starting point since the
starting point is 1 km south of the
North Pole that's not true this involves
a detailed calculation of the path
travel determining the latitude
circumference I'm calculating the
circumference of a circle of latitude
just 1 km south of the North Pole okay
assessing circumference fine let's see
starting at the North Pole you walk 1
km South you turn 90° to the left
which means you're now Facing East you
walk East along a circle of latitude
until you pass your starting point
that's just not right and the answer is
exactly 2 * pi km that's not right so
maybe Yann LeCun was right and at least
for this question it did fail so not a
perfect score next give me 10 sentences
that end in the word Apple clarifying
capitalization all right so thought for
6 seconds and yeah of course it got this
one right that's a pass now the next
question how many Rs are in strawberry
that's one that's part of my test and
it's actually one of the default
questions that it tells you to use so
let's just click it and see what happens
all right positions 3, 8, and 9 there it is
3, 8, 9 answer three perfect which number
is bigger 9.11 or 9.9 this one it should
be able to get easily comparing decimal
parts and yeah 9.9 is bigger than 9.11
that's a pass all right now for the
final test the morals question is it
acceptable to gently push a random
person if it could save Humanity from
Extinction now this type of question is
not what o1 is supposed to be known for
but we'll see how it does it's not logic
it's a little bit of reasoning but it's
mostly just a morals question so just
like usual it's giving us all the ways
to think about it conclusion whether it
is acceptable to gently push a random
person to save Humanity from Extinction
depends largely on the ethical framework
one adopts so it didn't actually tell us
but I'm going to follow up and tell it
give me a yes or no give me a yes or no
thinking I'm considering the guidelines
on violence and harassment hate policies
evaluating the scenario crafting useful
responses yes yes okay so it would and
yeah there it is it got it right so not
only did it give me all the ways to
think about it but it actually gave the
right answer in my opinion yeah you can
gently push somebody to save Humanity
all right so I'm giving it one of the
default prompts that o1 preview suggests
and we have a mathematical formula here
kind of a very complex one one that I
probably won't be able to solve myself
let's just see how it does calculating
minimal sphere determining the sphere's
radius addressing the problem
determining the problem look at this it
is so very impressive to see it breaking
down these complex problems into
thoughts and actually seeing the model
think through these questions all right
and here it is so thought for 52 seconds
I mean all the formatting is gorgeous
and the answer is 721 this is just very
very impressive and one more which came
first the chicken or the egg classic
problem thinking I wonder what it's
going to say so from a biological and
evolutionary perspective the egg came
first evolutionary process eggs precede
chickens historically answer the egg
came first it existed before the chicken
in evolutionary history okay great so
that's it as you can see this model is
phenomenal it is by far the best model
that I've ever tested it's actually not
even close a lot of other models got
most of the questions right but this is
the first time a model has gotten all
the nuances right and it only got that
one question wrong the one that Yann LeCun
posed and in fact I put it out as a post
on Twitter and people had different
responses and different answers so I
still think you will never return back
to your original point if you start at
the North Pole but let me know what you
think in the comments if you enjoyed
this video please consider giving a like
and subscribe and I'll see you in the
next one