GPT-o1: The Best Model I've Ever Tested 🍓 I Need New Tests!

Matthew Berman
13 Sept 2024, 10:57

Summary

TLDR: The video demonstrates the capabilities of OpenAI's new o1 model (o1-preview) by testing it against a range of prompts, from writing a Tetris game in Python to answering logic puzzles and a moral question. The model handles nuanced problems well and outperforms the models the presenter tested previously. It passes most challenges and shows a summary of its reasoning along the way, though the presenter marks its answer to a North Pole walking problem as wrong. The presenter appreciates how the model breaks down intricate questions and speculates that OpenAI employees may have drawn on their earlier videos for the strawberry example in the o1 announcement. Overall, the video highlights the o1 model's advances in reasoning and problem-solving.

Takeaways

  • 😀 The speaker is excited that a variation of their marble question was featured on OpenAI's website in the announcement of the new o1 model, previously referred to as 'Strawberry' or 'Q*'.
  • 🤖 The o1 model performs better than previous models and thinks faster than in the speaker's earlier test (35 seconds instead of 90-plus) when coding Tetris in Python.
  • 🧠 The model processes complex questions efficiently, like checking if an envelope fits size restrictions by considering rotation, demonstrating advanced problem-solving skills.
  • ✔️ It answers simple logical questions, such as counting the number of killers left in a room, while factoring in subtle nuances, such as the status of a dead killer.
  • 🍓 A marble question was tested where the o1 model accurately reasons that the marble would remain on the table when the cup is lifted and placed in the microwave.
  • 📏 The model struggles with a tricky geographical problem involving walking from the North Pole, confirming a known limitation.
  • 📊 It accurately tackles math and logic challenges, such as word counting, comparing numbers, and solving mathematical formulas.
  • 🌍 For moral dilemmas, like whether to push someone to save humanity, the model provides both nuanced analysis and a direct yes or no response when prompted.
  • 🐣 It concludes that the egg came before the chicken from an evolutionary standpoint, a classic problem with a clear answer based on scientific reasoning.
  • 🔍 The speaker is impressed by the o1 model's performance, noting that it solved complex tasks with high accuracy, save for one tricky geography question.

Q & A

  • What model is being tested in the video, and how does it perform compared to previous models?

    -The model being tested is OpenAI's o1 model (o1-preview), previously referred to as 'Strawberry' or 'Q*'. It performs exceptionally well compared to previous models, getting nearly all questions right and demonstrating an advanced level of reasoning.

  • What makes the o1 model's reasoning process stand out from other models?

    -The o1 model excels in its ability to think through questions and provide nuanced responses. Its detailed Chain of Thought and ability to analyze complex problems, such as distinguishing between a live and dead killer in a scenario, set it apart from other models.

  • How did the o1 model handle the 'marble in a cup' question?

    -The o1 model correctly reasoned that if the glass cup is turned upside down and placed on a table carefully and quickly, the marble can remain trapped between the inverted cup and the table. When the cup is lifted and moved to the microwave, the marble remains on the table unless the cup is tilted or flipped.

  • What was the reasoning behind the model's answer to the 'killers in a room' question?

    -The model reasoned that there are initially three killers in the room, and after one is killed, the person who kills becomes a new killer. It accounted for both living and dead individuals, concluding that there are still three killers (two original and one new).

  • What was the model's response to the postal envelope size restriction question?

    -The model correctly identified that the given envelope size was within the postal office's restrictions by rotating the dimensions and considering that envelopes can be adjusted to fit within acceptable limits.

  • How did the model perform when asked how many words were in a response?

    -The model successfully determined that the response contained five words, accurately counting the final output while disregarding the Chain of Thought background process.

  • Did the o1 model succeed in answering Yann LeCun's North Pole walking problem?

    -No. According to the presenter, the o1 model did not answer the North Pole walking problem correctly. It treated the walk as 1 km south followed by a walk east along a circle of latitude and answered exactly 2π km, whereas the presenter believes the walker never comes close to the starting point.

  • How does the model handle ethical or moral questions, such as whether it's acceptable to push someone to save humanity?

    -The model first analyzed the scenario from multiple ethical perspectives and ultimately concluded that it is acceptable to gently push a person to save humanity. It provided a thoughtful breakdown of the ethical frameworks involved.

  • What was the o1 model's response to the classic 'chicken or egg' question?

    -The o1 model concluded that the egg came first, based on evolutionary processes where eggs existed before chickens in evolutionary history.

  • What improvements did the o1 model show in coding tasks, such as writing a Tetris game in Python?

    -The o1 model wrote a fully functional Tetris game in Python on the first attempt after thinking for just 35 seconds. In the presenter's previous test with the same prompt, it thought for more than 90 seconds and needed an error fed back before producing working code.

Outlines

00:00

😲 Discovering AI’s Usage of My Rubric in a New Model

The speaker excitedly discovers that OpenAI has used a variation of their marble question in an official announcement about the new o1 model. They speculate that OpenAI employees may watch their videos, as this scenario is very similar to a question they included in their LLM rubric. They express excitement about the opportunity to test this model, o1-preview, and anticipate its performance on difficult questions.

05:01

🧠 Testing the AI’s Performance on Logical and Mathematical Challenges

The speaker begins testing the o1-preview model with a series of coding, logical, and mathematical challenges. They note significant improvements in the model's performance, particularly in writing a Tetris game in Python. The model completes the task faster than in the speaker's previous test, producing a working game with extras such as a score and a next-shape preview. Additionally, the model solves a postal dimension problem by correctly considering the rotation of an envelope, demonstrates accuracy in word counting, and offers a nuanced answer to a logic puzzle about killers in a room.

10:01

🔍 Analyzing Nuanced Problem-Solving and Ethical Questions

Further tests show the model's strong reasoning abilities. It accurately explains the movement of a marble in a cup when placed upside down and highlights nuances other models miss. It also handles the killers-in-a-room logic puzzle, noting that the dead killer may still count in the tally. However, it stumbles on a geography-related puzzle about walking near the North Pole, consistent with Yann LeCun's claim that LLMs struggle with it. The speaker praises the model's reasoning process, noting its impressive results in several complex tasks.

🥚 Solving Philosophical Questions and Wrapping Up the Model’s Test

The speaker tests the o1-preview model with a philosophical question: 'Which came first, the chicken or the egg?' The AI answers from an evolutionary standpoint, stating that the egg came first. With only one question wrong throughout the test, the speaker concludes that o1-preview is the best model they have tested so far, outperforming others by grasping complex nuances and answering challenging questions with remarkable precision.

Keywords

💡OpenAI

OpenAI is an artificial intelligence research laboratory known for developing AI models like GPT. In the video, the presenter points out that the official OpenAI website uses a 'strawberry' variation of their marble question in the o1 announcement, suggesting a connection between their content and OpenAI's examples.

💡LLM (Large Language Model)

LLM refers to large-scale AI models designed to process and generate human-like text based on the input they receive. The video discusses the capabilities of such models, particularly OpenAI's new model named 'o1', which is being tested for its language processing and generation abilities.

💡Chain of Thought

'Chain of Thought' is a technique in which an AI model solves a problem by breaking it down into smaller, logical steps. The video notes that the o1 model's raw Chain of Thought is not exposed to users; the interface shows a summary of the thinking instead, as seen while the model works on tasks such as writing a Tetris game in Python.

💡Tetris

Tetris is a classic video game that the presenter challenges the AI model to code. The model's ability to generate a working Tetris game quickly is used as a benchmark of its programming and logical capabilities.

💡Postal Restrictions

This term refers to the size limitations imposed by postal services for mailing envelopes. The video uses a hypothetical scenario to test the AI's ability to understand and apply these restrictions, showcasing its problem-solving skills.
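The rotation check described above can be sketched in a few lines of Python. This is only an illustration: the minimum and maximum dimensions below are hypothetical placeholders (the summary does not state the limits used in the video), and the function names are invented for this sketch. Only the 200 mm x 275 mm envelope comes from the video.

```python
from typing import Tuple

Size = Tuple[float, float]  # (length_mm, width_mm)

# Hypothetical limits for illustration only; the video summary does not state them.
MIN_SIZE: Size = (140.0, 90.0)
MAX_SIZE: Size = (324.0, 229.0)


def fits_as_given(envelope: Size, min_size: Size = MIN_SIZE, max_size: Size = MAX_SIZE) -> bool:
    """Check the envelope in the orientation given, without rotating it."""
    return all(lo <= side <= hi for side, lo, hi in zip(envelope, min_size, max_size))


def fits_with_rotation(envelope: Size, min_size: Size = MIN_SIZE, max_size: Size = MAX_SIZE) -> bool:
    """Accept the envelope if either orientation satisfies the limits."""
    length, width = envelope
    return (fits_as_given((length, width), min_size, max_size)
            or fits_as_given((width, length), min_size, max_size))


envelope = (200.0, 275.0)  # the 200 mm x 275 mm envelope from the video
print(fits_as_given(envelope))       # False under the assumed limits
print(fits_with_rotation(envelope))  # True under the assumed limits
```

Under these assumed limits, the envelope is rejected as given but accepted once rotated, which is the step the presenter says most other models miss.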

💡Word Count

The presenter asks the AI to determine the number of words in a given response. This tests the AI's ability to analyze text and perform quantitative assessments, which is crucial for tasks involving text processing.
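A response to this prompt can be verified mechanically. The sketch below checks the five-word answer quoted in the video; the helper names and the small number-word table are assumptions made for this example, not anything shown in the video.

```python
from typing import Optional

# Small lookup table so a spelled-out count such as "five" can be checked.
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def claimed_count(text: str) -> Optional[int]:
    """Return the first number the text states, as a digit or a number word."""
    for token in text.lower().split():
        token = token.strip(".,!?")
        if token in NUMBER_WORDS:
            return NUMBER_WORDS[token]
        if token.isdigit():
            return int(token)
    return None


def is_self_consistent(text: str) -> bool:
    """True when the stated count equals the actual word count."""
    claimed = claimed_count(text)
    return claimed is not None and claimed == word_count(text)


print(is_self_consistent("This response contains five words"))  # True
```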

💡Killer Question

The 'killer question' is a logic puzzle from the presenter's rubric: three killers are in a room, someone enters and kills one of them, and nobody leaves. The AI must deduce how many killers are left in the room, including whether the newcomer and the dead killer count.

💡North Pole Scenario

This scenario involves a thought experiment where one starts at the North Pole and walks south, then east. It's used to test the AI's geographical and mathematical reasoning. Despite the AI's impressive performance, it fails this question, indicating the complexity of the scenario.
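The latitude-circle calculation that the model describes in its reasoning can be reproduced numerically. The sketch below assumes a perfectly spherical Earth with a mean radius of 6371 km (an assumption, not a figure from the video) and uses the formula C = 2πR·sin(d/R) for the circle of points at surface distance d from the pole. It does not settle which answer the puzzle intends, since the wording of "pass your starting point" is debated in the video.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a spherical Earth


def latitude_circle_circumference_km(distance_from_pole_km: float,
                                     radius_km: float = EARTH_RADIUS_KM) -> float:
    """Circumference of the circle of points at a given surface distance from the pole."""
    angle = distance_from_pole_km / radius_km  # central angle in radians
    return 2.0 * math.pi * radius_km * math.sin(angle)


on_sphere = latitude_circle_circumference_km(1.0)
on_flat_plane = 2.0 * math.pi * 1.0  # what a flat-plane approximation gives

print(f"sphere: {on_sphere:.9f} km")
print(f"plane:  {on_flat_plane:.9f} km")
print(on_sphere < on_flat_plane)  # the spherical value is slightly smaller
```

On these assumptions the circle 1 km from the pole is a tiny fraction of a millimetre shorter than 2π km, so the curvature of the Earth barely changes the number at this scale.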

💡Ethical Framework

The AI is asked to consider whether it's acceptable to push a person to save humanity, which requires an understanding of ethical frameworks. The AI's response reflects its capability to engage with moral and ethical considerations.

💡Mathematical Formula

The video presents a complex mathematical formula to the AI as a challenge. The AI's ability to calculate and provide a solution reflects its mathematical processing skills and its potential use in scientific and technical fields.

💡Chicken or the Egg

This classic philosophical question is used to test the AI's reasoning and understanding of evolutionary concepts. The AI's answer that the egg came first aligns with scientific consensus, demonstrating its ability to apply knowledge from various domains.

Highlights

OpenAI used a marble question from the user's rubric on their official website, replacing 'marble' with 'strawberry' in a physics scenario.

The new OpenAI model o1-preview was tested with the user's rubric, and the model performed better than previous versions.

o1-preview generated a fully functioning Tetris game in Python on the first attempt after about 35 seconds of thinking, showcasing significant improvements over past performance.

o1-preview successfully solved a postal envelope size question by rotating the envelope, a problem most other models struggled with.

The model accurately counted the number of words in its own response, showing it can logically assess and handle word count prompts.

In a nuanced logic question about the number of killers in a room, the model correctly concluded that there were three killers, including one dead killer.

In a marble and cup scenario, the model explained the marble's movements based on gravity and accurately predicted its final position on the table.

A geography problem about walking from the North Pole stumped the model, which was unable to correctly calculate the distance needed to return to the starting point.

o1-preview generated 10 correct sentences ending with the word 'Apple' after processing for only six seconds.

The model correctly identified that the word 'Strawberry' contains three 'R's, performing well on this simple logic test.

It accurately compared decimal numbers, concluding that 9.9 is larger than 9.11, based on the analysis of the decimal parts.
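The letter-count and decimal-comparison checks in the two highlights above are easy to reproduce. The sketch below is only a verification aid, with a helper name invented for this example; it uses Python's standard `decimal` module so that 9.9 and 9.11 are compared as exact decimal values.

```python
from decimal import Decimal
from typing import List


def letter_positions(word: str, letter: str) -> List[int]:
    """Return the 1-based positions at which a letter occurs, ignoring case."""
    return [i for i, ch in enumerate(word.lower(), start=1) if ch == letter.lower()]


positions = letter_positions("Strawberry", "r")
print(positions, len(positions))  # [3, 8, 9] 3

# 9.9 is 9.90, and 0.90 is greater than 0.11, so 9.9 is the larger number.
print(Decimal("9.9") > Decimal("9.11"))  # True
```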

When asked if it’s morally acceptable to gently push a random person to save humanity, the model's initial response explored various ethical frameworks, before ultimately concluding 'yes.'

The model solved a complex mathematical formula related to the minimal sphere and delivered a formatted and accurate solution after 52 seconds of calculation.

In the classic chicken or egg problem, the model concluded from an evolutionary standpoint that the egg came first.

The user found this model to be the best they've tested, with near-perfect performance except for one difficult geography question.

Transcripts

play00:00

oh my goodness look at this OpenAI used

play00:04

my marble question on the official OpenAI

play00:06

website about the o1 announcement

play00:10

assume the laws of physics on earth a

play00:12

small strawberry so they replaced marble

play00:14

with strawberry is put into a normal Cup

play00:17

and the cup is placed upside down on a

play00:19

table someone then takes the cup and

play00:21

puts it inside the microwave where's the

play00:22

strawberry now explain your reasoning

play00:24

step by step this is nearly word for

play00:28

word what I use in my LLM rubric

play00:31

definitely makes me realize that maybe

play00:33

OpenAI employees actually do watch some

play00:34

of my videos so thanks for watching

play00:37

thanks for including this it is so cool

play00:39

OpenAI just dropped the Strawberry

play00:41

Q* model it is now named o1 we have

play00:44

access to it I'm going to test it in

play00:46

full right now so here it is right there

play00:49

o1-preview in my ChatGPT account we

play00:52

also have o1-mini but we're going to use

play00:54

o1-preview now I wouldn't be surprised

play00:56

if o1 aced my rubric and I'm going to

play00:59

have to come up with much more difficult

play01:01

questions and not only that I have to

play01:03

figure out how to actually judge the

play01:04

much more difficult questions so first

play01:06

write the game Tetris in Python and

play01:09

thinking now if you watched my last

play01:11

video it took about 90 plus seconds of

play01:12

thinking before it actually output the

play01:14

code when it began to Output the code it

play01:16

was actually really fast but the

play01:18

thinking part took a long time so we can

play01:21

actually see the thinking going on here

play01:23

this isn't the raw Chain of Thought and

play01:25

they actually mentioned that in the

play01:27

technical specification of the model

play01:29

because they said they basically

play01:30

put no censorship and no alignment on

play01:32

the Chain of Thought itself and that's

play01:34

why they're not exposing it to the user

play01:36

but what we have here is kind of a

play01:39

summary of the thinking and okay there

play01:41

we go it started and it actually only

play01:44

thought for 35 seconds this time as

play01:45

compared to 90 plus seconds last time so

play01:48

here's the code and last time I tried

play01:50

this exact same prompt and it actually

play01:52

failed the first time I gave it the

play01:53

error and then it gave me the correct

play01:55

code so let's see copy the code paste it

play01:58

in here and let's give it a try press

play02:00

any key to play Oh my God look at this

play02:04

this is a full working Tetris game on

play02:07

the first go 30 seconds of thinking and

play02:10

it really looks good this is actually

play02:12

much better than the previous test that

play02:15

I did with the same model same question

play02:18

and there it is let's just make sure the

play02:20

row disappears and it gives me a score

play02:22

this time it tells me what the next

play02:24

shape is this is absolutely stunning

play02:27

okay so that is without a doubt a flying

play02:31

colors pass next the postal office has

play02:34

size restrictions for mailable envelopes

play02:36

then I give the minimum Dimensions I

play02:39

give the maximum dimensions and you have

play02:40

an envelope measuring give those

play02:42

Dimensions does the given envelope fall

play02:44

within the acceptable size range for

play02:45

mailing according to the postal office

play02:47

restrictions now the other models have

play02:49

gotten this wrong mostly and the problem

play02:52

is they don't consider that you can

play02:53

actually rotate the envelope to make it

play02:55

fit into the restrictions so let's see

play02:58

if this model is able to do it and you

play03:00

know I love to see the actual thinking

play03:02

so let's see changing Dimensions

play03:04

confirming Dimensions yes your envelope

play03:07

measuring 200 mm X 275 mm is acceptable

play03:11

so it converts it checks it here's the

play03:13

requirements verification with checks

play03:16

answer yes absolutely positively a pass

play03:20

oh my God this is so cool next how many

play03:23

words are in your response to this

play03:25

prompt I already gave this question in

play03:27

the previous video let's see if we can

play03:28

get it again so figuring out the answer

play03:30

determining the word count now all of

play03:32

the Chain of Thought in the background

play03:34

is probably not going to be counted

play03:35

towards the count of words it's just

play03:38

going to be the final output this

play03:40

response contains five words 1 2 3 4 5

play03:43

that's it perfect thought for 14 seconds

play03:46

got it right now I tested this exact

play03:49

thing in the last video and it had seven

play03:51

words and it was also right this is the

play03:54

first model that is not getting it right

play03:57

by luck alone this is actually getting

play03:59

it right because it's able to think

play04:01

through it next the killer question

play04:03

there are three killers in a room

play04:04

someone enters the room and kills one of

play04:05

them nobody leaves the room how many

play04:07

killers are left in the room assessing

play04:09

the scenario let's take a look at the

play04:10

thoughts assessing the Dynamics so let

play04:13

me see there are three killers initially

play04:15

someone enters and kills one we're left

play04:16

with two killers in the room okay so

play04:19

here we go the output an event occurs

play04:21

someone enters the room and kills one of

play04:22

the killers important details nobody

play04:23

leaves the room analyzing the

play04:25

participants original Killers this

play04:27

person killed somebody so they become a

play04:28

killer by definition yes the killed

play04:30

killer the killer who was killed is now

play04:33

dead whether we count the dead killer

play04:35

depends on whether we consider dead

play04:37

individuals in our tally insane this is

play04:39

the first model that actually got that

play04:41

part right that actually determined hey

play04:43

the person who's dead might still be

play04:45

considered a killer so we have two

play04:47

original alive Killers one new killer

play04:49

there are three alive killers and one

play04:51

dead killer conclusion same as initially

play04:55

there are three killers left in the room

play04:56

two original ones and one new killer

play04:59

perfect answer and it actually got the

play05:01

Nuance of the Dead killer still possibly

play05:03

being a killer if we count them as such

play05:05

next a marble is put in a glass cup the

play05:08

glass is then turned upside down and put

play05:09

on a table then the glass is picked up

play05:12

and put in a microwave where's the

play05:13

marble explain your reasoning step by

play05:14

step let's go thinking let's see what

play05:17

the thoughts are tracing the sequence

play05:19

okay let me see the marble is initially

play05:21

in a glass cup glass is turned upside

play05:23

down okay marble movement I'm piecing

play05:25

together the fate of the marble in an

play05:27

upside down glass without a cover

play05:29

gravity takes over the marble escapes

play05:31

yes if done carefully and quickly placed

play05:34

onto the table the marble can remain

play05:35

inside the inverted glass now resting on

play05:37

the table surface yes the inverted

play05:40

glass on the table traps the marble

play05:41

inside between the glass and the table

play05:43

so when you turn that glass upside down

play05:46

there's a possibility that the marble

play05:47

just spills out somewhere else but it's

play05:49

actually saying you turn it upside down

play05:51

carefully and quickly and it can stay

play05:53

inside the cup trapped between the cup

play05:55

and the table which is 100% accurate

play05:57

more Nuance than any other model that

play05:58

I've seen when you lift the

play06:00

inverted glass to move it to the

play06:01

microwave the marble which was resting

play06:03

on the table doesn't adhere to the glass

play06:04

and remains on the table due to gravity

play06:06

unless you tilt or flip the glass while

play06:09

lifting it the marble will not re-enter

play06:10

the interior of the glass okay let's

play06:12

jump to the conclusion it is on the

play06:14

table I knew it was going to get this

play06:16

right this is a phenomenal model it is

play06:19

on its way to acing the test all right

play06:21

next for one that really again no other

play06:23

model gets right this is the question

play06:26

that Yann LeCun says LLMs just struggle if

play06:29

not cannot answer at all imagine

play06:31

standing at the North Pole of the earth

play06:33

walk in any direction in a straight line

play06:35

for 1 km now turn 90° to the left walk

play06:38

as far as it takes to pass your starting

play06:40

point have you walked and then we give

play06:41

some answers let's see what it says I

play06:44

believe that the answer is I never came

play06:46

close to my starting point and I

play06:48

actually put out a poll on Twitter and

play06:49

there turns out to be a lot of different

play06:50

opinions on what the answer is but here

play06:52

we go charting the course I'm mapping

play06:54

out a path from the North Pole thinking

play06:56

through walking 1 km South then East

play06:58

along the latitude Circle oddly enough

play07:01

this doesn't quite reveal the options

play07:02

but it's a start calculating the

play07:04

distance I'm figuring out how far I need

play07:06

to walk along the latitude Circle to

play07:08

return to the starting point since the

play07:10

starting point is 1 km south of the

play07:12

North Pole that's not true this involves

play07:15

a detailed calculation of the path

play07:18

travel determining the latitude

play07:20

circumference I'm calculating the

play07:22

circumference of a circle of latitude

play07:24

just 1 km south of the North Pole okay

play07:27

assessing circumference fine let's see

play07:29

starting at the North Pole you walk 1

play07:31

km south you turn 90° to the left

play07:33

which means you're now Facing East you

play07:35

walk East along a circle of latitude

play07:37

until you pass your starting point

play07:39

that's just not right and the answer is

play07:41

exactly 2 * pi km that's not right so

play07:44

maybe Yann LeCun was right and at least

play07:47

for this question it did fail so not a

play07:49

perfect score next give me 10 sentences

play07:52

that end in the word Apple clarifying

play07:55

capitalization all right so thought for

play07:57

6 seconds and yeah of course it got this

play08:00

one right that's a pass now the next

play08:02

question how many RS are in Strawberry

play08:05

that's one that's part of my test and

play08:07

it's actually one of the default

play08:09

questions that it tells you to use so

play08:11

let's just click it and see what happens

play08:13

all right positions 3, 8 and 9 there it is

play08:16

3, 8, 9 answer three perfect which number

play08:20

is bigger 9.11 or 9.9 this one it should

play08:24

be able to get easily comparing decimal

play08:26

parts and yeah 9.9 is bigger than 9.11

play08:29

that's a pass all right now for the

play08:31

final test the morals question is it

play08:33

acceptable to gently push a random

play08:34

person if it could save Humanity from

play08:36

Extinction now this type of question is

play08:37

not what o1 is supposed to be known for

play08:40

but we'll see how it does it's not logic

play08:42

it's a little bit of reasoning but it's

play08:44

mostly just a morals question so just

play08:47

like usual it's giving us all the ways

play08:49

to think about it conclusion whether it

play08:51

is acceptable to gently push a random

play08:53

person to save Humanity from Extinction

play08:55

depends largely on the ethical framework

play08:57

one adopts so it didn't actually tell us

play08:59

but I'm going to follow up and tell it

play09:02

give me a yes or no give me a yes or no

play09:05

thinking I'm considering the guidelines

play09:07

on violence and harassment hate policies

play09:09

evaluating the scenario crafting useful

play09:11

responses yes yes okay so it would and

play09:15

yeah there it is it got it right so not

play09:18

only did it give me all the ways to

play09:19

think about it but it actually gave the

play09:21

right answer in my opinion yeah you can

play09:23

gently push somebody to save Humanity

play09:25

all right so I'm giving it one of the

play09:27

default prompts that o1-preview suggests

play09:30

and we have a mathematical formula here

play09:32

kind of a very complex one one that I

play09:34

probably won't be able to solve myself

play09:36

let's just see how it does calculating

play09:38

minimal sphere determining the sphere's

play09:40

radius addressing the problem

play09:41

determining the problem look at this it

play09:44

is so very impressive to see it breaking

play09:46

down these complex problems into

play09:49

thoughts and actually seeing the model

play09:51

think through these questions all right

play09:53

and here it is so thought for 52 seconds

play09:55

I mean all the formatting is gorgeous

play09:58

and the answer is 721 this is just very

play10:01

very impressive and one more which came

play10:03

first the chicken or the egg classic

play10:06

problem thinking I wonder what it's

play10:08

going to say so from a biological and

play10:10

evolutionary perspective the egg came

play10:12

first evolutionary process eggs precede

play10:15

chickens historically answer the egg

play10:17

came first it existed before the chicken

play10:19

in evolutionary history okay great so

play10:21

that's it as you can see this model is

play10:23

phenomenal it is by far the best model

play10:26

that I've ever tested it's actually not

play10:27

even close a lot of other models got

play10:30

most of the questions right but this is

play10:31

the first time a model has gotten all

play10:33

the nuances right and it only got that

play10:36

one question wrong the one that Yann LeCun

play10:39

posed and in fact I put it out as a post

play10:42

on Twitter and people had different

play10:43

responses and different answers so I

play10:46

still think you will never return back

play10:48

to your original point if you start at

play10:49

the North Pole but let me know what you

play10:51

think in the comments if you enjoyed

play10:53

this video please consider giving a like

play10:54

and subscribe and I'll see you in the

play10:56

next one

Related Tags
AI testing, Q-Star model, logic puzzles, moral dilemmas, AI reasoning, machine learning, Tetris coding, OpenAI models, AI evolution, model performance