I was Wrong About ChatGPT's New o1 Model

Skill Leap AI
16 Sept 202413:38

Summary

TLDRIn this video, the creator tests the new GPT-1 model's capabilities by comparing it with a custom GPT model using Chain of Thought prompting. They conduct an IQ and math test, aiming to evaluate the model's logical reasoning and mathematical prowess. The custom GPT, despite not being specialized for math, performs surprisingly well, closely matching the GPT-1's results. The video suggests that while GPT-1 shows improvement, it isn't the significant leap in performance that was initially anticipated, leading to a tie between the two models in the tests conducted.

Takeaways

  • 🔍 The video compares the new GPT-1 model's performance with a custom GPT model using the Chain of Thought prompting technique.
  • 🆕 The GPT-1 model is claimed to excel in logic and reasoning tasks, particularly in math, due to its fine-tuning for step-by-step problem-solving.
  • 📝 The video creator built a custom GPT model with specific instructions to mimic the Chain of Thought prompting, making it publicly available for others to use.
  • ⚖️ A series of IQ and math questions were used to test and compare the performance of the GPT-1 model against the custom GPT model.
  • 🤖 Both models were presented with the same questions to ensure a fair comparison, with the video showcasing their step-by-step thought processes.
  • 📉 The custom GPT model, despite not being specialized for math, performed surprisingly well, coming close to the GPT-1 model's performance.
  • 📊 The video revealed that the GPT-1 model did not show a significant leap in performance over the custom model in the math and logic tests conducted.
  • 🔗 The video description includes a link to the custom GPT model for viewers to try it out and compare the models themselves.
  • 🤔 The video creator expresses initial skepticism about the GPT-1 model's advertised improvements, suggesting it may not be as groundbreaking as first impressions suggested.
  • ⏱️ The GPT-1 model took longer to process some questions, indicating a more in-depth analysis but not always leading to correct answers.

Q & A

  • What is the main focus of the video?

    -The main focus of the video is to compare the performance of the new GPT-3 model (referred to as '01 preview') with a custom GPT model using Chain of Thought prompting on IQ and math problems.

  • What is the Chain of Thought prompting technique?

    -The Chain of Thought prompting technique is a method where the AI is instructed to think step-by-step, understand the problem, break down the reasoning process, explain each step, and review the thought process for errors before providing an answer.

  • How does the video creator plan to test the AI models?

    -The video creator plans to test the AI models by giving them five IQ-related questions to assess logic and reasoning, and five math questions to evaluate their performance in problem-solving, as math is where the new model claims to excel.

  • What is the purpose of creating a custom GPT model in the video?

    -The purpose of creating a custom GPT model is to replicate the Chain of Thought prompting technique and to compare its performance with the new GPT-3 model, providing a baseline for comparison.

  • How does the video creator ensure a fair comparison between the models?

    -The video creator ensures a fair comparison by using the same set of questions for both the custom GPT model and the new GPT-3 model, and by presenting the questions in the same format to both models.

  • What was the outcome of the IQ test in the video?

    -The outcome of the IQ test was a tie between the custom GPT model and the new GPT-3 model, as both made the same mistake on one question and answered the rest correctly.

  • What was the performance of the new GPT-3 model on math questions according to the video?

    -The new GPT-3 model performed well on math questions but not as exceptionally as the benchmarks suggested, with the video creator concluding that it was not a significant improvement over the custom GPT model with Chain of Thought prompting.

  • What was the video creator's initial impression of the new GPT-3 model?

    -The video creator's initial impression of the new GPT-3 model was that it might be a significant improvement over previous models, especially in math and logic, but after deeper testing, they found it to be not as groundbreaking as initially thought.

  • What is the video creator's conclusion about the new GPT-3 model after the tests?

    -The video creator's conclusion is that the new GPT-3 model, while performing well, does not show a giant leap in performance over the custom GPT model with Chain of Thought prompting, and they would call it a tie in their tests.

  • How does the video creator plan to share the custom GPT model?

    -The video creator plans to make the custom GPT model publicly available and will provide a link to it in the video description for viewers to use and test.

Outlines

00:00

🤖 Testing GPT Models for IQ and Math

The script discusses a comparison between the new GPT model and a custom GPT model created by the author. The author initially had high hopes for the new model but found that it had limitations. To test the models, the author switched between different accounts and reset usage limits. The custom GPT was created with specific instructions for step-by-step problem-solving, which the new model also claims to use. The author plans to make this custom GPT publicly available and demonstrates its creation process. A series of IQ and math questions are then posed to both models to evaluate their logical reasoning and mathematical abilities. The results show that both models perform similarly, with the new model not showing a significant advantage in the tests conducted.

05:00

📊 Analyzing Model Performance on SAT Math Questions

This paragraph delves into the performance of the custom GPT model and the new model on a set of challenging SAT math questions. The author presents the questions and the step-by-step thought processes of both models as they attempt to solve them. The custom GPT model, despite not being specialized for math, manages to provide correct answers in some cases, while the new model, which is claimed to excel in math, makes mistakes. The author notes that the new model's detailed thought process is more in-depth but also slower. The results from the math test show a close competition between the two models, with the new model only slightly ahead by one correct answer out of the questions tested.

10:01

🔍 Final Thoughts on Model Comparison and Future Testing

In the concluding paragraph, the author reflects on the initial impressions of the new GPT model and the results of the in-depth testing. The author had initially thought the new model would be a significant improvement over previous versions, especially in math and logic, but the tests did not show a substantial leap in performance. The author admits that the test is not scientific and is based on a limited number of questions. The author invites viewers to conduct their own tests and share their findings, indicating a willingness to continue exploring and comparing the capabilities of different AI models.

Mindmap

Keywords

💡GPT

GPT stands for Generative Pre-trained Transformer, a type of deep learning model that has been pre-trained on a large corpus of text data to generate human-like text. In the context of the video, GPT refers to the AI models used for testing, including the custom GPT created by the video creator and the new model GPT-1. The video discusses how these models perform on various logic and math problems, showcasing their capabilities and limitations.

💡Chain of Thought prompting

This is a technique used to guide AI models to think step-by-step and logically through problems. It involves breaking down complex problems into smaller parts and solving them sequentially. In the video, the creator mentions using this technique to fine-tune a GPT model, aiming to improve its performance on logic and reasoning tasks. The script illustrates this by showing how the model is prompted to 'understand the problem, break down the reasoning process, explain each step, and arrive at the final answer.'

💡Custom GPT

A custom GPT refers to a version of the GPT model that has been tailored with specific instructions or datasets to perform certain tasks. In the video, the creator has built a custom GPT by providing it with a set of instructions for Chain of Thought prompting. This custom model is then compared with the new GPT-1 model to test its performance on logic and math problems, demonstrating how customization can influence AI model outcomes.

💡IQ test

An IQ test is a series of problems designed to measure human intelligence. In the video, the IQ test is used as a benchmark to evaluate the logical and reasoning capabilities of the AI models. The creator poses several IQ-related questions to both the custom GPT and GPT-1 models to see how they perform compared to human intelligence, highlighting the challenge of replicating human cognitive abilities in AI.

💡Math test

A math test in this context refers to a series of mathematical problems used to assess the computational and logical capabilities of the AI models. The video includes a set of math problems, some of which are taken from the '15 hardest SAT Math questions,' to test the models' ability to solve complex mathematical problems. The results of these tests are used to compare the performance of the custom GPT and GPT-1 models.

💡Benchmark

A benchmark in the context of the video is a standard or point of reference used to evaluate the performance of the AI models. The script mentions that the GPT-1 model showed a significant improvement in benchmarks, particularly in math, when using the Chain of Thought prompting technique. Benchmarks help to quantify and compare the effectiveness of different AI models in solving problems.

💡Logic and reasoning

Logic and reasoning are cognitive processes that involve using valid reasoning to form judgments. In the video, the creator tests the AI models' ability to perform logical and reasoning tasks through IQ tests and math problems. The models' performance on these tasks is indicative of their ability to mimic human thought processes and solve problems in a step-by-step manner.

💡Fine-tuning

Fine-tuning in AI refers to the process of adjusting a pre-trained model to perform a specific task or function. In the video, the creator mentions that the GPT-1 model appears to have been fine-tuned to perform step-by-step reasoning, as evidenced by its performance on logic and math problems. Fine-tuning is a key aspect of improving AI model performance for specific applications.

💡Model comparison

Model comparison is the process of evaluating and contrasting the performance of different AI models to determine their relative strengths and weaknesses. In the video, the creator compares the custom GPT model with the new GPT-1 model by testing them on IQ and math problems. This comparison helps to identify which model performs better under certain conditions and provides insights into their capabilities.

💡SAT Math questions

The SAT Math questions are part of the Scholastic Assessment Test (SAT), which is a standardized test widely used for college admissions in the United States. In the video, the creator uses some of the hardest SAT Math questions to challenge the AI models, aiming to push their mathematical problem-solving capabilities to the limit. The SAT questions serve as a rigorous test for the models' logical and computational skills.

Highlights

The new model inside Chat GPT1's preview may not be as impressive as initially thought.

The testing was done across different accounts due to limits being reset after high usage.

A custom GPT model was created and will be made publicly available.

Custom GPTs are mini versions of Chat GPT that allow for personalized instructions.

The model was fine-tuned using Chain of Thought prompting to think step-by-step and correct mistakes.

The custom GPT and the new model were tested on IQ and math questions to evaluate logic and reasoning.

Both models struggled with math-related questions, which are typically challenging for AI.

The custom GPT and the new model provided the same answer for the first math question.

The new model's Chain of Thought prompting technique was compared to the custom GPT's performance.

Both models correctly identified false in a true/false question about number combinations.

The new model provided a more detailed thought process for sequential reasoning questions.

The custom GPT and the new model both made the same mistake on a question about identifying the least similar option.

In the IQ test, both models performed equally, with one incorrect answer each.

The new model claimed to excel in math, but the custom GPT model with Chain of Thought prompting performed similarly.

The new model took longer to process math questions but did not consistently outperform the custom GPT.

The test results showed no significant difference between the custom GPT and the new model in math and logic.

The video concludes that the new model is not a giant leap in performance as initially perceived.

Transcripts

play00:00

the new model inside of chat gpt1

play00:02

preview may not be as good as I thought

play00:04

when I first tested it I had more time

play00:06

to test it I switched between a couple

play00:08

of different accounts they actually

play00:09

reset how many times you could use it

play00:11

they reset the limit because a lot of

play00:13

people ran out very quickly so I did

play00:16

some more testing and in this video I'm

play00:18

going to dive much deeper so what I did

play00:20

is I'm going to test this 01 preview in

play00:22

this window and in this other window I

play00:25

try to replicate what that oan preview

play00:28

is doing in the background with a custom

play00:30

GPT so this gp01 clone I'll make this

play00:34

publicly available I'll link it in the

play00:35

description below this video and I'll

play00:37

show you how I made it so all I did was

play00:40

I created this custom GPT by the way if

play00:43

you never used custom gpts before

play00:45

there're a mini version of chat GPT

play00:47

where you could give it your own set of

play00:49

instructions so that's what I did here

play00:51

and you could upload files and things

play00:53

like that I have ton of videos about

play00:54

building custom gpts on this channel but

play00:57

let me show you exactly what I did with

play00:59

this one because technically the model

play01:02

that they just released works like this

play01:04

in the background they kind of

play01:06

fine-tuned a GPT model it looks like to

play01:09

do exactly this you are an AI assistant

play01:12

designed to Think Through problems step

play01:14

by step using Chain of Thought prompting

play01:17

so this is the prompting technique they

play01:19

officially said in their documentation

play01:22

this is the prompting technique that

play01:24

they used to get the model to think step

play01:26

by step and try to correct its own

play01:29

mistakes before providing any answers

play01:31

you must understand the problem break

play01:35

down the reasoning process explain each

play01:38

step and arrive at the final answer and

play01:41

review the thought process so double

play01:43

check the reasoning for errors and gaps

play01:45

before finalizing your answer and I'll

play01:47

actually just copy this and I'll put

play01:49

this in the description too if you want

play01:51

to build your own GPT you could use mine

play01:53

as well here that is free to use so

play01:56

that's all I did to create this GPT so

play01:59

let's go ahead and use this one okay

play02:01

this is the test I'm going to run inside

play02:03

of my o1 clone that I just built I'm

play02:05

going to give a five questions related

play02:07

to IQ to see how it does with logic and

play02:09

reasoning and I'm going to give it five

play02:11

math questions because in their

play02:13

Benchmark that's where it really leaps

play02:15

ahead of any other model The Benchmark

play02:18

takes it sometimes from a 133% score to

play02:21

like a 85% score when it comes to math

play02:24

using that Chain of Thought prompting

play02:25

inside of the 01 model and here inside

play02:28

of this regular chat GPT will do the

play02:30

same thing so I'll just copy and paste

play02:32

the same questions let's start here with

play02:34

our IQ test and then we'll do some math

play02:37

tests too and I'm going to actually copy

play02:39

and paste each time the actual question

play02:43

what number is one qu of 1/10th of 1 of

play02:46

200 okay these models are typically

play02:49

really bad at answering these type of

play02:51

questions any math related questions

play02:53

even counting how many words are in an

play02:56

answer they typically can't do that so

play02:58

let's take this one and I'll give it to

play03:00

my GPT clone here and we'll also paste

play03:03

it over here okay let's go to our clone

play03:06

so this is the answer from the GPT the

play03:09

answer is one which is C let's go to 01

play03:14

preview answer so same answer let's see

play03:17

what the actual answer was okay the

play03:19

answer is one so tied so far let's go to

play03:22

the next question three of the following

play03:24

numbers add up to 27 and it's spelled

play03:27

out 27 and these are the numbers that

play03:29

has to choose from to get it to add up

play03:31

to 27 so again I'm going to take this

play03:34

one this is true or false let's ask my

play03:36

GPT clone first let's see what answer we

play03:39

get and again it's doing step by step

play03:41

based on that system prompt I gave it so

play03:44

this is going to be different than

play03:46

regular chat GPT I use GPT 40 for some

play03:49

of these and it just didn't think like

play03:51

this out loud usually and I wasn't

play03:54

getting the same responses so if I was

play03:56

to compare it against GPT 40 the GPT one

play04:00

model is going to win but I want to see

play04:02

if it's also going to beat my custom GPT

play04:04

that has that Chain of Thought prompting

play04:07

so here it ran through all these

play04:09

different combination and the answer is

play04:11

false let's see what we get out of o1

play04:14

thinking let me see if I could actually

play04:16

see how he's thinking through it let me

play04:18

open this up identifying

play04:21

combinations identifying some gaps okay

play04:24

pretty quickly it's doing the exact same

play04:27

thing it's calculating all the different

play04:28

combinations you had 20 the answer is

play04:31

false let me see if I had 20 in the last

play04:33

one oh I didn't number it but it looks

play04:36

about 20 or so okay let's see what the

play04:39

actual answer is it says it's false yep

play04:41

you got that one right too let's go to

play04:43

the next one okay this is sequential

play04:45

reasoning this is an interesting one so

play04:47

let's take this one it's telling us

play04:49

here's a bunch of numbers what number

play04:51

comes next I like these okay here is our

play04:54

step by step in our custom GPT I'm

play04:56

always showing the Clone first just so

play04:58

it's consistent here

play05:00

let's go to the bottom and

play05:02

43 and inside of 01 we also got 43 this

play05:07

actually give us a lot more of its work

play05:10

it's showing us a lot more work here in

play05:13

9 seconds versus the other one let's see

play05:15

what the actual answer is okay C okay

play05:19

you got that one right too let's go to

play05:20

the next one okay I like this one this

play05:22

one has no numbers let's try this one it

play05:25

says which one of these five is least

play05:28

like the other four so he analyzed every

play05:31

single option all five here common

play05:33

traits and the conclusion is Dolphin

play05:36

okay this time 01 took 17 seconds so

play05:39

here's are all the options it's actually

play05:41

digging a lot deeper into some of these

play05:44

than the other model did let's see what

play05:46

we got on the bottom analysis conclusion

play05:48

dolphin so same response and dolphin Oh

play05:53

dolphin is not the answer so they both

play05:55

got a wrong so I think that's three out

play05:58

of four they made their first first

play05:59

mistake here on this one okay Turtle I

play06:02

guess a turtle breathes air and has four

play06:05

legs and I guess well yeah that's pretty

play06:08

obvious if I look at these answers that

play06:10

turtle is different than these other

play06:13

four so it did get that one wrong okay

play06:15

I'll just do one more like it and this

play06:17

will be our last IQ one we'll switch to

play06:20

math if you rearrange the letters of

play06:22

this word right here you would have the

play06:24

name of one of these right here so this

play06:27

is a good one actually for this t test

play06:29

let's try that okay so the GPT right

play06:32

away says the word would be Earth and

play06:35

that is a planet let's try our other

play06:38

model oh this one actually did a lot

play06:40

more so Earth was one heart hater it

play06:44

came up with three words okay and then

play06:46

it says heart or Earth heart actually

play06:48

wasn't one of the options here so Earth

play06:49

is the one is probably going to pick

play06:51

planets let's go back here planets and

play06:55

that is correct so four out of five and

play07:00

no winner here right their exact tie

play07:02

here they answered everything correctly

play07:03

in the IQ test except the one they got

play07:06

wrong they both got a wrong in the same

play07:08

exact way so it's a tie right now

play07:12

between my custom GPT with my own set of

play07:14

instructions versus the new model 01

play07:16

which is using the Chain of Thought

play07:18

prompting no difference let's go to our

play07:21

math test this is where the 01 model

play07:24

claims to really Excel and beat any

play07:27

previous model and the GPT model mod is

play07:30

considered a previous model because it's

play07:31

not powered by the new model okay here

play07:34

I'm going to take these this is just

play07:36

from a different website and I think

play07:38

this is called the 15 hardest SAT Math

play07:40

questions and I'm going to take the

play07:42

questions and the multiple choice

play07:44

exactly as they appear I'm not going to

play07:46

change anything so if these questions

play07:48

are formatted incorrectly well they both

play07:50

have to work from the same starting

play07:52

point okay the GPT is answering pretty

play07:55

quickly it went to the step-by-step

play07:56

analysis which is again based on that

play07:58

system prompt always the first thing is

play08:00

going to do determining the relationship

play08:03

conclusion and two only so B is the

play08:07

answer it came up with okay interesting

play08:09

our 01 preview got a different answer

play08:12

one and two only so let's see what we

play08:15

come up with here let's go to the answer

play08:18

section and the final answer is D so let

play08:21

me go back okay D so the 01 preview got

play08:26

the right answer and my clone did not

play08:28

have the right answer so there is an

play08:30

extra point when it comes to math right

play08:31

away looks like the new model one even

play08:34

though if I look through the process we

play08:37

got step-by-step analysis here let me go

play08:40

to this one let's see what it did

play08:42

differently okay it went through every

play08:44

single statement and he kind of came to

play08:46

a conclusion and he decided if it's true

play08:49

or false and then that's how he came up

play08:51

with this answer right here okay let me

play08:53

copy this next one over here okay our

play08:55

GPT gave us an answer B which is number

play08:58

three right here and it came up with

play09:01

this pretty quick it took about six s

play09:03

seconds here to get this answer let's go

play09:05

to 01 okay 01 is still thinking it's

play09:08

been a little while and okay so the

play09:11

thought process is a little bit more in

play09:13

depth if you look underneath the hood

play09:16

here to see what it's doing it's

play09:17

definitely giving us a lot more detail

play09:20

here behind the scenes but it's taking

play09:23

quite a while to get an answer and it

play09:25

looks like it's hitting some problems so

play09:27

switching the approach them breaking

play09:30

down reworking revisiting analyzing

play09:33

rearranging he doing quite a bit over

play09:35

here and still nothing wow it's still

play09:39

going and he thinks the answer is a -16

play09:43

and let me just scroll up to show you

play09:45

the amount of work he did behind the

play09:46

scenes here well I guess the answer

play09:48

started here but this is all the stuff

play09:50

that it was going through to come up

play09:52

with that answer the actual answer is B

play09:55

so my custom GPT that is not supposed to

play09:58

be very good at math got it right and

play10:00

the O model that's supposed to win at

play10:03

math by 70 percentage points over the

play10:06

previous models got it wrong oh wait a

play10:09

minute I missed something so it says b

play10:12

was the answer which is correct but it

play10:14

says b equals 3 and I went back on the

play10:17

test right here b equals -3 so C

play10:21

actually equals three so it did actually

play10:24

get it wrong I missed that because B was

play10:27

the correct answer but B should have

play10:28

been three so I don't know maybe it

play10:31

guess half a point here it doesn't quite

play10:32

get it completely right but I guess if

play10:35

it was a multiple choice in some kind of

play10:37

sat you would have picked B and you

play10:38

would have got it right but uh you

play10:40

missed a minus sign right here okay so

play10:43

the next One D the value cannot be

play10:46

determined this is inside of our GPT

play10:49

clone and this one D the value cannot be

play10:52

determined let's go to the actual answer

play10:54

okay so D is not the answer a is the

play10:58

answer which is 2 12 wow they both got a

play11:02

wrong in this case okay here's the word

play11:05

problem here and the answer is 60 let's

play11:08

give it to clone okay our clone says it

play11:11

cannot come up with an answer it doesn't

play11:13

have enough information to come up with

play11:15

an answer based on that question and

play11:18

here's kind of the thinking process and

play11:21

this one it just took a few seconds here

play11:22

to respond hinting at missing info okay

play11:26

this one also thinks there is a lack of

play11:28

information the problem cannot be solved

play11:31

so let me see what we missed here well I

play11:33

took this exactly as it

play11:35

was and this one did come up with an

play11:38

answer the final answer is 60 okay so I

play11:41

guess they both failed there as well

play11:44

okay I'll just do one extra one here

play11:46

since we didn't get a conclusive answer

play11:47

there I'll take this one okay this one

play11:50

my GPT clone says the answer is

play11:53

2.25 let's go to 01 the answer is 2.25

play11:57

and let's go to the actual question

play11:59

question the answer is

play12:01

2.25 okay so as you could see when it

play12:03

comes to the IQ test exactly the same

play12:07

score right and the math test 01 one I

play12:10

think just by one question right

play12:12

everything you got wrong the other one

play12:13

got wrong so the Chain of Thought

play12:15

prompting technique that I added to the

play12:17

system instructions seemed to actually

play12:20

work well enough to get it to get very

play12:23

close to 01 it's definitely not a five

play12:25

or 6X Improvement in math and logic like

play12:29

I saw in the benchmarks again not super

play12:32

scientific I'm just doing 10 different

play12:33

questions here in those two categories

play12:36

so if you want to do your own test let

play12:38

me know in the comment section what you

play12:39

get but from my first impression to now

play12:44

I definitely don't think the model this

play12:46

01 model it's in preview though but it's

play12:49

definitely not the giant leap that I

play12:51

first thought when I tested it with a

play12:53

few coding and math questions now that

play12:55

I'm testing a bit deeper and before

play12:57

recording the the video I ran it through

play12:58

a bunch test and when they were getting

play13:01

the equal results between my custom GPT

play13:04

and the1 model I decided to make this

play13:06

video and do a real test in real time as

play13:10

I'm recording the video to see what it

play13:11

comes up with and not show my previous

play13:13

results and I mean I would just call it

play13:17

a tie honestly at this point so let me

play13:19

know what you find out if you want to do

play13:21

these kind of tests on your own we have

play13:23

very limited credit inside of the 01

play13:25

preview and the mini the 01 mini I

play13:28

didn't think was going to be Fair

play13:29

compression so I wanted to test their

play13:30

best model available right now let me

play13:32

know what you come up with and what your

play13:34

thoughts are and I'll see you in the

play13:36

next video

Rate This

5.0 / 5 (0 votes)

Ähnliche Tags
AI TestingGPT ModelsChain of ThoughtLogic AnalysisMath ChallengesIQ TestAI ComparisonCustom GPTProblem SolvingAI Benchmark
Benötigen Sie eine Zusammenfassung auf Englisch?