I was Wrong About ChatGPT's New o1 Model
Summary
TLDRIn this video, the creator tests the new GPT-1 model's capabilities by comparing it with a custom GPT model using Chain of Thought prompting. They conduct an IQ and math test, aiming to evaluate the model's logical reasoning and mathematical prowess. The custom GPT, despite not being specialized for math, performs surprisingly well, closely matching the GPT-1's results. The video suggests that while GPT-1 shows improvement, it isn't the significant leap in performance that was initially anticipated, leading to a tie between the two models in the tests conducted.
Takeaways
- 🔍 The video compares the new GPT-1 model's performance with a custom GPT model using the Chain of Thought prompting technique.
- 🆕 The GPT-1 model is claimed to excel in logic and reasoning tasks, particularly in math, due to its fine-tuning for step-by-step problem-solving.
- 📝 The video creator built a custom GPT model with specific instructions to mimic the Chain of Thought prompting, making it publicly available for others to use.
- ⚖️ A series of IQ and math questions were used to test and compare the performance of the GPT-1 model against the custom GPT model.
- 🤖 Both models were presented with the same questions to ensure a fair comparison, with the video showcasing their step-by-step thought processes.
- 📉 The custom GPT model, despite not being specialized for math, performed surprisingly well, coming close to the GPT-1 model's performance.
- 📊 The video revealed that the GPT-1 model did not show a significant leap in performance over the custom model in the math and logic tests conducted.
- 🔗 The video description includes a link to the custom GPT model for viewers to try it out and compare the models themselves.
- 🤔 The video creator expresses initial skepticism about the GPT-1 model's advertised improvements, suggesting it may not be as groundbreaking as first impressions suggested.
- ⏱️ The GPT-1 model took longer to process some questions, indicating a more in-depth analysis but not always leading to correct answers.
Q & A
What is the main focus of the video?
-The main focus of the video is to compare the performance of the new GPT-3 model (referred to as '01 preview') with a custom GPT model using Chain of Thought prompting on IQ and math problems.
What is the Chain of Thought prompting technique?
-The Chain of Thought prompting technique is a method where the AI is instructed to think step-by-step, understand the problem, break down the reasoning process, explain each step, and review the thought process for errors before providing an answer.
How does the video creator plan to test the AI models?
-The video creator plans to test the AI models by giving them five IQ-related questions to assess logic and reasoning, and five math questions to evaluate their performance in problem-solving, as math is where the new model claims to excel.
What is the purpose of creating a custom GPT model in the video?
-The purpose of creating a custom GPT model is to replicate the Chain of Thought prompting technique and to compare its performance with the new GPT-3 model, providing a baseline for comparison.
How does the video creator ensure a fair comparison between the models?
-The video creator ensures a fair comparison by using the same set of questions for both the custom GPT model and the new GPT-3 model, and by presenting the questions in the same format to both models.
What was the outcome of the IQ test in the video?
-The outcome of the IQ test was a tie between the custom GPT model and the new GPT-3 model, as both made the same mistake on one question and answered the rest correctly.
What was the performance of the new GPT-3 model on math questions according to the video?
-The new GPT-3 model performed well on math questions but not as exceptionally as the benchmarks suggested, with the video creator concluding that it was not a significant improvement over the custom GPT model with Chain of Thought prompting.
What was the video creator's initial impression of the new GPT-3 model?
-The video creator's initial impression of the new GPT-3 model was that it might be a significant improvement over previous models, especially in math and logic, but after deeper testing, they found it to be not as groundbreaking as initially thought.
What is the video creator's conclusion about the new GPT-3 model after the tests?
-The video creator's conclusion is that the new GPT-3 model, while performing well, does not show a giant leap in performance over the custom GPT model with Chain of Thought prompting, and they would call it a tie in their tests.
How does the video creator plan to share the custom GPT model?
-The video creator plans to make the custom GPT model publicly available and will provide a link to it in the video description for viewers to use and test.
Outlines
🤖 Testing GPT Models for IQ and Math
The script discusses a comparison between the new GPT model and a custom GPT model created by the author. The author initially had high hopes for the new model but found that it had limitations. To test the models, the author switched between different accounts and reset usage limits. The custom GPT was created with specific instructions for step-by-step problem-solving, which the new model also claims to use. The author plans to make this custom GPT publicly available and demonstrates its creation process. A series of IQ and math questions are then posed to both models to evaluate their logical reasoning and mathematical abilities. The results show that both models perform similarly, with the new model not showing a significant advantage in the tests conducted.
📊 Analyzing Model Performance on SAT Math Questions
This paragraph delves into the performance of the custom GPT model and the new model on a set of challenging SAT math questions. The author presents the questions and the step-by-step thought processes of both models as they attempt to solve them. The custom GPT model, despite not being specialized for math, manages to provide correct answers in some cases, while the new model, which is claimed to excel in math, makes mistakes. The author notes that the new model's detailed thought process is more in-depth but also slower. The results from the math test show a close competition between the two models, with the new model only slightly ahead by one correct answer out of the questions tested.
🔍 Final Thoughts on Model Comparison and Future Testing
In the concluding paragraph, the author reflects on the initial impressions of the new GPT model and the results of the in-depth testing. The author had initially thought the new model would be a significant improvement over previous versions, especially in math and logic, but the tests did not show a substantial leap in performance. The author admits that the test is not scientific and is based on a limited number of questions. The author invites viewers to conduct their own tests and share their findings, indicating a willingness to continue exploring and comparing the capabilities of different AI models.
Mindmap
Keywords
💡GPT
💡Chain of Thought prompting
💡Custom GPT
💡IQ test
💡Math test
💡Benchmark
💡Logic and reasoning
💡Fine-tuning
💡Model comparison
💡SAT Math questions
Highlights
The new model inside Chat GPT1's preview may not be as impressive as initially thought.
The testing was done across different accounts due to limits being reset after high usage.
A custom GPT model was created and will be made publicly available.
Custom GPTs are mini versions of Chat GPT that allow for personalized instructions.
The model was fine-tuned using Chain of Thought prompting to think step-by-step and correct mistakes.
The custom GPT and the new model were tested on IQ and math questions to evaluate logic and reasoning.
Both models struggled with math-related questions, which are typically challenging for AI.
The custom GPT and the new model provided the same answer for the first math question.
The new model's Chain of Thought prompting technique was compared to the custom GPT's performance.
Both models correctly identified false in a true/false question about number combinations.
The new model provided a more detailed thought process for sequential reasoning questions.
The custom GPT and the new model both made the same mistake on a question about identifying the least similar option.
In the IQ test, both models performed equally, with one incorrect answer each.
The new model claimed to excel in math, but the custom GPT model with Chain of Thought prompting performed similarly.
The new model took longer to process math questions but did not consistently outperform the custom GPT.
The test results showed no significant difference between the custom GPT and the new model in math and logic.
The video concludes that the new model is not a giant leap in performance as initially perceived.
Transcripts
the new model inside of chat gpt1
preview may not be as good as I thought
when I first tested it I had more time
to test it I switched between a couple
of different accounts they actually
reset how many times you could use it
they reset the limit because a lot of
people ran out very quickly so I did
some more testing and in this video I'm
going to dive much deeper so what I did
is I'm going to test this 01 preview in
this window and in this other window I
try to replicate what that oan preview
is doing in the background with a custom
GPT so this gp01 clone I'll make this
publicly available I'll link it in the
description below this video and I'll
show you how I made it so all I did was
I created this custom GPT by the way if
you never used custom gpts before
there're a mini version of chat GPT
where you could give it your own set of
instructions so that's what I did here
and you could upload files and things
like that I have ton of videos about
building custom gpts on this channel but
let me show you exactly what I did with
this one because technically the model
that they just released works like this
in the background they kind of
fine-tuned a GPT model it looks like to
do exactly this you are an AI assistant
designed to Think Through problems step
by step using Chain of Thought prompting
so this is the prompting technique they
officially said in their documentation
this is the prompting technique that
they used to get the model to think step
by step and try to correct its own
mistakes before providing any answers
you must understand the problem break
down the reasoning process explain each
step and arrive at the final answer and
review the thought process so double
check the reasoning for errors and gaps
before finalizing your answer and I'll
actually just copy this and I'll put
this in the description too if you want
to build your own GPT you could use mine
as well here that is free to use so
that's all I did to create this GPT so
let's go ahead and use this one okay
this is the test I'm going to run inside
of my o1 clone that I just built I'm
going to give a five questions related
to IQ to see how it does with logic and
reasoning and I'm going to give it five
math questions because in their
Benchmark that's where it really leaps
ahead of any other model The Benchmark
takes it sometimes from a 133% score to
like a 85% score when it comes to math
using that Chain of Thought prompting
inside of the 01 model and here inside
of this regular chat GPT will do the
same thing so I'll just copy and paste
the same questions let's start here with
our IQ test and then we'll do some math
tests too and I'm going to actually copy
and paste each time the actual question
what number is one qu of 1/10th of 1 of
200 okay these models are typically
really bad at answering these type of
questions any math related questions
even counting how many words are in an
answer they typically can't do that so
let's take this one and I'll give it to
my GPT clone here and we'll also paste
it over here okay let's go to our clone
so this is the answer from the GPT the
answer is one which is C let's go to 01
preview answer so same answer let's see
what the actual answer was okay the
answer is one so tied so far let's go to
the next question three of the following
numbers add up to 27 and it's spelled
out 27 and these are the numbers that
has to choose from to get it to add up
to 27 so again I'm going to take this
one this is true or false let's ask my
GPT clone first let's see what answer we
get and again it's doing step by step
based on that system prompt I gave it so
this is going to be different than
regular chat GPT I use GPT 40 for some
of these and it just didn't think like
this out loud usually and I wasn't
getting the same responses so if I was
to compare it against GPT 40 the GPT one
model is going to win but I want to see
if it's also going to beat my custom GPT
that has that Chain of Thought prompting
so here it ran through all these
different combination and the answer is
false let's see what we get out of o1
thinking let me see if I could actually
see how he's thinking through it let me
open this up identifying
combinations identifying some gaps okay
pretty quickly it's doing the exact same
thing it's calculating all the different
combinations you had 20 the answer is
false let me see if I had 20 in the last
one oh I didn't number it but it looks
about 20 or so okay let's see what the
actual answer is it says it's false yep
you got that one right too let's go to
the next one okay this is sequential
reasoning this is an interesting one so
let's take this one it's telling us
here's a bunch of numbers what number
comes next I like these okay here is our
step by step in our custom GPT I'm
always showing the Clone first just so
it's consistent here
let's go to the bottom and
43 and inside of 01 we also got 43 this
actually give us a lot more of its work
it's showing us a lot more work here in
9 seconds versus the other one let's see
what the actual answer is okay C okay
you got that one right too let's go to
the next one okay I like this one this
one has no numbers let's try this one it
says which one of these five is least
like the other four so he analyzed every
single option all five here common
traits and the conclusion is Dolphin
okay this time 01 took 17 seconds so
here's are all the options it's actually
digging a lot deeper into some of these
than the other model did let's see what
we got on the bottom analysis conclusion
dolphin so same response and dolphin Oh
dolphin is not the answer so they both
got a wrong so I think that's three out
of four they made their first first
mistake here on this one okay Turtle I
guess a turtle breathes air and has four
legs and I guess well yeah that's pretty
obvious if I look at these answers that
turtle is different than these other
four so it did get that one wrong okay
I'll just do one more like it and this
will be our last IQ one we'll switch to
math if you rearrange the letters of
this word right here you would have the
name of one of these right here so this
is a good one actually for this t test
let's try that okay so the GPT right
away says the word would be Earth and
that is a planet let's try our other
model oh this one actually did a lot
more so Earth was one heart hater it
came up with three words okay and then
it says heart or Earth heart actually
wasn't one of the options here so Earth
is the one is probably going to pick
planets let's go back here planets and
that is correct so four out of five and
no winner here right their exact tie
here they answered everything correctly
in the IQ test except the one they got
wrong they both got a wrong in the same
exact way so it's a tie right now
between my custom GPT with my own set of
instructions versus the new model 01
which is using the Chain of Thought
prompting no difference let's go to our
math test this is where the 01 model
claims to really Excel and beat any
previous model and the GPT model mod is
considered a previous model because it's
not powered by the new model okay here
I'm going to take these this is just
from a different website and I think
this is called the 15 hardest SAT Math
questions and I'm going to take the
questions and the multiple choice
exactly as they appear I'm not going to
change anything so if these questions
are formatted incorrectly well they both
have to work from the same starting
point okay the GPT is answering pretty
quickly it went to the step-by-step
analysis which is again based on that
system prompt always the first thing is
going to do determining the relationship
conclusion and two only so B is the
answer it came up with okay interesting
our 01 preview got a different answer
one and two only so let's see what we
come up with here let's go to the answer
section and the final answer is D so let
me go back okay D so the 01 preview got
the right answer and my clone did not
have the right answer so there is an
extra point when it comes to math right
away looks like the new model one even
though if I look through the process we
got step-by-step analysis here let me go
to this one let's see what it did
differently okay it went through every
single statement and he kind of came to
a conclusion and he decided if it's true
or false and then that's how he came up
with this answer right here okay let me
copy this next one over here okay our
GPT gave us an answer B which is number
three right here and it came up with
this pretty quick it took about six s
seconds here to get this answer let's go
to 01 okay 01 is still thinking it's
been a little while and okay so the
thought process is a little bit more in
depth if you look underneath the hood
here to see what it's doing it's
definitely giving us a lot more detail
here behind the scenes but it's taking
quite a while to get an answer and it
looks like it's hitting some problems so
switching the approach them breaking
down reworking revisiting analyzing
rearranging he doing quite a bit over
here and still nothing wow it's still
going and he thinks the answer is a -16
and let me just scroll up to show you
the amount of work he did behind the
scenes here well I guess the answer
started here but this is all the stuff
that it was going through to come up
with that answer the actual answer is B
so my custom GPT that is not supposed to
be very good at math got it right and
the O model that's supposed to win at
math by 70 percentage points over the
previous models got it wrong oh wait a
minute I missed something so it says b
was the answer which is correct but it
says b equals 3 and I went back on the
test right here b equals -3 so C
actually equals three so it did actually
get it wrong I missed that because B was
the correct answer but B should have
been three so I don't know maybe it
guess half a point here it doesn't quite
get it completely right but I guess if
it was a multiple choice in some kind of
sat you would have picked B and you
would have got it right but uh you
missed a minus sign right here okay so
the next One D the value cannot be
determined this is inside of our GPT
clone and this one D the value cannot be
determined let's go to the actual answer
okay so D is not the answer a is the
answer which is 2 12 wow they both got a
wrong in this case okay here's the word
problem here and the answer is 60 let's
give it to clone okay our clone says it
cannot come up with an answer it doesn't
have enough information to come up with
an answer based on that question and
here's kind of the thinking process and
this one it just took a few seconds here
to respond hinting at missing info okay
this one also thinks there is a lack of
information the problem cannot be solved
so let me see what we missed here well I
took this exactly as it
was and this one did come up with an
answer the final answer is 60 okay so I
guess they both failed there as well
okay I'll just do one extra one here
since we didn't get a conclusive answer
there I'll take this one okay this one
my GPT clone says the answer is
2.25 let's go to 01 the answer is 2.25
and let's go to the actual question
question the answer is
2.25 okay so as you could see when it
comes to the IQ test exactly the same
score right and the math test 01 one I
think just by one question right
everything you got wrong the other one
got wrong so the Chain of Thought
prompting technique that I added to the
system instructions seemed to actually
work well enough to get it to get very
close to 01 it's definitely not a five
or 6X Improvement in math and logic like
I saw in the benchmarks again not super
scientific I'm just doing 10 different
questions here in those two categories
so if you want to do your own test let
me know in the comment section what you
get but from my first impression to now
I definitely don't think the model this
01 model it's in preview though but it's
definitely not the giant leap that I
first thought when I tested it with a
few coding and math questions now that
I'm testing a bit deeper and before
recording the the video I ran it through
a bunch test and when they were getting
the equal results between my custom GPT
and the1 model I decided to make this
video and do a real test in real time as
I'm recording the video to see what it
comes up with and not show my previous
results and I mean I would just call it
a tie honestly at this point so let me
know what you find out if you want to do
these kind of tests on your own we have
very limited credit inside of the 01
preview and the mini the 01 mini I
didn't think was going to be Fair
compression so I wanted to test their
best model available right now let me
know what you come up with and what your
thoughts are and I'll see you in the
next video
Weitere ähnliche Videos ansehen
New ChatGPT o1 VS GPT-4o VS Claude 3.5 Sonnet - The Ultimate Test
Claude 3 Opus contro ChatGPT 4: chi è il migliore?
How To Use GPT-4o (GPT4o Tutorial) Complete Guide With Tips and Tricks
GPT-4o VS Claude 3.5 Sonnet - Which AI is #1?
ChatGPT Plus X Claude PRO: QUAL VALE ASSINAR?
Riassunto di tutti gli annunci di OpenAI: GPT4o e non solo!
5.0 / 5 (0 votes)