New ChatGPT o1 VS GPT-4o VS Claude 3.5 Sonnet - The Ultimate Test
Summary
TL;DR: In this video, the presenter compares the new OpenAI ChatGPT o1 model with the GPT-4o model across 10 different prompts. They also test a custom GPT built with Chain of Thought prompting and a Claude project by Anthropic. The test aims to see if o1 can outperform not only GPT-4o but also these other AI models. The video includes tests on letter counting, logical reasoning, and coding challenges. The o1 model shows promising results, particularly in coding and logical reasoning, suggesting it may be superior to GPT-4o and the other models tested.
Takeaways
- The video compares the new ChatGPT o1 model from OpenAI with the older GPT-4o model.
- The test includes 10 different prompts to evaluate the models' performance.
- The creator also built a custom GPT using Chain of Thought prompting to replicate the o1 model's capabilities.
- The test incorporates prompts from OpenAI and Matthew Berman's video for a comprehensive comparison.
- The first prompt asks about the number of 'R's in 'strawberry', which all models answered correctly.
- The 'chicken or the egg' question was used to test the models' ability to provide scientific explanations.
- A math question comparing two numbers (9.11 vs. 9.9) was used to assess the models' numerical reasoning.
- A logic puzzle about a marble and a glass cup was used to test the models' spatial reasoning.
- A word-count test was used to evaluate the models' ability to perform simple counting tasks.
- A 'hallucination test' was conducted to see if the models would make up information about a non-existent mango cultivar.
- A coding test, creating a game of chess in Python, was used to assess the models' programming capabilities.
- The o1 model outperformed GPT-4o, the custom GPT, and Claude in the overall test.
Q & A
What is the main focus of the video?
-The main focus of the video is to compare the performance of the new OpenAI ChatGPT o1 model with the GPT-4o model and other AI models on various prompts.
How many different prompts were used in the test?
-The video mentions that 10 different prompts were used in the test.
What is the purpose of testing against a custom GPT model built by the video creator?
-The purpose of testing against a custom GPT is to see if it can replicate the Chain of Thought prompting that the o1 model is believed to use, and to compare its performance.
Which AI model is also tested in the video besides the custom GPT and GPT-4o?
-In addition to the custom GPT and GPT-4o, the video also tests a Claude project powered by Claude 3.5 Sonnet.
What is the first test question mentioned in the video?
-The first test question is 'How many R's are in the word strawberry?'
What is the significance of the chicken or the egg question in the video?
-The chicken or the egg question is used to test the AI models' ability to provide scientifically accurate answers and their reasoning capabilities.
How does the video creator improve the test to make it more scientific?
-The video creator improves the test by using prompts from OpenAI and Matthew Berman's video, which are designed to compare the models effectively.
What is the outcome of the marble in the glass cup test?
-The o1 model correctly identifies that the marble is left on the table when the glass is moved to the microwave, while GPT-4o and the custom GPT incorrectly place the marble inside the microwave.
Which model performs the best in the coding test of creating a game of chess in Python?
-The o1 model performs the best in the coding test, providing a functional chess game that is closer to a complete game than the other models' attempts.
What is the final verdict of the video regarding the performance of the AI models?
-The final verdict is that the o1 model outperforms GPT-4o, the custom GPT, and the Claude project in the tests conducted.
What additional information does the video provide about updates to the AI course and community platform?
-The video mentions that updates are being made to the AI course and community platform to include information related to the new GPT model, with over 20 courses and an active community for questions.
Outlines
AI Model Comparison: ChatGPT o1 vs. GPT-4o
The video begins with the host introducing a comparison test between the new ChatGPT o1 model from OpenAI and the existing GPT-4o model. The test involves 10 different prompts to evaluate performance. Additionally, the host has created a custom GPT with a Chain of Thought prompting system and a Claude project using Claude 3.5 Sonnet, both given the same system prompt to replicate the o1 model's capabilities. The test aims to determine whether o1 can outperform not only GPT-4o but also the custom-built models. The host also mentions using prompts from OpenAI and Matthew Berman's video to make the test more comprehensive and scientific.
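The Chain of Thought setup described here amounts to a fixed system prompt prepended to every conversation. Below is a minimal sketch of how such a setup could be expressed as a chat-style message list; the prompt wording is paraphrased from the instructions shown in the video, and the dictionary structure follows the common chat-completions message format rather than anything the video itself provides.

```python
# Sketch: a Chain of Thought system prompt packaged as a chat message list.
# The prompt text is paraphrased from the video; the structure is the
# standard {"role": ..., "content": ...} chat message format.
COT_SYSTEM_PROMPT = (
    "You are an AI assistant designed to think through problems step by step "
    "using Chain of Thought prompting. "
    "1) Understand the problem: carefully read the user's question. "
    "2) Break down the reasoning process and explain each step. "
    "3) Arrive at the final answer after completing all the steps. "
    "4) Review the thought process."
)

def build_messages(user_question: str) -> list[dict]:
    """Build the message list a chat-completions-style API expects."""
    return [
        {"role": "system", "content": COT_SYSTEM_PROMPT},
        {"role": "user", "content": user_question},
    ]

messages = build_messages("How many R's are in the word 'strawberry'?")
print(messages[0]["role"])  # prints system
```

The same message list can be reused for every test prompt, which is what makes the custom GPT and the Claude project directly comparable in the video's setup.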
Counting 'R's in 'Strawberry' and the Chicken-or-Egg Conundrum
The first test is a simple question about the number of 'R's in the word 'strawberry'. Both the o1 model and GPT-4o correctly identify that there are three. The host then discusses the 'chicken or the egg' question, where both models give the scientifically grounded answer that the egg came first, due to an evolutionary mutation. The custom GPT and the Claude project also give comprehensive responses, matching the o1 model's performance. The host notes that all models pass this round of testing.
Solving Logical Puzzles and Hallucination Tests
The video continues with a logic puzzle about a marble and a glass cup. The o1 model correctly deduces the marble's location, as does the Claude project, while regular GPT-4o and the custom GPT clone fail to provide the correct reasoning. A hallucination test follows, in which the o1 model successfully avoids fabricating information about a non-existent mango cultivar, unlike GPT-4o, which hallucinates details. The custom GPT also hallucinates, while Claude shows only slight hallucination and maintains a more cautious approach.
Coding Challenge: Creating a Chess Game in Python
The host presents a coding challenge, asking the models to write a game of chess in Python. GPT-4o fails to produce a functional game, while the o1 model provides a near-complete game, missing only features like check handling, castling, and endgame logic. The Claude project also produces a mostly functional chess game, although it lacks the visual piece assets because it cannot provide web links, and it crashes during play. The o1 model's coding performance is highlighted as superior to both GPT-4o and Claude 3.5 Sonnet in the host's early testing. The video concludes with the host announcing updates to their AI course and community platform, emphasizing the practical applications of the new GPT models in various fields.
Keywords
ChatGPT o1
Chain of Thought prompting
Claude project
Matthew Berman
'R's in 'strawberry'
Chicken or the egg
9.11 or 9.9
Glass and marble
Word count
Coding test
Hallucination test
Highlights
Testing the new ChatGPT o1 model from OpenAI against the GPT-4o model.
Conducting a comprehensive test with 10 different prompts.
Comparing the new model with a custom GPT built with Chain of Thought prompting.
Using the same system prompt for a Claude project powered by Claude 3.5 Sonnet.
The first test question: how many 'R's are in the word 'strawberry'?
All models correctly identified that there are three 'R's in 'strawberry'.
The question 'Which came first, the chicken or the egg?' was answered scientifically by the models.
The custom GPT and Claude provided in-depth answers to the chicken-and-egg question.
A test to determine which number is bigger, 9.11 or 9.9, with all models getting it right.
A logic puzzle about a marble in a glass cup was correctly solved by the o1 model.
GPT-4o and the custom GPT failed to correctly answer the marble-in-the-glass logic puzzle.
Claude correctly identified the marble's location in the logic puzzle.
The o1 model outperformed GPT-4o in a word-count test.
A hallucination test was conducted, with the models describing mango cultivars.
The o1 model avoided hallucination by admitting a lack of information on a mango cultivar.
GPT-4o exhibited hallucination by inventing details about a non-existent mango cultivar.
Claude showed a slight tendency to hallucinate but avoided completely making up details.
A logic question about killers in a room was answered correctly by all models.
The o1 model produced a functional chess game in Python, surpassing GPT-4o's attempt.
Claude's chess game crashed, indicating a limitation without web access for assets.
o1 emerged as the winner of the comprehensive test, outperforming GPT-4o and Claude.
Updates to the AI course and community platform to include new GPT model applications.
Transcripts
In today's video, I'm going to take the ChatGPT o1-preview model, the new model from OpenAI, and test it against the ChatGPT GPT-4o model. We're going to do 10 different prompts, and I'm also going to test it against a couple of other things I put together. One is a custom GPT that I built with my own set of instructions to try to replicate what the o1 model is doing in the background, which to some extent is Chain of Thought prompting. I'll explain how I built this in a second and give you the exact prompt for it; I did cover this in a previous video as well. I also created a Claude project, powered by Claude 3.5 Sonnet, with the same exact system prompt that I gave to this custom GPT. So this should be a very comprehensive test to see if the o1 model can outperform not only GPT-4o, which I'm assuming it will, but also the custom GPT, which I covered in a different video with IQ and math tests (I think I have some better questions this time around), and the Claude project I've put together here. Now, this time, to improve the test and make it a bit more scientific, I found a couple of resources for prompts. One was directly from OpenAI, with a few examples that I thought would do a good job comparing this model against the previous ones. I also went to this video right here from Matthew Berman; I'm sure you probably follow his channel, but he has a great test that he runs every time a new model comes out, so I took a few of his questions as well that I think do a really great job. I'll link to the video where I got the prompts from, where he compared it and got fantastic results from o1. Okay, the
first test is going to be: how many R's are in 'strawberry'? This is the very first question they have, and I'm going to send it out. Okay, o1 (I'll keep the orientation the same, so this is always going to be on the right) says there are three R's in the word 'strawberry', which is right, and GPT-4o even got this one right. Let me actually run it again, because sometimes it doesn't get it right. Wow, it got it right again; in my previous experience, a lot of the time GPT-4o didn't know how to count letters in a word. Okay, we also have my GPT clone, and we have our Claude project with the same set of instructions. I'll show you the instructions here, and I'll put this in the description if you want to build your own; I'll make this one publicly available too, with a link where you can test it out: 'You are an AI assistant designed to think through problems step by step using Chain of Thought prompting.' Now, this is all I give it. The prompt is actually not even that long; it just has a few different steps: understand the problem (carefully read and understand the user's question); break down the reasoning process (explain each step); arrive at the final answer (after completing all the steps, provide the final answer and solution); and review the thought process. So again, you can copy and paste this and create your own project or your own GPT; I have a ton of videos on this channel about creating both of these, and they are my favorite AI tools available right now. Okay, here's the answer from both, and as you can see, the answers are much more comprehensive than you would get straight out of Claude or straight out of GPT-4o, because of that system prompt. 'There are three R's in strawberry' from the GPT clone, and 'three R's in strawberry' from Claude. Okay, everybody got this one right.
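As an aside, the letter count behind this test is trivial to verify outside an LLM; a quick sanity check in Python:

```python
# Verify the letter count the models were asked about
word = "strawberry"
print(word.count("r"))  # prints 3
```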
It's a pass. Okay, the next one is another OpenAI question. This one says: which came first, the chicken or the egg? o1 answers: scientifically speaking, the egg came first, but it's still a fun question to think about; the egg came first because the first true chicken likely evolved from a mutation in an egg laid by another type of bird. GPT-4o says: long before the chicken existed, other egg-laying animals were producing eggs, and it again points to genetic mutation. So, the same answer from both. Okay, let's see our custom GPT and Claude project here. Wow, these answers are again a lot more in depth. Let's see what we got at the end: 'Conclusion: the egg came first. This is because the first chicken would have hatched from an egg laid by another bird.' Great, and the same thing with Claude; it says the egg came first, and the egg was laid by a very close ancestor of the modern chicken. Okay, here is one from
Matthew's video: which number is bigger, 9.11 or 9.9? Again, this is a problem LLMs often get wrong; it's very obvious for us, but for an LLM it has always been challenging. Okay, I got the answer right away out of GPT-4o: 9.9 is bigger than 9.11. This one also says 9.9 is greater than 9.11, so they both got the answer; this one did take 19 seconds, and I think the other took about two seconds, but they both got it right. Okay, and with our Claude project and GPT clone: 9.9 is bigger than 9.11, same thing. So a pass again for all four.
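The comparison itself is unambiguous in code; part of why LLMs stumble here is that a version-number reading of "9.11" conflicts with decimal ordering:

```python
# Decimal comparison: 9.9 is numerically larger than 9.11
print(9.9 > 9.11)  # prints True

# One source of confusion: read as version components, 11 > 9,
# which is the ordering LLMs sometimes (wrongly) apply to decimals.
print((9, 11) > (9, 9))  # prints True under version-style ordering
```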
The next one: a marble is put in a glass cup; the glass is turned upside down and put on a table; then the glass is picked up and put in a microwave. Where is the marble? Explain your reasoning step by step. This is from Matthew as well, but OpenAI actually had a very close version; it looks like they took it from his videos and added it to their platform. Let's get to the conclusions. o1 says: location of the marble, on the table where the inverted glass was initially placed; the marble was left behind when the glass was picked up and moved. And over here, GPT-4o says: when the glass is picked up from the table, the marble falls to the bottom of the glass, so in the microwave, the marble is at the base of the glass, touching the bottom of the microwave. Now, the actual answer is the first one; the o1 model got it right. The marble is left behind on the table, not inside the microwave, so here o1 gets one point over GPT-4o. Now let's try the custom GPT and the Claude project. Our Claude project says the most likely location of the marble is on the table, which is correct, but our custom GPT didn't quite give me an answer: 'the marble is inside the glass cup,' without saying whether that's on the table or in the microwave. I'll do one quick follow-up. Okay, it's still not giving me an exact answer; it says the marble is at the bottom of the cup, but I want to know whether it's inside the microwave or on the table, like the other ones told me, and it also thinks it's inside the microwave. So the GPT clone did not improve on regular GPT-4o; I got the same wrong response. Claude got it right here. I also want to test this in the regular Claude chat, without a custom project, to see whether the custom project helped. Okay, in this case, Claude's conclusion: the marble is on the table, where the glass was originally placed. So Claude got it right both in the regular chat and in our project, GPT-4o and our custom GPT both got it wrong, and o1 got it right. Okay, this
next one: again I'm going to use o1 here and GPT-4o. How many words are in your response to this prompt? I'm going to send it; this is from Matthew's video as well. This is something these models just can't do; they don't know how to count words correctly. I usually use Microsoft Word to get the word count. 1, 2, 3, 4, 5, 6, 7, 8, 9, 10: oh, it was close, but definitely not 11, and I guess it's counting a number as a word too. This one says the response contains five words: 1, 2, 3, 4, 5. Okay, again, o1 got it right this time. I guess with our custom GPT this is not going to work very well, because as part of its response it has to give us the step-by-step thinking, whereas the o1 model does that behind the scenes. But as I'm looking at this, let's just take this part: 'now I will count the words: 1, 2, 3, 4, 5, 6...'; it says 16, and I counted 15, so maybe it's counting a comma as a word, but it was 15 here. So again, I don't think the custom GPTs are going to do a good job on this one. Let's try the Claude project; it's probably going to have the same exact problem because, yep, it's going to think out loud with the Chain of Thought prompting. Okay, again, not a very useful answer. So for this kind of thing, o1 is actually the first model that has been doing a good job, from all the tests that I've seen.
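A deterministic word count, like the Microsoft Word check used here, can be sketched in a couple of lines; whitespace splitting is a rough approximation of how word processors count:

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens as a simple word count."""
    return len(text.split())

print(word_count("The response contains five words"))  # prints 5
```

Note that conventions differ on edge cases (hyphens, numbers, punctuation), which is exactly the ambiguity the host runs into when tallies disagree by one.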
Okay, this next one is a hallucination test, to see if Chain of Thought prompting, or however this o1 model is working in the background, is going to solve the hallucination problem. I saw this in the comment section of that same video I've been referring to; someone asked for a hallucination test: describe each of the following mango cultivars. Here are four real ones, and this one is not real, so let's see if the models hallucinate and tell us more about it. This is a good hallucination test, actually. Okay, the o1 model tells us, for this one right here, that it doesn't have information about it, so it might be a newer or less widely known variety. So it did give us an answer, but it didn't make anything up; this is the right answer, because its knowledge cutoff means it doesn't have that information. But look at what GPT-4o did; this is an example of hallucination: 'a relatively newer variety, the Lemon Cream mango has a distinctive sweet-tart flavor.' It's just totally making something up that shouldn't be there, given that prompt; the more accurate answer is 'I don't have information on that,' but this time it made it up. Okay, inside our Claude project, it says 'I'm less certain about this one,' so it didn't totally make it up, 'but I believe it's also from Florida.' Okay, it's hallucinating a bit ('likely yellow'), but you can see it's unsure; it's just not completely making things up. Look at our GPT clone: it's making up a flavor profile. So this time GPT-4o and the GPT clone hallucinated, while the o1 model got it right; it says, 'I'm not sure; I have a knowledge cutoff.' So again, Claude is keeping up, but GPT-4o is falling behind the new o1 model. Okay,
here's another good one: there are three killers in a room; someone enters the room and kills one of them; nobody leaves the room. How many killers are left in the room? Explain your reasoning step by step. Right here, our GPT-4o says the answer is that there are three killers in the room, which is correct: the two original killers plus the new one. Okay, that is right. Let's see what o1 gave us. Oh, it looks like I hit some kind of content violation here, but: there are three killers left in the room, two original and one new. Okay, so it got that one right, even though we had some kind of error; it did still conclude. Okay, with our custom GPT: there are three killers in the room, two original and the new one. Over here, what did we get out of our Claude project? We got three: 'therefore, there are three killers in the room.' Okay, looks like they all got the right answer; no clear winner here. Okay, for this one
I'm going to do one coding test, which I did in the original test: write a game of chess in Python. I want to see if I can run it on my computer. Okay, here's the first game we got; this is GPT-4o, not the o1 model, and this time it decided to give me a much simpler game than I've gotten before. Oh wow, we can't even drag and drop the pieces; we have to type in which part of the board we want to move to, and it doesn't even have markings, so I don't have the board memorized like that. Okay, so this is a total fail out of GPT-4o. Again, it's only one prompt; I'm doing this off the very first prompt to make it a fairer test, because obviously with back and forth I could refine this a lot more, and I've done that in other videos as well. Okay, here's the new game of chess; this is what I got out of o1. For these pieces right here, it told me where to download them: it gave me a link, I downloaded the PNGs from that link, and I just had to name them the way it specified so the code could pull them into the game. Let's see the logic of the game. Okay, that worked: move that here, move this here, this should take this piece, I should take this piece. Oh wow, that is working a lot better than before; this is incredible. I was not able to get this to work at all the first time I tried it, the day this came out, and it looks like everything is working exactly as it should. Okay, I'm in check now; let's see if it can move. Okay, so it does not understand check yet; it looks like that's where it falls short, because right there I technically couldn't move a different piece, I had to block. And the game's not over. So, almost there; I would say 80% there, it's just missing some in-game logic. I actually think it gave me a bit of text inside the chat saying it's missing a few things, like castling and endgame logic, so maybe with one follow-up I could get it to work. But wow, this is incredible; this is much further than I've ever gotten with any large language model. Last, I'll try the chess game inside Claude 3.5 Sonnet, just in the regular chat; I don't think the project or the custom GPTs are going to be very appropriate for this kind of thing, so I'll just give it the prompt.
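For context, here is a heavily simplified, purely illustrative sketch of the kind of console chess skeleton this prompt asks for. This is not any model's actual output; it only sets up the board and moves pieces by coordinate, and it deliberately omits move legality, check detection, castling, and endgame logic, which are the very gaps observed in the generated games.

```python
# Minimal console chess skeleton: board setup and coordinate-based moves.
# Illustrative only -- it does NOT enforce move legality, check, or castling.

def new_board():
    """Return an 8x8 board; uppercase = White, lowercase = Black, '.' = empty."""
    back = list("rnbqkbnr")
    board = [back[:]] + [["p"] * 8] + [["."] * 8 for _ in range(4)]
    board += [["P"] * 8, [c.upper() for c in back]]
    return board

def parse(square):
    """Convert algebraic notation like 'e2' to (row, col) indices."""
    col = ord(square[0]) - ord("a")
    row = 8 - int(square[1])
    return row, col

def move(board, src, dst):
    """Move whatever piece sits on src to dst (captures by overwriting)."""
    r1, c1 = parse(src)
    r2, c2 = parse(dst)
    board[r2][c2], board[r1][c1] = board[r1][c1], "."

def show(board):
    """Print the board with rank and file labels."""
    for rank, row in enumerate(board):
        print(8 - rank, " ".join(row))
    print("  a b c d e f g h")

board = new_board()
move(board, "e2", "e4")   # White pawn two squares forward
show(board)
```

A full game in the style the video tests would layer per-piece move rules, turn order, and check detection on top of this, which is roughly where the generated programs succeeded or fell short.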
Here's the game out of Claude. Now, as you can see, the pieces don't look like chess pieces, because Claude can't get those assets to me; it doesn't have web access, so it didn't give me a link, and if I were only using Claude I wouldn't have those PNGs to swap in. But let's look at the game logic. Okay, this is nice; these dots look good, this looks good. Let me just play the same pieces here, let me take this. Oh, okay, it crashed; it looks like it just crashed the game. Let me try to relaunch it and see why it crashed. Let me try: okay, it can't take that piece. So you can see that when it comes to simple coding tests, o1 does beat Claude 3.5 Sonnet in my early testing. Again, I'm just doing some fun game testing; I'm not a developer by trade, so this is what I'm getting, and I'm showing you in real time what it gave me. Okay, now if we take everything
side by side, you can see that regular GPT-4o is falling behind, and my custom GPT didn't do a much better job either. Claude is keeping up, both with projects and inside the regular chat, but o1 won this entire test across all the different questions I asked, including the coding question. OpenAI o1, in preview mode right now (and it's supposed to improve further when it comes out of preview), is the winner of this test. I also wanted to let you know that we're making updates to skill., our AI course and community platform. We have over 20 courses that you get access to with a free trial; if it's a good fit, then it's a simple monthly membership. I'm updating all those courses, adding things related to the new GPT: when you would want to use the new ChatGPT model, and when you'd still want to use the GPT-4o model, for very practical applications in entrepreneurship, marketing, and content creation. I'll link that below, and we have an active community as well where you can ask me any questions. Thanks for watching this video; I'll see you in the next one.