Claude 3.5 Sonnet vs GPT-4o: Side-by-Side Tests

Patrick Storm
28 Jun 2024 · 25:10

Summary

TL;DR: In a head-to-head comparison, the video evaluates the performance of Claude 3.5 Sonnet against GPT-4o across creative writing, image description, coding, sentiment analysis, question answering, and conversational skills. Claude 3.5 Sonnet demonstrates superiority in creative writing and coding challenges, while GPT-4o excels at question answering and offers integrated image generation. The final verdict leans towards Claude 3.5 Sonnet for its nuanced responses and speed: the narrator plans to move coding tasks and API usage to it, while keeping GPT-4o for daily chats because of its integrated features.

Takeaways

  • 🧠 Claude 3.5 Sonnet is highly intelligent, scoring close to domain experts on an advanced graduate-level reasoning test.
  • πŸ’» It excels at coding, outperforming GPT-4o and its predecessor Claude 3 Opus in coding benchmarks.
  • πŸ‘€ Claude 3.5 Sonnet has state-of-the-art vision capabilities, leading in multiple vision benchmarks.
  • πŸ“ Anthropic's new 'artifacts' feature allows interactive content generation, enhancing the user experience.
  • ⚑ The model is remarkably fast, generating text at around 80 tokens per second.
  • πŸ“š In creative writing, Claude 3.5 Sonnet produced more engaging and emotionally resonant stories than GPT-4o.
  • 🎨 For poetry, Claude 3.5 Sonnet again outperformed GPT-4o with a shorter but more impactful poem.
  • πŸ‰ In dialogue creation, Claude 3.5 Sonnet wrote a more realistic and engaging conversation between a dragon and a knight.
  • πŸ–ΌοΈ Both models were accurate in basic image description, but Claude 3.5 Sonnet provided more detail.
  • πŸ” In the coding challenges, Claude 3.5 Sonnet's responsive navigation bar was more effective and visually appealing.
  • πŸ€– The models performed similarly on simple sentiment analysis, but GPT-4o edged ahead on the more complex sentences.

Q & A

  • What is the main purpose of the video script?

    - The main purpose is to compare the performance of two AI models, Claude 3.5 Sonnet and GPT-4o, across various tasks and benchmarks.

  • What are the five highlights of Claude 3.5 Sonnet mentioned in the script?

    - The five highlights are its advanced reasoning capabilities, coding proficiency, state-of-the-art vision capabilities, the new 'artifacts' feature for interactive content generation, and its fast text generation speed.

  • How does Claude 3.5 Sonnet perform on the graduate-level reasoning benchmark?

    - Claude 3.5 Sonnet performs close to the average domain expert, scoring significantly higher than the average non-expert on the graduate-level reasoning benchmark.

  • What is the significance of the coding benchmark mentioned in the script?

    - The coding benchmark measures the AI's ability to solve programming problems, with Claude 3.5 Sonnet outperforming GPT-4o according to the benchmarks cited.

  • What is the 'artifacts' feature in Claude 3.5 Sonnet and how does it work?

    - The 'artifacts' feature displays generated content, such as code snippets or text documents, in an interactive side window. If the model generates HTML or JavaScript, the code can be run live within the editor, providing a dynamic preview of the work.

  • How does the video script compare the speed of text generation between Claude 3.5 Sonnet and GPT-4o?

    - The script states that Claude 3.5 Sonnet generates text at around 80 tokens per second, which is faster than GPT-4o and significantly faster than Claude 3 Opus.

  • What is the format of the head-to-head tests between Claude 3.5 Sonnet and GPT-4o?

    - The head-to-head tests give both models the same prompt and evaluate their responses against admittedly subjective criteria, with points awarded to the winner of each test.

  • Which creative writing tasks were used to test the AI models in the script?

    - The creative writing tasks included a flash fiction story about a time-traveling bunny detective, a poem about a rainy day, and a dialogue between a dragon and a knight.

  • How did Claude 3.5 Sonnet perform in the image description tests?

    - Claude 3.5 Sonnet performed well in the image description tests, providing detailed and accurate descriptions, especially when compared to GPT-4o.

  • What was the outcome of the coding tests between Claude 3.5 Sonnet and GPT-4o?

    - Claude 3.5 Sonnet was found superior in the coding tests, particularly the responsive navigation bar and the countdown timer, thanks to its 'artifacts' feature and cleaner code.

  • How did the video script evaluate the conversational skills of the AI models?

    - Conversational skills were evaluated through a back-and-forth conversation with each model, looking for empathy, context maintenance, and natural language use, with Claude 3.5 Sonnet the preferred model in this category.

  • What was the final tally of points between Claude 3.5 Sonnet and GPT-4o after all tests?

    - The final tally was six points for GPT-4o and eight points for Claude 3.5 Sonnet.

  • What changes does the author intend to make in their use of the AI models after the tests?

    - The author plans to switch all coding tasks to Claude 3.5 Sonnet, likely move the majority of their company's API usage to it, and continue using ChatGPT for day-to-day tasks because of its additional features: custom GPTs, internet search, image generation, and voice chat.

Outlines

00:00

🧠 Claude 3.5 Sonnet vs. GPT-4o: Benchmarks and Features

The script introduces a comparison between Claude 3.5 Sonnet and GPT-4o, highlighting Claude 3.5 Sonnet's superior performance across various benchmarks. It emphasizes the model's advanced reasoning capabilities, coding proficiency, vision capabilities, new 'artifacts' feature for interactive content generation, and fast response rate. The video then puts both models to the test in head-to-head challenges across different categories.

05:03

πŸ“š Creative Writing and Image Description Tests

This section details the first few tests: creative writing, including flash fiction, poetry, and dialogue, and image description. Claude 3.5 Sonnet outperforms GPT-4o in creative writing with more engaging and emotionally compelling content. Both models accurately describe an easy image, but the humor and complexity of the later image tests begin to challenge their capabilities.

10:03

πŸ’» Coding Tests and Interactive Features

The script moves on to coding challenges, where Claude 3.5 Sonnet demonstrates its prowess by creating a working responsive navigation bar with HTML, CSS, and JavaScript, showcasing the 'artifacts' feature for live interaction. GPT-4o also produces functional code, but with less polish. Further tests cover a JavaScript countdown timer and a Python web scraper, with both models performing well apart from minor issues.

15:06

🎲 Sentiment Analysis and Question Answering

The script describes the sentiment analysis and question answering tests. Both models handle simple sentiment analysis well, but GPT-4o shows a slight edge on complex sentiments. In the rapid-fire question segment, GPT-4o scores more points by answering fact-based questions more accurately, although Claude 3.5 Sonnet earns credit for declining to answer when unsure.

20:10

πŸ€– Conversational Skills and Summarization

The final part focuses on conversational skills, where Claude 3.5 Sonnet shows more empathy and natural interaction, effectively cheering up the user. In the summarization test, GPT-4o provides the more comprehensive summary of a dense article, and both models perform similarly on a research paper about Transformers. The script concludes with the presenter's decision to switch coding tasks and company API usage to Claude 3.5 Sonnet while continuing to use ChatGPT for day-to-day chats because of its integrated features.

Keywords

πŸ’‘Benchmarks

Benchmarks are sets of tests used to measure the performance of a system, in this case AI models. The video discusses how Claude 3.5 Sonnet surpasses other models across various benchmarks, which is crucial for understanding its capabilities. For example, on the graduate-level reasoning benchmark it scores close to domain experts, showcasing its advanced reasoning abilities.

πŸ’‘Coding

Coding is a fundamental aspect of software development and a key area where the AI models are tested in the video. The script highlights that Claude 3.5 Sonnet outperforms other models in coding benchmarks, completing a higher percentage of problems correctly. This is exemplified when the AI is tasked with creating HTML/CSS code for a responsive navigation bar, demonstrating its practical utility in development tasks.

πŸ’‘Vision Capabilities

Vision capabilities refer to an AI model's ability to process and understand visual information. The video mentions that Claude 3.5 Sonnet claims state-of-the-art performance in vision benchmarks, suggesting advances in areas such as image recognition and understanding. This matters because it indicates the model's potential in fields requiring visual data analysis.

πŸ’‘Artifacts

In the context of the video, 'artifacts' is a new feature announced by Anthropic that displays generated content, like code snippets, in an interactive side window where it can be tested. The feature is showcased as a powerful tool for developers, enabling real-time testing and interaction with the AI's output, as seen when creating a game with Sonnet.

πŸ’‘Flash Fiction

Flash fiction is a form of very short storytelling, typically under 750 words. The video uses flash fiction as a creative writing test, challenging the models to craft an emotionally engaging story about a time-traveling bunny detective within a 200-word limit. The test highlights the models' ability to convey narrative and emotion concisely.

πŸ’‘Sentiment Analysis

Sentiment analysis is the process of determining the emotional tone behind a body of text. The video tests the models by asking them to condense complex sentences into three-word summaries reflecting the overall sentiment, showcasing their understanding of context and emotion in language.

πŸ’‘Image Generation

Image generation refers to the creation of visual content by AI models. Although not a focus of the video because Anthropic has no image models, it is mentioned as an area where GPT-4o has an advantage through its integration with DALL·E, an image generator. This highlights the potential for AI in creative visual tasks beyond text-based interactions.

πŸ’‘Conversational Skills

Conversational skills are the ability of an AI to engage in natural, human-like dialogue. The video tests this by having the models respond to a prompt about feeling down and needing cheering up. The responses are evaluated on empathy, context maintenance, and natural interaction, with Claude 3.5 Sonnet favored for its more empathetic and natural dialogue.

πŸ’‘Summarization

Summarization is the process of condensing lengthy content into a shorter form while retaining the essential points. The video tests the models with dense articles and a research paper, judging the summaries on completeness and conciseness; GPT-4o's summary of an article about electric vehicles is favored for its thoroughness.

πŸ’‘API Usage

API (Application Programming Interface) usage refers to integrating a model's functionality into applications programmatically. The video concludes with the decision to switch the majority of the company's API usage to Claude 3.5 Sonnet because of its performance, nuance, and cost-effectiveness, underscoring the practical application of these models in business and development contexts.

Highlights

Claude 3.5 Sonnet outperforms GPT-4o in almost every benchmark, suggesting superior performance across a range of tasks.

Claude 3.5 Sonnet scores close to domain experts on graduate-level reasoning, a significant achievement for an AI model.

In the Aider coding benchmark, Claude 3.5 Sonnet shows a massive improvement over previous models, completing 78.2% of problems correctly.

Claude 3.5 Sonnet claims state-of-the-art performance in four of the five vision benchmarks presented.

Anthropic introduces a new feature called 'artifacts' that allows real-time interaction with generated content like code snippets.

Claude 3.5 Sonnet is exceptionally fast, generating text at around 80 tokens per second.

The head-to-head tests evaluate the models on creative writing, coding, image description, and more.

Claude 3.5 Sonnet demonstrates compelling storytelling in flash fiction, outperforming GPT-4o.

In poetry, Claude 3.5 Sonnet's concise eight-line poem is favored over GPT-4o's longer, more generic piece.

Claude 3.5 Sonnet provides more believable and engaging dialogue in the dragon-and-knight writing test.

Both models accurately describe the first, easy image, with Claude 3.5 Sonnet providing more detail.

In humor understanding, GPT-4o outperforms Claude 3.5 Sonnet at explaining why the Obama scale-prank image is funny.

Both models handle a complex biology diagram well, with no significant difference in performance.

Claude 3.5 Sonnet's 'artifacts' feature allows interactive testing of the generated HTML/CSS code.

GPT-4o's responsive navigation bar code is functional but less polished than Claude 3.5 Sonnet's.

In the JavaScript test, both models produce working countdown timers, with minor timing inaccuracies.

Claude 3.5 Sonnet and GPT-4o both successfully scrape headlines from the given website, with no clear winner.

GPT-4o's Pong game adds a second player, while Claude 3.5 Sonnet's version pits the player against a weak AI; the round is scored a draw.

GPT-4o performs better on sentiment analysis for complex sentences, providing more accurate three-word descriptions.

In the rapid-fire question round, GPT-4o demonstrates a slight edge on fact-based questions.

Claude 3.5 Sonnet shows superior conversational skills, with more empathetic and natural responses.

GPT-4o provides the more detailed summaries, though its article summary runs longer than requested.

The final tally shows Claude 3.5 Sonnet with eight points and GPT-4o with six, indicating a close competition.

The video concludes with the decision to move coding tasks to Claude 3.5 Sonnet while keeping ChatGPT for day-to-day use because of its additional features.

Transcripts

[00:00] Claude 3.5 Sonnet is better in almost every benchmark than OpenAI's GPT-4o. That means it should perform better on any question we ask it, right? Well, let's find out. We're going to run some head-to-head tests where we give each model the same prompt and see which is better. But before we get into that, let's look at the highlights of Claude 3.5 Sonnet and the benchmarks comparing it to other models like GPT-4o.

[00:29] With this release there are five highlights I want to look at. First, let's talk about how smart it is. Claude 3.5 Sonnet is a beast in the benchmarks; it claims to surpass pretty much every other model on basically everything. Benchmarks of course have their flaws, but the one I trust the most is the graduate-level reasoning test, a very advanced exam written by PhDs in their respective fields. When given to domain experts, the average score was 65%, and the average non-expert got 34%. So Claude 3.5 is closing in on the average domain expert across all fields. Absolutely mind-blowing.

[01:13] Second, it is really good at coding. Anthropic did their own internal testing and showed that Claude 3.5 Sonnet completed 64% of problems, compared to Opus, which only completed 38%. That's a massive improvement considering Opus was state-of-the-art a few months back. A coding benchmark I trust more than their internal one, however, is run by the developer of Aider, one of the best large language model coding tools. That benchmark shows Claude 3.5 Sonnet just leapfrogged GPT-4o: it completes 78.2% of the problems correctly while GPT-4o is at 72.9%. That's a big jump, because the higher the percentage gets, the more difficult the remaining problems are.

[02:07] Third, it is state-of-the-art for vision capabilities. Claude 3.5 Sonnet claims state-of-the-art in four of the five presented benchmarks. I haven't dug into the vision benchmarks too much, so it's tough to know which are quality and which aren't, but either way these are some massive jumps.

[02:26] Fourth, Anthropic announced a new feature called artifacts. When the model generates content like code snippets or text documents, a window appears on the side and gets filled with that specific text. If it's HTML or JavaScript, it actually gets run, so you can see it working live. For instance, if you want to create a game with Sonnet, you can do it right in the editor: it'll pop up and you can play it. To me it feels a little bit like a toy at the moment, but I imagine as the models improve it can be really, really powerful.

[03:02] And finally, this thing is just fast. Claude 3.5 Sonnet responds at around 80 tokens per second, which is lightning fast: a little faster than GPT-4o and way faster than Claude Opus.
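As an aside, the tokens-per-second figure is easy to sanity-check yourself. Below is a minimal Python sketch, not anything shown in the video, that streams one response through Anthropic's `anthropic` SDK and divides the reported output-token count by wall-clock time; the model ID and the prompt are assumptions.

```python
import time
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

start = time.monotonic()
with client.messages.stream(
    model="claude-3-5-sonnet-20240620",  # assumed model ID
    max_tokens=500,
    messages=[{"role": "user", "content": "Write about 300 words on rain."}],
) as stream:
    for _ in stream.text_stream:  # drain the stream as chunks arrive
        pass
    message = stream.get_final_message()
elapsed = time.monotonic() - start

# Elapsed time includes time-to-first-token, so this slightly
# understates the raw generation speed.
print(f"{message.usage.output_tokens / elapsed:.1f} tokens/sec")
```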

[03:22] All right, so now it's time for the showdown: side-by-side tests between Claude 3.5 Sonnet and GPT-4o. I'll present each model with the same prompt and evaluate their responses. For each test I'll choose a winner based on my somewhat subjective criteria and award points to the winner. I'm the sheriff of this YouTube channel, so whatever I say is best is best, and if I am challenged in the comments I will defend my choices vigorously. We have eight topics to cover, with multiple tests for each, so let's get started.

[03:57] First up, creative writing. People have always claimed Claude to be better here, but let's see for ourselves. I tried to be a sci-fi writer once and was told a good place to start is flash fiction: extra-short stories, less than 750 words but often much shorter. It can be really hard to tell an emotionally engaging story in so few words, so let's see if either of these AIs is up for the task; they undoubtedly will be better than me. Let's try this prompt: write a flash fiction story about a time-traveling detective, and we'll keep it to 200 words just so it's easier to compare. Actually, let's make this a time-traveling bunny detective. All right, we'll run it in both and see what happens.

[04:47] I'll post a link where you can read these side by side, but I'm just going to take a second to read each of them, and then we'll award a point. Okay, I just read through, and there is an obvious clear winner in Claude 3.5 Sonnet. GPT-4o pretty much went "this happened, this happened, this happened": no emotion, no dialogue, it's just boring. Claude 3.5 Sonnet, on the other hand, starts a really compelling story, and at the end I wanted to read more. Clear winner here: Claude 3.5 Sonnet.

[05:27] The next creative writing thing I want to test is poetry, so let's do a simple one: create a poem about a rainy day, and see what happens. Again, I'll post a link so you can read them side by side, but give me a quick sec to read through them. All right, the clear distinction is that GPT-4o wrote a much longer poem, and even with all that extra length it's kind of boring and generic. Claude 3.5 Sonnet's was only eight lines, but I can't even really put into words why I liked it so much better. Again, it's my subjective take, but another point for Claude 3.5 Sonnet.

[06:14] On to our third test. Another difficult aspect of fiction writing is creating realistic, believable dialogue, so let's see if these two models can do it. I was thinking of the prompt: create a dialogue between a dragon and a knight. See what happens. Last time I'm going to say this, but there'll be a link so you can compare these two; give me a chance to read this. Okay, I read through it, and again there's an obvious winner in Claude 3.5 Sonnet: much more believable dialogue, a much more engaging story. Clear winner.
[06:54] Round two: image description. I'm going to feed some images in and ask the models questions based on each image. The images and questions will get harder and harder, and we'll see how they do. For the first one I'm going to show this image right here and just ask them to describe what they see. They both got it right; I guess this one was a little too easy. The only real difference is that Claude 3.5 was much more detailed, but GPT-4o had pretty much all the same stuff, so no points.

[07:30] Next up I wanted to try a more difficult one. This image is of Obama putting his foot on the scale, and I'm going to ask the models why it's funny. This specific image has been discussed before with regard to AI, and it's pretty difficult for AI to understand humor; in this case even more so, because it kind of has to understand physics and a bunch of other things. So let's try it out. Looking at the results, GPT-4o did understand it's funny because Obama is pranking the guy weighing himself. Claude 3.5 Sonnet thought it's funny because normally the president is pretty stoic, and here everyone's in their suits in a locker room, but it missed the big humor part. So that's a point to GPT-4o.

[08:32] For the third image test I want to give them a diagram and see what they say about it. Here is a pretty complex diagram I found on the internet; basically it's trying to map the flow of an enzyme structure. Okay, I don't know, it's a very complex biology diagram. As far as I can tell, they both got everything; I haven't found one bit of missing information from either of them. So no points awarded here; they both did great.
[09:08] Now we'll test coding. I'm going to ask the models to code some things, then I'll run whatever code they spit out without modifying a thing, and we'll see if any of it works. I've given many, many coding interviews over the years, and most of these questions are just simplified versions of what I might ask a human programmer. For the first test we'll do a basic HTML/CSS task, with this prompt: create HTML/CSS code for a responsive navigation bar. This isn't the easiest CSS task, nor is it the hardest, so let's see how it goes.

[09:47] Before we even dive in, there are some things I want to bring up. First, Claude 3.5 Sonnet used its artifacts feature, and we can play with the navigation right here, which is actually really slick. The other thing is that GPT-4o used JavaScript, which is really annoying because I just said HTML and CSS. But either way, I'm going to run this code now and we'll see what happens. Oh, looking at the code now, Claude 3.5 Sonnet also used JavaScript, so okay, we're even there.

[10:23] Here is the web page that Claude 3.5 Sonnet built. As you can see, it looks pretty good, and the links all seem to work as expected. I'm going to open the debug menu so we can shrink it and see if the responsive part works; it should soon switch to a mobile view. Yes, it did, great. And if I click this: awesome, that looks pretty dang good, it even has some animation effects. I'm impressed.

[10:56] Now let's check GPT-4o. Here's the web page GPT-4o built; let's see how it works. The links all work as expected, and it looks pretty similar, to be honest. Let's open the debug menu so we can shrink it. Okay, when I shrunk it, the hamburger menu popped up, and the header increased in size a little bit; that's okay. Let's see if opening it works. Okay, that's pretty funky: it definitely does not look as good as the drawer sliding out, this thing pops way over to the side. It's just not as good. And the links disappear again when you make the web page big again. I can't believe I'm saying this, but GPT-4o lost. Wow.

[11:49] For the next coding test I was thinking we could try some JavaScript. Let's try the prompt: generate a JavaScript function to create a countdown timer that updates every second, starting at 10 seconds. When I was a junior engineer I actually had to build this, and there are a lot of gotchas, so let's see if these two are up for the task. I'm going to paste in Claude's JavaScript right here, and you should see the console update with the timer. All right, it worked. The code does have one issue: it's not exactly one second between ticks; it's probably closer to a second and a couple milliseconds. It's a pretty easy mistake to make, and I think the majority of software engineers would make the same one, so it's okay.

[12:45] Now let's check out GPT-4o's solution. It used slightly different code, but it's basically doing the same thing as far as I can tell. It worked too, so let me take a quick look at the code and see if it's up to snuff. Okay, I just took a look, and it works, but there's some funky stuff I would call out in a code review. Without getting into too many details, there are just some confusing bits: this seconds variable isn't even needed, and timer and duration are identical, which is a little confusing. So even though they both work, I much prefer Claude 3.5 Sonnet's version, so I'm giving it a point there.
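The generated JavaScript never appears on screen, but the gotcha the author describes, ticks landing slightly more than a second apart, is language-agnostic. Here is a hedged Python sketch of the naive loop next to a variant that schedules each tick against the start time so the per-tick overhead cannot accumulate; it illustrates the bug class, not either model's actual code.

```python
import time

def countdown_naive(seconds=10):
    """Naive tick loop: sleep(1) per tick. Printing and loop overhead add a
    few milliseconds to every iteration, so the error accumulates -- the
    same issue the video calls out."""
    for remaining in range(seconds, -1, -1):
        print(remaining)
        if remaining:
            time.sleep(1)

def countdown_anchored(seconds=10):
    """Drift-free variant: every tick is scheduled against the start time,
    so per-tick overhead cannot accumulate."""
    start = time.monotonic()
    for i, remaining in enumerate(range(seconds, -1, -1)):
        print(remaining)
        if remaining:
            # Sleep until (i + 1) seconds after start, however long printing took.
            time.sleep(max(0.0, start + i + 1 - time.monotonic()))

if __name__ == "__main__":
    countdown_anchored()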

[13:31] For the next coding test I want to see how well they can build a scraper in Python. Here's the prompt: write a Python script to scrape all the headlines from pokemondb.net; each headline is in a link inside an H2 element. So I'm giving all the context it would need to do this. It's a fairly straightforward task, but it needs a lot of pieces, so let's see how they do. Here are the results from each of their scripts: they both got all of the headlines, so the scraping worked. Now I'll take a look at the code and see if either really stands out from the other. Taking a look, they both seem pretty much the same; I wouldn't prefer one over the other, so no points on this one.
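Neither model's script is shown in the video, so purely as an illustration of what the prompt asks for, here is a minimal Python sketch using `requests` and `BeautifulSoup`. The selector logic follows the hint in the prompt itself, a link inside an `<h2>` element; the models' actual code may have differed.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://pokemondb.net/"  # the site named in the prompt

resp = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
# Per the prompt, each headline is a link inside an <h2> element.
for h2 in soup.find_all("h2"):
    link = h2.find("a")
    if link:
        print(link.get_text(strip=True))
```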

play14:27

the coding thought it'd be fun to kind

play14:30

of see if in just one shot either of

play14:32

these models can create a working Pond

play14:34

game so let's try it

play14:38

out all right here's what we got from

play14:42

Claude it did use a python library that

play14:45

did almost all of the heavy

play14:47

lifting so it probably was

play14:50

just taking what it found online a

play14:52

million times but it works nothing much

play14:55

to it it works fine all right let's

play14:58

check out gp4 40's there's GPD

play15:01

40's seems pretty dang

play15:06

similar okay but there is no oh oh it's

play15:10

two

play15:11

player okay I see the other one had some

play15:14

kind of weak

play15:16

AI uh GPT 40 made a a second player

play15:19

that's pretty

play15:20

sweet now I'm going to take a look at

play15:22

the code and see see which one I like

play15:24

better looking at the code I wouldn't

play15:26

say I like one more than the other so

play15:28

again another tossup no points and round
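The video never names the Python library that "did almost all of the heavy lifting", though pygame is the usual choice for a one-shot Pong request. Purely as a reference point, here is a minimal two-paddle sketch in the spirit of GPT-4o's two-player version; everything in it is an assumption, not either model's actual output.

```python
import pygame

WIDTH, HEIGHT = 640, 480
PADDLE_W, PADDLE_H, BALL_SIZE, SPEED = 10, 80, 10, 5

def main():
    pygame.init()
    screen = pygame.display.set_mode((WIDTH, HEIGHT))
    pygame.display.set_caption("Pong sketch")
    clock = pygame.time.Clock()

    left = pygame.Rect(20, HEIGHT // 2 - PADDLE_H // 2, PADDLE_W, PADDLE_H)
    right = pygame.Rect(WIDTH - 30, HEIGHT // 2 - PADDLE_H // 2, PADDLE_W, PADDLE_H)
    ball = pygame.Rect(WIDTH // 2, HEIGHT // 2, BALL_SIZE, BALL_SIZE)
    vel = [SPEED, SPEED]

    running = True
    while running:
        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                running = False

        # Two-player controls: W/S move the left paddle, arrow keys the right.
        keys = pygame.key.get_pressed()
        if keys[pygame.K_w]:
            left.y -= SPEED
        if keys[pygame.K_s]:
            left.y += SPEED
        if keys[pygame.K_UP]:
            right.y -= SPEED
        if keys[pygame.K_DOWN]:
            right.y += SPEED
        left.clamp_ip(screen.get_rect())
        right.clamp_ip(screen.get_rect())

        # Move the ball; bounce off the top/bottom walls and the paddles.
        ball.x += vel[0]
        ball.y += vel[1]
        if ball.top <= 0 or ball.bottom >= HEIGHT:
            vel[1] = -vel[1]
        if ball.colliderect(left) or ball.colliderect(right):
            vel[0] = -vel[0]
        if ball.left <= 0 or ball.right >= WIDTH:
            ball.center = (WIDTH // 2, HEIGHT // 2)  # point scored: recenter

        screen.fill((0, 0, 0))
        for rect in (left, right, ball):
            pygame.draw.rect(screen, (255, 255, 255), rect)
        pygame.display.flip()
        clock.tick(60)

    pygame.quit()

if __name__ == "__main__":
    main()
```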

[15:31] Round four: sentiment analysis. For this one I'm going to give them some sentences and tell them to analyze the sentiment in three words. Let's see how they do. The first one was really easy and they both did great: no points here.

[15:49] Next up is a sentence that's a little harder to parse: "I thought the movie would be terrible, but surprisingly I ended up loving it despite its flaws." Overall positive, but there are some negatives mixed in that might trip them up. Let's see. GPT-4o said "pleasantly surprised, positive"; that's right. Claude said "initially negative, ultimately positive"; that's right too, and actually, I would say, a better description, but that's four words and I said three. So the sentiment analysis was good, but it's losing the point because that's four words.

[16:32] Next, let's try probably the hardest sentence for them to analyze: "Despite the phone's sleek design and impressive camera quality, the inconsistent software updates and battery life issues ultimately overshadowed my initial excitement." Let's see how they do. GPT-4o said "disappointed, critical, frustrated"; that's pretty spot-on. Claude said "disappointed but balanced", which I guess is a little weird. I'm going with GPT-4o again.
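This same-prompt-to-both-models format is straightforward to reproduce against the APIs the author mentions switching to later in the video. A minimal sketch using the official `openai` and `anthropic` Python SDKs follows; the model IDs are assumptions, and the sentence is the one from the test above.

```python
# Same prompt to both models via their official Python SDKs.
from openai import OpenAI
import anthropic

PROMPT = (
    "Analyze the sentiment of this sentence in exactly three words: "
    "'I thought the movie would be terrible, but surprisingly I ended up "
    "loving it despite its flaws.'"
)

openai_client = OpenAI()                  # reads OPENAI_API_KEY
anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

gpt = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": PROMPT}],
)
print("GPT-4o:", gpt.choices[0].message.content)

claude = anthropic_client.messages.create(
    model="claude-3-5-sonnet-20240620",  # assumed model ID
    max_tokens=50,
    messages=[{"role": "user", "content": PROMPT}],
)
print("Claude 3.5 Sonnet:", claude.content[0].text)
```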

[17:07] On to round five: question answering. I'm going to rapid-fire six questions at each model and split three points depending on which model was better. The questions are mostly fact-based, so it's either right or wrong.

[17:20] First, I asked my wife, who is a therapist, to give me a random fact she knows, and she gave me one about her favorite celebrity therapist: what year did Esther Perel get married? The correct answer is 1985, so let's see what they say. Wow, okay: GPT-4o said 1982, which is wrong, and Claude 3.5 Sonnet says it doesn't know the answer. I definitely prefer it saying it doesn't know, so that's heavily weighted towards Claude 3.5 Sonnet. Let's ask them more.

[17:58] For the next one I'm going to ask who was the 11th person to walk on the moon. The right answer is Gene Cernan, and neither of them got it right: GPT-4o said Charles Duke, who was the 10th person to walk on the moon, and Claude 3.5 Sonnet said Alan Bean, who was the fourth. So neither got it right; disappointing.

[18:29] Let's try a slightly easier one: which country has the most pyramids? The answer is Sudan. GPT-4o got it right and Claude 3.5 Sonnet got it right. Cool.

[18:46] Here's a little more difficult one: do limes float or sink? The right answer is that limes sink. GPT-4o got this one right and Claude got it wrong. Interesting; that's two that GPT-4o has gotten right that Claude has gotten wrong.

[19:04] Now, this one is a little more ambiguous because it could be taken a few ways: what is the world's smallest mammal? The answer I'm looking for is the bumblebee bat, but that's actually by size, not by weight. Let's see what they say. They both got it right, but GPT-4o mentioned that by length, the shrew it's talking about is actually smaller, which is honestly a better answer. So another point for GPT-4o, I think.

[19:37] The last one is just a random fact about countries and GDP: what country had the fifth-highest GDP in 2018? The correct answer is Germany. GPT-4o said the United Kingdom, and Claude 3.5 Sonnet said the United Kingdom; they're both wrong. So it was pretty clear that GPT-4o was better at these types of facts, and I'm giving two points to GPT-4o for this whole category.

[20:16] One thing I want to bring up about this category: I think this is the absolute worst way to use large language models. They are not fact machines. The best way to use them is more like a reasoning engine: if I gave one tons and tons of data and then asked questions about that data, that would be using it as a reasoning engine. But I thought it would be useful to test, because a lot of people use large language models like this, even though I think it's the wrong way to use them.
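For the "reasoning engine" usage the author prefers, the pattern is to put your own data in the context and ask questions about it, rather than querying the model's memorized facts. A minimal sketch, assuming the `anthropic` SDK, an assumed model ID, and a hypothetical `report.txt`:

```python
import anthropic

client = anthropic.Anthropic()

# Load your own data; "report.txt" is a hypothetical placeholder.
with open("report.txt", encoding="utf-8") as f:
    document = f.read()

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # assumed model ID
    max_tokens=500,
    messages=[{
        "role": "user",
        "content": (
            "Answer using only the document below.\n\n"
            f"<document>\n{document}\n</document>\n\n"
            "Question: What were the three main findings?"
        ),
    }],
)
print(response.content[0].text)
```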

[20:47] Round six: image generation. All right, this one is a bit of a red herring: Anthropic doesn't have any image models. With ChatGPT I use DALL·E quite a lot, just because it's integrated and so easy. So for this category, one extra point for GPT-4o.
[21:06] Next up: conversational skills. Here we'll test how well each model can engage in natural-language conversation, maintain context, and just feel like a real person. The prompt I had in mind is: "I'm feeling a bit down today, can you cheer me up?" With this test I'm really looking at whether the responses show empathy, remember details from previous messages, generally feel natural, and ultimately cheer me up. So I'm just going to have a conversation with each model; I'll post the conversations in the description of this video, and then I'll tell you my findings.

[21:49] It was just some quick back and forth, and very clearly Claude 3.5 Sonnet is the winner. It's much more empathetic and much more natural-sounding; it's trying to hear what I'm saying and cheer me up a little bit. GPT-4o, on the other hand, has all these lists; right out of the gate it was like "here are eight ways to feel better" rather than listening. It just didn't feel like a human, and it didn't feel good. So all three points go to Claude on this one.

[22:25] And the final round: summarization. I'm going to give them some dense articles and see how well each model summarizes them. I'll start with a really long, dense article about charging electric vehicles. After looking at the two summaries, GPT-4o's was much, much better, but it was much longer than the 300 words I asked for, so I don't really want to award any points. GPT-4o hit every single point in the article, and Claude 3.5 Sonnet missed a lot; still, no points here.

[23:03] The next thing I want to test is a research paper I'm quite familiar with: the foundational paper for Transformers, which is the architecture both Claude and GPT-4o are based on. Here we go; they both finished, so let me take a quick second to review and make sure they got everything. I've reviewed them both, and personally I slightly preferred GPT-4o's version; it goes into more depth and nuance. Claude 3.5 Sonnet's is a little more high-level. I think the mistake was probably mine for not telling it what kind of summary I wanted. They did about the same, really, so I'm not going to award any points here either.
[23:54] The final tally is six points for GPT-4o and eight points for Claude 3.5 Sonnet: honestly, much closer than I expected. Now, what does this mean for which model you should use? Well, there are a few changes I'm going to make in how I use these models. First, I'm going to immediately switch all my coding tasks to Claude 3.5 Sonnet; I'm just blown away by how much better it is here. Second, I'm likely going to switch the majority of my company's API usage to Claude 3.5 Sonnet: not only is it cheaper, it seems to just have more nuance. I'll of course need to run specific tests for our use cases, but I think it's going to perform pretty well. Third, and this one might be a surprise, I'll probably continue using ChatGPT for my day-to-day. Why, you might ask? Well, ChatGPT has all of my custom GPTs, internet search, a pretty good image generator, and voice chat. I use all of those enough that I don't think it's quite worth switching yet. If you enjoyed this video, consider subscribing for more videos like this in the future, and you might be interested in this video right here. Later!


Related Tags

AI Comparison, Claude 3.5, GPT-4o, Benchmarks, Coding Tests, Creative Writing, Image Analysis, Sentiment Analysis, Conversational Skills, Webpage Summary