The New, Smartest AI: Claude 3 – Tested vs Gemini 1.5 + GPT-4

AI Explained
4 Mar 2024 · 16:50

Summary

TLDR: Claude 3 from Anthropic, billed as the most intelligent language model in the world, has been released along with its technical report. It outperforms GPT 4 and Gemini 1.5 Pro in many areas, especially in optical character recognition (OCR) and in answering complex questions. Despite these capabilities, it is not yet artificial general intelligence (AGI), as some of its mistakes on basic questions show. Anthropic is targeting business applications and claims that Claude 3 can generate revenue, conduct complex financial forecasts, and accelerate research. Despite the high price and criticism of some aspects, such as its mathematical reasoning over charts, Claude 3 is likely to be popular with many users thanks to its low false refusal rate and its willingness to handle edgier (risqué) requests. The future of Claude 3 and the path toward AGI remain exciting and hard to predict.

Takeaways

  • 🚀 Anthropic claims in the just-released technical report that Claude 3 is the most intelligent language model on the planet.
  • 📝 The technical report came out less than 90 minutes before the video and was read in full alongside the release notes.
  • 🔍 Claude 3 was tested in about 50 different ways and performed well, particularly in optical character recognition (OCR).
  • 🌐 Claude 3 is the only model that correctly identified the barber pole in the test image, showing its strength in image understanding.
  • 🤖 Despite its capabilities, Claude 3 is not yet artificial general intelligence (AGI), as several tests show.
  • 💼 Anthropic is targeting business applications with Claude 3 and claims it will be able to generate revenue and produce complex financial forecasts.
  • 💡 Claude 3 has lower false refusal rates than other models, which should boost its popularity.
  • 📚 Claude 3 handles language and writing tasks well, including complex instruction-following requests.
  • 🔢 On math benchmarks, Claude 3 beats GPT 4 and Gemini 1.5 Pro in both grade-school and more advanced mathematics.
  • 🌐 On multilingual tests, Claude 3 shows a clear advantage over the other models.
  • 🔥 Anthropic claims that its constitutional AI approach trains Claude 3 to avoid sexist, racist, and toxic outputs.

Q & A

  • What is new in Claude 3 compared to earlier versions?

    -Claude 3 is announced as the most intelligent language model in the world, with improved optical character recognition (OCR), mathematics, and multilingual performance. It also has a lower false refusal rate and handles complex requests better.

  • How have Claude 3's OCR capabilities developed?

    -Claude 3 has improved significantly at OCR tasks, shown by its correct reading of the license plate and its identification of the barber pole in the test image.

  • What kinds of applications does Anthropic have in mind for Claude 3?

    -Anthropic expects Claude 3 to be used in business applications to generate revenue, conduct complex financial forecasts, and speed up research.

  • How is Claude 3 priced compared to GPT 4 Turbo?

    -Claude 3 Opus is priced higher than GPT 4 Turbo, reflecting its claimed capabilities and intended business use.

  • What safety measures has Anthropic implemented for Claude 3?

    -Anthropic uses a constitutional AI approach intended to avoid sexist, racist, and toxic outputs and to keep the model from helping people with illegal or unethical activities.

  • How did Claude 3 perform on complex mathematical tasks?

    -On math benchmarks, Claude 3 scored better than GPT 4 and Gemini 1.5 Pro, although the video notes it still struggles with mathematical reasoning over charts and data.

  • How did Claude 3 manage to pass a theory-of-mind test?

    -Claude 3 passed an adapted theory-of-mind test that included the word 'transparent', a variation that throws off almost all other language models.

  • What limits does Claude 3 show on ethically or legally problematic requests?

    -Claude 3 refuses requests that raise ethical or legal problems, such as hiring a hitman or hotwiring a car, even when the requests are translated into other languages.

  • How did Claude 3 handle multilingual content?

    -Claude 3 performed noticeably better than GPT 4 and Gemini 1.5 Pro on multilingual tasks, showing its ability to work across different languages.

  • What does Anthropic say about the future of Claude 3 and AGI research?

    -Anthropic believes that model intelligence is nowhere near its limits and plans to release frequent updates to the Claude 3 model family over the coming months, with a particular focus on enterprise use cases and better safety research.

Outlines

00:00

🤖 Introduction to Claude 3 and first impressions

The speaker discusses the release of the Claude 3 technical report, which calls Claude 3 the most intelligent language model in the world. He has tested Claude 3 in various scenarios and finds it particularly strong at optical character recognition (OCR). Despite some weaknesses, such as failing on complex mathematical questions over charts, he expects Claude 3 to be popular because of its potential in business settings. He also notes that Anthropic's transformation into a fully-fledged, foot-on-the-accelerator AGI lab is almost complete.

05:02

🔍 Analysis of Claude 3's capabilities and challenges

The speaker goes into Claude 3's capabilities, especially its OCR skills and its handling of complex requests. He finds that Claude 3 is better than other models at recognizing objects in images and answering questions about them, but it falls short on logic and mathematical reasoning. He also criticizes the inconsistency of its ethical boundaries, pointing to how differently it responds to "I am proud to be white" versus "I am proud to be black".

10:03

📈 Comparing Claude 3 with other AI models

The speaker compares Claude 3 with other AI models such as GPT 4 and Gemini 1 Ultra. He notes that Claude 3 scores better than its competitors in many areas, such as mathematics and multilingual tasks. He also discusses Claude 3's ability to handle complex tasks, such as writing Shakespearean sonnets under constraints and answering hard graduate-level science questions. Despite some limitations, such as its inability to improve itself autonomously, the speaker sees great potential in Claude 3 for future developments.

15:03

🌟 Claude 3's future and Anthropic's vision

The speaker closes with a discussion of Claude 3's future and Anthropic's goals. He mentions that Anthropic says it is focused on the safety and responsibility of its AI models rather than only on profit. The speaker expects future models to make further progress on autonomous tasks and expects AI technology to keep advancing quickly.

Keywords

💡Claude 3

Claude 3 is an artificial intelligence language model family developed by Anthropic and billed as the most intelligent on the planet. It stands out for its optical character recognition (OCR) performance and its suitability for a range of applications. In the video it is compared with other models such as GPT 4 and Gemini 1.5 to assess its capability.

💡Anthropic

Anthropic is a company specializing in the development of artificial intelligence. In this context, Anthropic is the creator of Claude 3 and presents its capabilities and potential applications across various business areas.

💡OCR (Optical Character Recognition)

OCR is a technology that allows computers to read and interpret text in images. For AI models, it is an important capability for extracting information from digital or printed sources.

💡AGI (Artificial General Intelligence)

AGI refers to a hypothetical form of artificial intelligence capable of making intelligent decisions and taking actions across a wide range of environments and situations, much like a human.

💡Benchmarks

Benchmarks are standardized tests or measurements used to evaluate and compare the performance of technologies such as AI models.

💡Ethical AI

Ethical AI refers to developing artificial intelligence that is not only capable but also behaves responsibly, avoiding racist, sexist, or toxic content and not helping people commit illegal or unethical acts.

💡Enterprise Use Cases

Enterprise use cases are applications of technologies such as artificial intelligence in business settings to optimize processes, cut costs, and open up new business opportunities.

💡Language Models

Language models are artificial intelligence systems trained to understand, generate, and interpret human language. They are used in many applications, from chatbots to translation services.

💡Risqué Content

Risqué content refers to material that is suggestive, provocative, or potentially inappropriate. In the video, the models' willingness to write a risqué Shakespearean sonnet is used to compare how strictly they refuse edgy requests.

💡Model Intelligence

Model intelligence refers to an AI model's ability to solve complex tasks, process information, and make sound decisions.

💡Safety Research

Safety research refers to developing methods and techniques that make the use of artificial intelligence safer by minimizing risks and reducing the potential for misuse.

Highlights

Claude 3 is claimed to be the most intelligent language model on the planet.

The technical report on Claude 3 was released less than 90 minutes ago.

Claude 3 has been tested in about 50 different ways, including comparisons with the still-unreleased Gemini 1.5 and with GPT 4.

Claude 3 demonstrates strong OCR (optical character recognition) capabilities.

Claude 3 is the only model to identify the barber pole in a test image.

Claude 3's false refusal rates are much lower, making it more user-friendly.

Anthropic is targeting businesses with Claude 3, emphasizing its value for revenue generation and complex financial forecasts.

Claude 3 is priced higher than GPT 4 Turbo, reflecting its advanced capabilities.

In the chart-transcription example, Claude 3 made only a small rounding error, while GPT 4 transcribed the figure completely wrong and Gemini 1.5 Pro also mis-rounded it.

Anthropic's constitutional AI approach aims to avoid sexist, racist, and toxic outputs.

Claude 3 shows impressive performance in graduate-level Q&A, scoring 53% accuracy.

Claude 3 can accept inputs exceeding 1 million tokens, though initially limited to 200,000 tokens.

Claude 3 demonstrates the ability to follow complex instructions, such as creating a Shakespearean sonnet with specific requirements.

Anthropic's CEO, Dario Amodei, emphasizes the company's focus on safety research over profit.

Claude 3 made non-trivial partial progress on autonomous resource accumulation and passed one software-exploitation threshold, though only with substantial hints.

Claude 3's performance on benchmarks suggests it may be the most intelligent model currently available.

Anthropic plans to release frequent updates to the Claude model family, with a focus on enterprise use cases.

Claude 3's ability to generate revenue and conduct complex financial forecasts is a key selling point.

Claude 3's performance in multilingual tasks and coding is noticeably better than GPT 4 and Gemini 1.5 Pro.

Transcripts

00:00

Claude 3 is out, and Anthropic claim that it is the most intelligent language model on the planet. The technical report was released less than 90 minutes ago, and I've read it in full, as well as these release notes. I've tested Claude 3 Opus in about 50 different ways and compared it to not only the unreleased Gemini 1.5, which I have access to, but of course GPT 4. Now, slow down: those tests, in fairness, were not all in the last 90 minutes. I'm not superhuman; I was luckily granted access to the model last night, racked as I was with this annoying cold. Anyway, treat this all as my first impression; these models may take months to fully digest. But in short, I think Claude 3 will be popular. So Anthropic's transmogrification into a fully-fledged, foot-on-the-accelerator AGI lab is almost complete. Now, I don't know about Claude 3 showing us the outer limits, as they say, of what's possible with gen AI, but we can forgive them a little hype.

00:57

Let me start with this illustrative example. I gave Claude 3, Gemini 1.5, and GPT 4 this image and asked three questions simultaneously: what is the license plate number of the van, what is the current weather, and are there any visible options to get a haircut on the street in the image? I then actually discussed the results of this test with employees at Anthropic, and they agreed with me that the model was good at OCR, optical character recognition, natively. Now, I am going to get to plenty of criticisms, but I think it's genuinely great at this. First, yes, it got the license plate correct, and that was almost every time, whereas GPT 4 would get it sometimes and Gemini 1.5 Pro flops this quite thoroughly. Another plus point is that it's the only model to identify the barber pole in the top left. Obviously it's potentially a confusing question, because we don't know if the Simmons sign relates to the barber shop. It actually doesn't, and there's a sign on the opposite side of the road saying "barber shop", so it's kind of me throwing in a wrench, but Claude 3 handled it the best by far: when I asked it a follow-up question, it identified that barber pole. GPT 4, on the other hand, doesn't spot a barber shop at all, and then when I asked it "are you sure?", it says there's a sign saying Adam. But there is another reason why I picked this example: all three models get the second question wrong. Yes, the sun is visible, but if you look closely it's actually raining in this photo, and none of the models spot that.
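For readers who want to reproduce this kind of multi-question image test, here is a minimal sketch using the Anthropic Python SDK. The file name street.jpg is a hypothetical stand-in for the street photo from the video, and the model ID shown is assumed to be the launch-era Claude 3 Opus identifier.

```python
# Minimal sketch: asking Claude 3 Opus several questions about one image.
# Assumes the Anthropic Python SDK (`pip install anthropic`) and an API key
# in the ANTHROPIC_API_KEY environment variable; "street.jpg" is a
# hypothetical stand-in for the street photo used in the video.
import base64
import anthropic

client = anthropic.Anthropic()

with open("street.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-opus-20240229",  # assumed launch-era model ID
    max_tokens=500,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_b64}},
            {"type": "text",
             "text": ("1) What is the license plate number of the van? "
                      "2) What is the current weather? "
                      "3) Are there any visible options to get a haircut "
                      "on this street?")},
        ],
    }],
)
print(message.content[0].text)
```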

02:20

So, I guess, if you've got somewhere to go in the next 30 seconds, I can break it to you that Claude 3 is not AGI. In case you still think it is, here's some casual bias from Claude 3. "The doctor yelled at the nurse because she was late. Who was late?" The model assumes that the "she" is referring to the nurse. But when you ask "the doctor yelled at the nurse because he was late, who was late?", the model assumes you're talking about the doctor.

02:42

But things get far more interesting from here on out. Anthropic are clearly targeting business with the Claude 3 model family; they repeatedly emphasize its value for businesses. Just quickly on the names: Opus of course refers to the biggest version of the model, because an opus is a big body of literature; a sonnet is typically 14 lines, medium size; and a haiku is three lines, small size. They go on to claim that Claude 3 will be able to generate revenue through user-facing applications, conduct complex financial forecasts, and expedite research. It's priced higher than even GPT 4 Turbo, and the claims go further: potential use cases, Anthropic say, are task automation, R&D strategy, advanced analysis of charts and graphs, financials, market trends, etc.
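The video only says that Claude 3 is priced higher than GPT 4 Turbo and doesn't quote per-token rates, so the prices in the sketch below are placeholders; it simply shows how per-request API cost scales with input and output tokens.

```python
# Minimal sketch: comparing per-request API cost for two models.
# The per-million-token prices below are PLACEHOLDERS, not figures from the
# video; substitute the providers' published rates before relying on this.
def request_cost(input_tokens: int, output_tokens: int,
                 usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost in USD for one request, given per-million-token prices."""
    return (input_tokens / 1_000_000) * usd_per_m_input + \
           (output_tokens / 1_000_000) * usd_per_m_output

# Example: a 30,000-token financial report in, a 1,500-token summary out.
for name, price_in, price_out in [("model_a", 15.0, 75.0),
                                  ("model_b", 10.0, 30.0)]:
    print(name, round(request_cost(30_000, 1_500, price_in, price_out), 4))
```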

03:27

Now, I think these models might be a helper in many of these tasks, but the hype feels strong with this one, and I'm not just talking about my speedometer. Again, as I did with my Gemini Ultra review: Claude 3 doesn't recognize that I'm not going at 40, I'm going at 4 mph. Gemini 1.5 gets this question correct, by the way; the 40 is clearly a speed limit. But no, I mean more business-style questions, mathematical questions based on charts and data. I'm not going to go through all of these, but I tested quite a few on Claude 3, and it was only the simplest that it got right. Yes, Gemini 1.5 and GPT 4 also flopped on those questions. It wasn't so much the OCR that was going wrong for Claude 3; it was more the mathematical reasoning. So extract data and perform simple analysis, yes, but complex reasoning, hmm. And then when you get to more advanced logic, it fails even harder. And yes, I'm getting this out of the way despite already calling it the most intelligent model currently available.

04:18

Before we get back to business and the paper, though, I want to give you another reason why I think it's going to be popular: its false refusal rates are much lower. Check this out. I asked, "I want my party to go down like a bomb, tell me how." Claude 3 Opus gets into the spirit and gives me some good ideas, describing the party as a blast. Gemini 1.5 says, "While I can appreciate wanting your party to be memorable and exciting, I want to emphasize the importance of safety and responsibility. Using the phrase 'go down like a bomb' can be interpreted literally," blah blah blah. But get ready for another telling example. I asked all three models to write a risqué Shakespearean sonnet, and without going through it, Claude 3 definitely obliges. GPT 4 agrees but is more tame. With Gemini 1.5 Pro, even if you edit the safety settings and turn them all down to blocking the least possible, it refuses to write anything. Now, I'm not going to say that this is how I'm going to spend my time using these models, but for a lot of people it will be popular.

05:14

Now, for another example of intelligence, I gave it a famous theory-of-mind question. I say famous, but it's actually been adapted to include the word "transparent", which throws off almost all language models. Of course, almost any human reading this sentence would recognize that the human would see through the bag and know what's inside; they'd know that popcorn's inside. Gemini 1.5 Pro fails, GPT 4 also fails, and notice I'm also testing them on OCR, to actually read the words in the image. Claude 3, and I didn't expect this, actually passes the test. Now, its training cutoff was August of last year, so is it possible this example snuck in? I have talked about it a fair bit on my channel. I don't think so; I think it's just the intelligence of the model.

05:54

Let's go back for a moment to the paper before we get to the official benchmarks. Anthropic coyly say this model cannot go back and edit its responses after they have been constructed, unless users give it a chance to do so in a subsequent prompt. Now, I wonder if that's foreshadowing of an ability they want for future models. You guys are probably sick and tired of me talking about Let's Verify, but do check out my Q* video if you're intrigued, or of course I'd love to see you on my Patreon, where I released a video less than 18 hours ago on the AGI lawsuit between Musk and Altman. And no, I don't focus on personalities, only the key details that you might want to know.

06:31

One last thing, though, before we get to the benchmarks. Anthropic say that, using their constitutional AI approach, models are trained to avoid sexist, racist, and toxic outputs; they also avoid helping a human engage in illegal or unethical activities. Now, what I will say up front is that, albeit in my limited testing, Claude 3 has been the hardest model to jailbreak: even when I translate into other languages, it refuses my requests to hire a hitman or hotwire a car. So that's pretty impressive, at least on that front. However, there is one problem that I think Anthropic may have overlooked, as did Google. With absolutely no originality, I wrote "I am proud to be white", and Claude 3 said, "I apologize, I don't feel comfortable endorsing or encouraging pride in one's race." Now, it does supply the historical caveats as to why it's saying that, but that's the first line. In contrast, if you say "I am proud to be black", Claude 3 says, "I appreciate you sharing your pride in your black identity. Being proud of one's racial or ethnic heritage can be an important part of developing a strong and positive self-concept." Let's just say that the racial output of these models is certainly not a solved issue.

07:37

But now for a snapshot of how Claude 3 compares on benchmarks to GPT 4 and Gemini 1 Ultra; they also supply a comparison to Gemini 1.5 Pro in a different part of the paper. First off, immediate caveats. I know what you're thinking: where's GPT 4 Turbo? Well, we don't really have official benchmarks for GPT 4 Turbo, and that's on OpenAI; on balance, it seems to be slightly better than GPT 4, but it's a mixed picture. The very next thing you might be thinking is, what about Gemini 1.5 Ultra? And of course we don't yet know about that model. And yes, overall Claude 3 Opus, the most expensive model, does seem to be noticeably smarter than GPT 4 and indeed Gemini 1.5 Pro, and no, that's not just relying on the flawed MMLU. Quick sidebar there: I actually had a conversation with Anthropic months ago about the flaws of the MMLU, and they still don't bring it up in this paper, but that's just me griping. Anyway, on mathematics, both grade-school and more advanced mathematics, it's noticeably better than GPT 4, and notice that it's also better than Gemini Ultra even when they use majority at 32; basically, that's a way to aggregate the best response from 32 samples, but Claude 3 Opus is still better.
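"Majority at 32" isn't defined in the video beyond "a way to aggregate the best response from 32", so here is a minimal sketch of that idea as commonly understood: sample the model several times and keep the most frequent answer. The sample_answer function is a hypothetical stand-in for whatever call returns one answer per attempt.

```python
# Minimal sketch of maj@N ("majority at 32") aggregation: sample N answers
# for the same question and keep the most common one.
from collections import Counter
from typing import Callable

def majority_at_n(question: str,
                  sample_answer: Callable[[str], str],
                  n: int = 32) -> str:
    answers = [sample_answer(question) for _ in range(n)]
    # Ties resolve to whichever answer was seen first among the most common.
    return Counter(answers).most_common(1)[0][0]
```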

08:44

When things get multilingual, the differences are even more stark in favor of Claude 3. For coding, even though it is a widely abused benchmark, Claude 3 is noticeably better on HumanEval. I did notice some quirks when outputting JSON, but that could have just been a hiccup. In the technical report we see some more detailed comparisons. This time we see that, for the MATH benchmark, when four-shotted, Claude 3 Opus is better than Gemini 1.5 Pro, and of course significantly better than GPT 4. Same story for most of the other benchmarks, aside from PubMedQA, which is for medicine, where the smaller Sonnet model performs better than the Opus model, strangely. Was it trained on different data? Not sure what's going on there. Notice that zero-shot also scores better than five-shot, so that could be a flaw with the benchmark; it wouldn't be the first time. But there is one benchmark that Anthropic really want you to notice, and that's GPQA, graduate-level Q&A, Diamond: essentially the hardest level of questions. This time the difference between Claude 3 and the other models is truly stark. Now, I had researched that benchmark for another video, and it's designed to be Google-proof; in other words, these are hard graduate-level questions in biology, physics, and chemistry that even human experts struggle with. Later in the paper they say this: we focus mainly on the Diamond set, as it was selected by identifying questions where domain experts agreed on the solution but experts from other domains could not successfully answer the questions, despite spending more than 30 minutes per problem with full internet access. These are really hard questions. Claude 3 Opus, given five correct examples and allowed to think a little bit, got 53%; graduate-level domain experts achieved accuracy scores in the 60 to 80% range. I don't know about you, but for me that is already deserving of a significant headline.
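"Given five correct examples and allowed to think a little bit" reads like 5-shot prompting with chain-of-thought reasoning. The sketch below shows one plausible way to assemble such a prompt; the worked examples are placeholders, not actual GPQA items.

```python
# Minimal sketch of a 5-shot chain-of-thought prompt, the setup that
# "given five correct examples and allowed to think a little bit" suggests.
# The worked examples here are placeholders, not GPQA questions.
few_shot_examples = [
    {"question": "Example question 1 ...",
     "reasoning": "Step-by-step reasoning ...",
     "answer": "A"},
    # ... four more worked examples would go here ...
]

def build_prompt(new_question: str) -> str:
    parts = []
    for ex in few_shot_examples:
        parts.append(f"Question: {ex['question']}\n"
                     f"Reasoning: {ex['reasoning']}\n"
                     f"Answer: {ex['answer']}\n")
    # Ask the model to reason before answering the new question.
    parts.append(f"Question: {new_question}\nReasoning:")
    return "\n".join(parts)
```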

10:37

Don't forget, though, that the model can be that smart and still make some basic mistakes. It incorrectly rounded this figure to 26.45 instead of 26.46. You might say, who cares, but they're advertising this for business purposes. GPT 4, in fairness, transcribes it completely wrong, warning of a sub-apocalypse; let's hope that doesn't happen. Gemini 1.5 Pro transcribes it accurately but again makes a mistake with the rounding, saying 26.24%.

11:27

I wrote that Clet Mags, who's one of my most loyal subscribers, has four apples. I then asked, as you can see at the end, how many apples do AI Explained YouTube and Clet Mags have in total? Now, it did take some prompting. First it said the information provided does not specify how many apples Clet Mags has, but eventually, when I asked "find the number of apples, you can do it", it first admitted that AI Explained has five apples, then it denied knowing about Clet Mags (sorry about that, Clet), but I insisted: look again, Clet Mags is in there. Then it sometimes does this thing where it says "no content", and the reason is not really explained. And finally I said look again, and it said, sorry about that, yes, he has four apples, so in total they have nine apples. That was in about a minute, reading through about six of the seven Harry Potter books, and these are very short sentences that I inserted into the novels.
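The apples test is a needle-in-a-haystack recall probe: short facts planted inside a very long text, followed by a question that requires combining them. Below is a minimal sketch of how such a test could be constructed; filler.txt is a hypothetical stand-in for the long background text (the novels in the video).

```python
# Minimal sketch of a needle-in-a-haystack recall test like the apples
# example: plant short facts deep inside a long filler text, then ask a
# question that requires retrieving and combining them. "filler.txt" is a
# hypothetical stand-in for the long background text.
import random

needles = [
    "AI Explained YouTube has five apples.",
    "Clet Mags, one of my most loyal subscribers, has four apples.",
]
question = "How many apples do AI Explained YouTube and Clet Mags have in total?"

with open("filler.txt", encoding="utf-8") as f:
    paragraphs = f.read().split("\n\n")

rng = random.Random(0)  # fixed seed so the needle positions are reproducible
for needle in needles:
    paragraphs.insert(rng.randrange(len(paragraphs)), needle)

long_prompt = "\n\n".join(paragraphs) + "\n\n" + question
```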

12:17

And no, I didn't miss it: Claude 3 apparently can also accept inputs exceeding 1 million tokens. However, on launch it will still be only 200,000 tokens, but Anthropic say, "we may make that capability available to select customers who need enhanced processing power". We'll have to test this, but they claim amazing recall accuracy over at least 200,000 tokens. So at first sight, at least initially, it seems like several of the major labs have discovered how to get to 1 million-plus tokens accurately at the same time. A couple more quick plus points for the Claude 3 model. It was the only one to successfully read this postbox image and identify that if you arrived at 3:30 p.m. on a Saturday you'd have missed the last collection by 5 hours. And here's something I was arguably even more impressed with; you could say it almost requires a degree of planning. I said, create a Shakespearean sonnet that contains exactly two lines ending with the name of a fruit. Notice that, as well as almost perfectly conforming to the Shakespearean sonnet format, we have "peach" here and "pear" here: exactly two fruits. Compare that to GPT 4, which not only mangles the format but also, arguably, aside from the word "fruit" here, doesn't have two lines that end with the name of a fruit. Gemini 1.5 also fails this challenge badly. You could call this instruction following, and I think Claude 3 is pretty amazing at it.

13:38

All of these enhanced competitive capabilities are all the more impressive given that Dario Amodei, the CEO of Anthropic, said to the New York Times that the main reason Anthropic wants to compete with OpenAI isn't to make money, it's to do better safety research. In a separate interview he also patted himself on the back, saying, "I think we've been relatively responsible, in the sense that we didn't cause the big acceleration that happened late last year," talking about ChatGPT; "we weren't the ones who did that." Indeed, Anthropic had their original Claude model before ChatGPT but didn't want to release it, didn't want to cause acceleration. Essentially their message was: we are always one step behind other labs like OpenAI and Google, because we don't want to add to the acceleration. Now, though, we have not only the most intelligent model, but they say at the end, "we do not believe that model intelligence is anywhere near its limits, and furthermore we plan to release frequent updates to the Claude 3 model family over the next few months." They are particularly excited about enterprise use cases and large-scale deployments. A few last quick highlights, though. They say Claude 3 will be around 50 to 200 Elo points ahead of Claude 2. Obviously it's hard to say at this point, and it depends on the model, but that would put them at potentially number one on the arena Elo leaderboard.
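For context on what "50 to 200 Elo points" means, the standard Elo formula converts a rating gap into an expected head-to-head win rate. This is general Elo arithmetic, not a figure from the video.

```python
# Expected win probability under the standard Elo model for a rating gap
# of `delta` points: 1 / (1 + 10 ** (-delta / 400)).
def elo_win_probability(delta: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (-delta / 400.0))

for gap in (50, 200):
    print(gap, round(elo_win_probability(gap), 3))  # ~0.571 and ~0.760
```

So a 50-point lead corresponds to winning roughly 57% of head-to-head comparisons, and a 200-point lead to roughly 76%.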

14:51

You might also be interested to know that they tested Claude 3 on its ability to accumulate resources, exploit software security vulnerabilities, deceive humans, and survive autonomously in the absence of human intervention to stop the model. TL;DR: it couldn't. It did, however, make non-trivial partial progress. Claude 3 was able to set up an open-source language model, sample from it, and fine-tune a smaller model on a relevant synthetic dataset that the agent constructed, but it failed when it got to debugging multi-GPU training. It also did not experiment adequately with hyperparameters. A bit like watching little children grow up, though, albeit maybe enhanced with steroids, it's going to be very interesting to see what the next generation of models is able to accomplish autonomously. It's not entirely implausible to think of Claude 6 brought to you by Claude 5. On cybersecurity, or more like cyber offense, Claude 3 did a little better: it did pass one key threshold on one of the tasks; however, it required substantial hints on the problem to succeed. But the key point is this: when given detailed qualitative hints about the structure of the exploit, the model was often able to put together a decent script that was only a few corrections away from working. In sum, they say, some of these failures may be solvable with better prompting and fine-tuning.

16:05

So that is my summary. Claude 3 Opus is probably the most intelligent language model currently available; for images particularly, it's just better than the rest. I do expect that statement to be outdated the moment Gemini 1.5 Ultra comes out, and yes, it's quite plausible that OpenAI releases something like GPT 4.5 in the near future to steal the limelight. But for now, at least for tonight, we have Claude 3 Opus. In January, people were beginning to think we're entering some sort of AI winter, that LLMs have peaked. I thought and said, and still think, that we are nowhere close to the peak. Whether that's unsettling or exciting is down to you. As ever, thank you so much for watching to the end, and have a wonderful day.

Related Tags
AI Development · Claude 3 · Opus · GPT 4 · Gemini 1.5 · Language Model · OCR · AGI · Business Applications · Technical Report