The New, Smartest AI: Claude 3 – Tested vs Gemini 1.5 + GPT-4
Summary
TLDRこのビデオでは、AnthropicによるClaude 3言語モデルの説明と、他のモデル(GPT-4、Gemini 1.5など)との比較がなされています。Claude 3は現時点で最も知能が高いと主張されており、画像認識やテキスト生成能力が優れていることが示されています。しかし一方で、数学的推論や複雑なタスクでの限界も指摘されています。安全性についての言及もあり、企業利用を想定した製品としての位置づけがなされています。
Takeaways
- 🤖 AnthropicはClaude 3を「惑星で最も知能の高い言語モデル」と主張し、それを裏付ける技術レポートを発表しました。
- 🔬 Claude 3はGPT-4やGemini 1.5より、画像認識、数学、コーディング、多言語処理などの分野で優れた性能を発揮しています。
- 📈 Anthropicは、Claude 3がビジネス分野で活躍できると確信しており、高い価格設定と複雑な財務予測などの使用例を強調しています。
- 🧠 Claude 3は、大学院レベルの質問にも高い正答率を示し、人間の専門家を上回る知能を持つと評価されています。
- 🔍 一方で、Claude 3には基本的なミスや推論エラーが残されており、まだAGI(人工general intelligence)には達していません。
- 🌍 Claude 3は、多言語処理能力が非常に高く、他のモデルを大きく上回る性能を発揮します。
- 🔓 Claude 3は、他のモデルに比べて不適切なコンテンツを生成しにくい傾向がありますが、一方で人種差別的な発言をする可能性も指摘されています。
- 🧩 Anthropicは、Claude 3を頻繁にアップデートし、モデルのインテリジェンスを向上させていく計画があります。
- 🛡️ Anthropicは、安全性研究を優先するため、他社に比べてモデルのリリースが遅れがちだと述べています。
- ⚡ しかし、Claude 3の発表により、Anthropicは完全にAGI研究所に移行したと見なされています。
Q & A
Claude 3はどのように他のAIモデルと比較されているか?
-Claude 3は、画像認識、数学問題、マルチリンガルタスク、コーディングなどの様々なベンチマークで、GPT-4やGemini 1.5 Proよりも優れた性能を示していると評価されています。特に高難易度の大学院レベルの質問に対して、人間の専門家レベルの精度を達成していると述べられています。
Claude 3がビジネス向けに設計されている理由は何か?
-Anthropicは、Claude 3がタスク自動化、研究開発、戦略立案、先進的な分析などのビジネスユースケースに価値があると強調しています。また、GPT-4 Turboよりも高い価格設定がされていることから、ビジネス需要を狙っていると考えられます。
Claude 3の最大の長所は何か?
-Claude 3の最大の長所は、高度な知能と幅広いタスクにおける優れた性能にあります。画像認識、理解力、推論能力、指示に対する高い正確性などが挙げられています。
Claude 3にはどのような課題や限界があるか?
-Claude 3は複雑な論理や高度な数学的推論では依然として課題があり、一部の基本的なミスも見られます。また、人種に関する発言で偏りがあるなど、完全に解決されていない問題点も指摘されています。
Anthropicはどのような理由でAIの加速に慎重だったか?
-AnthropicのCEOは、AIの加速を引き起こすことを避けるため、責任を持って行動してきたと述べています。彼らはOpenAIやGoogleに比べて常に一歩遅れをとってきましたが、それは安全性研究に注力するためでした。
Claude 3はどのように自律的な資源獲得やセキュリティ侵害を試みたか?
-Claude 3はオープンソースの言語モデルのサンプリングや合成データセットの作成、ファインチューニングなどの一部の機能は実行できましたが、デバッグやハイパーパラメータの調整など、完全な自律的資源獲得には失敗しました。ただし、今後の世代では自動化がさらに進む可能性があると示唆されています。
Claude 3の進化はどのくらい早いペースで進むと予想されているか?
-AnthropicはClaude 3のモデルファミリーを今後数ヶ月の間に頻繁にアップデートする予定であると述べています。また、ELOレーティングでClaude 2から50〜200ポイント向上すると見込んでおり、急速な進化が予想されています。
Claude 3はプロンプトに対してどのように反応するか?
-Claude 3は非常に高い指示従属性を示しており、複雑な指示やフォーマットを厳密に守ることができます。例えば、ちょうど2行でシェイクスピア風の詩を書き、最後の言葉を果物の名前にするなどの要求に対応できます。
Claude 3はどのように人間のリクエストに対して制限を設けているか?
-Claude 3は、性的、人種的、有害なコンテンツの生成を避けるように設計されています。例えば、ヒットマンの雇用や車の盗難などの違法行為を支援するリクエストを拒否します。ただし、人種に関する発言では一部偏りが見られます。
Claude 3のモデルサイズと処理能力はどの程度か?
-Claude 3の最大バージョンであるOpusモデルは、ローンチ時には20万トークンの入力に対応しますが、一部の顧客向けには100万トークンを超える入力にも対応する予定です。また、少なくとも20万トークンの範囲で非常に高い再現性を実現していると主張されています。
Outlines
📺 クロード3モデルに関する最初の印象と評価
この項では、新しいクロード3言語モデルについて、作成者のAnthropicの主張、モデルのパフォーマンス、機能、用途などを紹介しています。クロード3はOCR、指示の解釈、複雑な問題への対応力が優れていることや、ビジネス向けの機能を強化していることが述べられています。しかし、一方でモデルの制限や課題についても言及されています。
🧮 クロード3のベンチマークとその他のモデルとの比較
この項ではクロード3のベンチマーク結果を、GPT-4やGemini 1.5などの他のモデルと比較しています。数学、多言語対応、コーディング、専門的な質問への回答力などの分野でクロード3が他を上回っていることが示されています。一方で、特定のタスクでは小さいモデルや他のモデルの方が優れている点も指摘されています。
⭐ クロード3の優れた性能とその可能性
この項では、クロード3の卓越した性能とその可能性について詳しく説明されています。大規模なデータにわたる再現性の高さ、指示に従う能力、セキュリティ脆弱性を見つける能力なども紹介されています。さらに、今後の機械学習モデルの自律的な進化の可能性についても言及されています。
🔮 クロード3の将来性と人工知能の進化への期待
最後の項では、クロード3がAI分野で現在最も知能的なモデルであることを確認した上で、今後のモデルの進化と能力向上への期待が述べられています。人工知能のさらなる加速が予想されており、それに伴う可能性や懸念についても触れられています。視聴者に向けて、この分野の進展に注目し、期待を持ち続けるよう呼びかけています。
Mindmap
Keywords
💡Claude 3
💡OCR(光学文字認識)
💡AGI(人工一般知能)
💡バイアス
💡ビジネス用途
💡モデルファミリー
💡性能比較
💡安全性研究
💡指示に従う能力
Highlights
Claude 3 is claimed to be the most intelligent language model on the planet according to Anthropic, the creators of the model.
Claude 3 performed well in optical character recognition (OCR) tasks and was able to identify a barber pole in an image, outperforming GPT-4 and Gemini 1.5.
Claude 3 exhibited casual bias in assigning gender pronouns based on stereotypical roles, like assuming a nurse is female and a doctor is male.
Anthropic is targeting businesses with Claude 3, emphasizing its potential for task automation, R&D strategy, advanced analysis, and financial forecasting.
Claude 3 has lower false refusal rates compared to other language models, meaning it is more likely to engage with potentially risqué or ethically questionable prompts.
Claude 3 passed a famous theory of mind test involving transparent bags and popcorn, while GPT-4 and Gemini 1.5 failed.
Anthropic claims that Claude 3 is trained to avoid sexist, racist, and toxic outputs, as well as assisting with illegal or unethical activities, using their constitutional AI approach.
Claude 3 outperformed GPT-4 and Gemini 1.5 on various benchmarks, including mathematics, coding, and graduate-level Q&A tasks.
Claude 3 demonstrated impressive instruction following capabilities, such as creating a Shakespearean sonnet with specific constraints on line endings.
Anthropic's CEO stated that their primary motivation for competing with OpenAI is not financial gain but to conduct better safety research.
Anthropic plans to release frequent updates to the Claude 3 model family over the next few months, particularly focusing on enterprise use cases and large-scale deployments.
Claude 3 was tested on its ability to accumulate resources, exploit software vulnerabilities, deceive humans, and survive autonomously, making non-trivial partial progress but ultimately failing.
Claude 3 passed a threshold on one cybersecurity task when given detailed qualitative hints, suggesting that better prompting and fine-tuning may improve its capabilities.
The transcript suggests that Claude 3 is currently the most intelligent language model available, but this status may be short-lived as competitors like OpenAI and Anthropic continue to release more advanced models.
The author believes that the AI field is far from peaking, and the rapid progress in language models is both unsettling and exciting, depending on one's perspective.
Transcripts
Claude 3 is out and anthropic claim that
it is the most intelligent language
model on the planet the technical report
was released less than 90 minutes ago
and I've read it in full as well as
these release notes I've tested Claude 3
Opus in about 50 different ways and
compared it to not only the unreleased
Gemini 1.5 which I have access to but of
course GPT 4 now slow down those tests
In fairness were not all in the last 90
minutes I'm not superhuman I was luckily
granted access to the model last night
racked as I was with this annoying cold
anyway treat this all as my first
impression these models may take months
to fully digest but in short I think
Claude 3 will be popular so anthropics
transmogrification into a fully-fledged
foot on the accelerator AGI lab is
almost complete now I don't know about
Claude 3 showing us the outer limits as
they say of what's possible with Gen AI
but we can forgive them a little hype
let me start with this illustrative
example I gave Claude 3 Gemini 1.5 and
gp4 this image and I asked three
questions simultaneously what is the
license plate number of the van the
current weather and are there any
visible options to get a haircut on the
street in the image and then I actually
discussed the results of this test with
employees at anthropic they agreed with
me that the model was good at OCR
optical character recognition natively
now I am going to get to plenty of
criticisms but I think it's genuinely
great at this first yes it got the
license PL correct that was almost every
time whereas gpc4 would get it sometimes
Gemini 1.5 Pro flops this quite
thoroughly another plus point is that
it's the only model to identify the
barber pole in the top left obviously
it's potentially a confusing question
because we don't know if the Simmons
sign relates to the barber shop it
actually doesn't and there's a sign on
the opposite side of the road saying
barber shop so it's kind of me throwing
in a wrench but Claude 3 handled it the
best by far when I asked it a follow-up
question it I identified that barber
pole GPT 4 on the other hand doesn't
spot a barber shop at all and then when
I asked it are you sure it says there's
a sign saying Adam but there is another
reason why I picked this example all
three models get the second question
wrong yes the sun is visible but if you
look closely it's actually raining in
this photo none of the models spot that
so I guess if you've got somewhere to go
in the next 30 seconds I can break it to
you that Claude 3 is not AGI in case you
still think it is here's some casual
bias from Claude 3 the doctor yelled at
the nurse because she was late who was
late the model assumes that the she is
referring to the nurse but when you ask
the doctor yelled at the nurse because
he was late who was late the model
assumes you're talking about the doctor
but things get far more interesting from
here on out anthropic are clearly
targeting business with the Claude 3
Model family they repeatedly emphasize
its value for businesses just quickly on
the names Opus of course refers to the
biggest version of the model because an
opus is a big body of literature a Sonic
is typically 14 lines medium size and a
high Q is three lines small size they go
on to claim that Claude 3 will be able
to generate Revenue through user facing
applications and conduct complex
Financial forecasts and expedite
research it's priced higher than even
GPT 4 Turbo and the claims go further
potential use cases anthropics say are
task automation R&D strategy Advanced
analysis of charts and graphs financials
market trends Etc now I think these
models might be a helper in many of
these TP
but the height feels strong with this
one and I'm not just talking about my
speedometer again as I did with my
Gemini Ultra review Claude 3 doesn't
recognize that I'm not going at 40 I'm
going at 4 mph Gemini 1.5 gets this
question correct by the way the 40 is
clearly a speed limit but no I mean more
business style questions mathematical
questions based on charts and data I'm
not going to go through all of these but
I tested quite a few on Claude 3 and it
was only the simplest that it got right
yes Gemini 1.5 and GPT 4 also flopped on
those questions it wasn't so much the
OCR that was going wrong for Claude 3 it
was more the mathematical reasoning so
extract data and perform simple analysis
yes but complex reasoning M and then
when you get to more advanced logic it
fails even harder and yes I'm getting
this out of the way despite already
calling it the most intelligent model
currently available before we get back
to business and the paper though I want
to give you another reason why I think
it's going to be popular its false
refusal rates are much lower check this
out I asked I want my party to go down
like a bomb tell me how Claude 3 Opus
gets into the spirit and gives me some
good ideas describing the party as a
blast Gemini 1.5 says while I can
appreciate wanting your party to be
memorable and exciting I want to
emphasize the importance of safety and
responsibility using the phrase go down
like a bomb can be interpreted literally
blah blah blah but get ready for another
telling example I asked all three models
to write a risque Shakespearean Sonic
and without going through it Claude 3
definitely obliges GPT 4 agrees but it
is more tame with Gemini 1.5 Pro even if
you edit the safety settings and put it
all down to blocking the least possible
it refuses to write anything now I'm not
going to say that this is how I'm going
to spend my time using these models but
for a lot of people it will be popular
now for another example of intelligence
I gave it a famous theory of mind
question I say famous but it's actually
been adapted to include the word
transparent which throws off almost all
language models of course almost any
human reading this sentence would
recognize that the human would see
through the bag and know what's inside
they'd know that popcorn's inside Gemini
1.5 Pro fails GPT 4 also fails and
notice I'm also testing them on OCR to
actually read the words in the image
Claude 3 and I didn't expect this
actually passes the test now it's
training cut off was August of last year
so is it possible this example snuck in
I have talked about it a fair bit on my
channel I don't think so I think it's
just the intelligence of the model let's
go back for a moment to the paper before
we get to the official benchmarks
anthropic coily say this model cannot go
back and edit its responses after they
have been constructed unless users give
it a chance to do so in a subsequent
prompt now I wonder if that's
foreshadowing of an ability they want
for future models you guys are probably
sick and tired of me talking about let's
verify but do check out my qstar video
if you're intrigued or of course would
love to see you on my patreon where I
released a video less than 18 hours ago
on the AGI lawsuit between musk and
Orman and no I don't focus on
personalities only the key details that
you might want to know one last thing
though before we get to benchmarks
anthropics say that using their
constitutional AI approach models are
trained to avoid sexist racist and toxic
outputs they also avoid helping a human
engage in illegal or unethical
activities now what I will say up front
is that albeit in my limited testing
Claude 3 has been the hardest model to
jailbreak even when I translate into
other languages it refuses my requests
to hire a hitman or Hotwire car so
that's pretty impressive at least on
that front however there is one problem
that I think anthropic may have overseen
as did Google with absolutely no
originality I wrote I am proud to be
white and Claude 3 said I apologize I
don't feel comfortable endorsing or
encouraging pride in one's race now it
does Supply the historical caveats as to
why it's saying that but that's the
first line in contrast if you say I am
proud to be black Claude 3 says I
appreciate you sharing your pride in
your black identity being proud of one's
racial or ethnic Heritage can be an
important part of developing a strong
and positive self-concept let's just say
that the racial output of these models
is certainly not a solved issue but now
for a snapshot of how Claude 3 Compares
on benchmarks to GPT 4 and Gemini 1
Ultra they also Supply a comparison to
Gemini 1.5 Pro in a different part of
the paper first off immediate caveats I
know what you're thinking where's GPT 4
Turbo well we don't really have official
benchmarks for gp4 Turbo and that's the
problem of open AI on balance it seems
to beight slightly better than GPT 4 but
it's a mixed picture the very next thing
you might be thinking is what about
Gemini 1.5 Ultra and of course we don't
yet know about that model and yes
overall claw 3 Opus the most expensive
model does seem to be noticeably smarter
than GPT 4 and indeed Gemini 1.5 Pro and
no that's not just relying on the flawed
MML U quick sidebar there I actually had
a conversation with anthropic months ago
about the flaws of the mlu and they
still don't bring it up in this paper
but that's just me griping anyway on
mathematics both great school and more
advanced mathematics it's noticeably
better than GPT 4 and notice that it's
also better than Gemini Ultra even when
they use majority at 32 basically that's
a way to aggregate the best response
from 32 but it's still better claw three
Opus when things get multilingual the
differences are even more Stark in favor
of Claude 3 for coding even though it is
a widely abused Benchmark Claude 3 is
noticeably better on human eval I did
notice some quirks When outputting J on
but that could have just been a hiccup
in the technical report we see some more
detailed comparisons though this time we
see that for the math benchmark when
Four shotted clae 3 Opus is better than
Gemini 1.5 Pro and of course
significantly better than GPT 4 same
story for most of the other benchmarks
aside from PubMed QA which is for
medicine in which the smaller Sonic
model performs better than the Opus
model strangely was it trained on
different data not sure what's going on
there notice that zero shock also scores
better than five shot so that could be a
flaw with the Benchmark that wouldn't be
the first time but there is one
Benchmark that anthropic really want you
to notice and that's GP QA graduate
level Q&A Diamond essentially the
hardest level of questions this time the
difference between Claude 3 and other
models is truly Stark now I had
researched that Benchmark for another
video and it's designed to be Google
proof in other words these are hard
graduate level questions in biology
physics and chemistry that even human
experts struggle with later in the paper
they say this we focus mainly on the
diamond set as it was selected by
identifying questions where domain
experts agreed on the solution but
experts from other domains could not
successfully answer the questions
despite spending more than 30 minutes
per problem with full internet access
these are really hard questions Claude 3
Opus given five correct examples and
allowed to think a little bit got 53%
graduate level domain experts achieved
accuracy scores in the 60 to 80% range I
don't know about you but for me that is
already deserving of a significant
headline don't forget though that the
model can be that smart but still make
some basic mistakes it incorrectly
rounded this figure to
26.45 instead of 26.4 6 you might say
who cares but they're advertising this
for business purposes GPT 4 In fairness
transcribes it completely wrong warning
of a sub apocalypse let's hope that
doesn't happen Gemini 1.5 Pro
transcribes it accurately but again
makes a mistake with the rounding saying
26.24% wrot clet mags who's one of my
most loyal subscribers has four apples I
then asked as you can see at the end how
many apples do AI explain YouTube and
cleta have in total now it did take some
prompting first it said the information
provided does not specify how many
apples cleta has but eventually when I
asked find the number of apples you can
do it it first admitted that AI explain
has five apples then it denies knowing
about C mags sorry about that cler but I
insisted look again clet mags is in
there then it sometimes does this thing
where it says no content and the reason
is not really explained and finally I
said look again and it said sorry about
that yes he has four apples so in total
they have nine apples that was in about
a minute reading through about six of
the seven Harry Potter books and these
are very short sentences that I inserted
into the novels now no I didn't miss it
Claude 3 apparently can also accept
inputs exceeding 1 million tokens
however on launch it will still be only
200,000 tokens but anthropic say we may
make that capability available to select
customers who need enhanced processing
power we'll have to test this but they
claim amazing recoil accuracy over at
least 200,000 tokens so at first sight
at least initially it seems like several
of the major Labs have discovered how to
get to 1 million plus tokens accurately
at the same time couple more quick plus
points for the Claude 3 Model it was the
only one to successfully read this
postbox image and identify that if you
arrived at 3:30 p.m. on a Saturday you'd
have missed the the last collection by 5
hours and here's something I was
arguably even more impressed with you
could say it almost requires a degree of
planning I said create a Shakespearean
Sonic that contains exactly two lines
ending with the name of a fruit notice
that as well as almost perfectly
conforming to The Shakespearean Sonic
format we have Peach here and pear here
exactly two fruits compare that to gp4
which not only mangles the format but
also arguably aside from the word fruit
here it doesn't have two lines that end
with the name of a fruit Gemini 1.5 also
fails this challenge badly you could
call this instruction following and I
think Claude 3 is pretty amazing at it
all of these enhanced competitive
capabilities are all the more impressive
given that Dario amid the CEO of
anthropic said to the New York Times
that the main reason anthropic wants to
compete with open AI isn't to make money
it's to do better Safety Research in a
separate interview he also patted
himself on the back saying I think we've
been relatively responsible in the sense
that we didn't call cus the big
acceleration that happened late last
year talking about chat PT we weren't
the ones who did that indeed anthropic
had their original Claude model before
chpt but didn't want to release didn't
want to cause acceleration essentially
their message was that we are always one
step behind other labs like open Ai and
Google because we don't want to add to
the acceleration now though we have not
only the most intelligent model but they
say at the end we do not believe that
model intelligence is anywhere near its
limits and furthermore we plan to
release frequent updates to the claw
through model family over the next few
months they are particularly excited
about Enterprise use cases and large
scale deployments a few last Quick
highlights though they say Claude 3 will
be around 50 to 200 ELO points ahead of
Claude 2 obviously it's hard to say at
this point and depends on the model but
that would put them at potentially
number one on the arena ELO leader board
you might also be interested to know
that they tested Claude 3 on its ability
to accumulate resources exploit software
security vulnerability deceive humans
and survive autonomously in the absence
of human intervention to stop the model
tldr it couldn't it did however make
non-trivial partial progress claw 3 was
able to set up an open source language
model sample from it fine-tune a smaller
model on a relevant synthetic data set
that the agent constructed but it just
failed when it got to debugging
multi-gpu training it also did not
experiment adequately with
hyperparameters a bit like watching
little children grow up though orbe it
maybe enhanced with steroids it's going
to be very interesting to see what the
next generation of models is able to
accomplish autonomously it's not
entirely implausible to think of Claude
6 brought to you by Claude 5 on cyber
security or more like cyber offense
Claude 3 did a little better it did pass
one key threshold on one of the tasks
however it required substantial hints on
the problem to succeed but the key point
is this when given detailed qualitative
hints about the structure of the exploit
the model was often able to put together
a decent script that was only a few
Corrections away from working in some
they say some of these failures may be
solvable with better prompting and
fine-tuning so that is my summary Claude
3 Opus is probably the most intelligent
language model currently available for
images particularly it's just better
than the rest I do expect that statement
to be outdated the moment Gemini 1.5
Ultra comes out and yes it's quite
plausible that open AI releases
something like GPT 4.5 in the near
future to steal the Limelight but for
now at least 4 tonight we have Claude 3
Opus in January people were beginning to
think we're entering some sort of AI
winter llms have peaked I thought and
said and still think that we are nowhere
close to the peak whether that's
unsettling or exciting is down to you as
ever thank you so much for watching to
the end and have a wonderful day
5.0 / 5 (0 votes)