OpenAI's Q* Is BACK! - Was AGI Just Solved?
Summary
TLDR This video covers an interesting recent development in the AI community: the ability of large language models (LLMs) to solve math problems. According to new research, an LLM using Monte Carlo tree search, the technique Google's DeepMind used in AlphaGo, achieved a very high accuracy rate on the GSM8K math benchmark. The video also revisits Q*, the rumored OpenAI system said to solve unseen math problems, and discusses where AI capabilities may be heading.
Takeaways
- 🧠 LLMs (Large Language Models) are showing the ability to solve math problems, a notable technical milestone.
- 📈 Using Monte Carlo tree search, the technique Google's DeepMind used in AlphaGo, research shows an LLM with roughly 200 times fewer parameters can still achieve a very high math problem-solving rate.
- 🔍 Q* (Q-star) set the AI community abuzz, becoming a major topic just before Sam Altman's brief ouster from OpenAI.
- 🤖 GPT Zero was a secret OpenAI project, named after DeepMind's AlphaZero program, exploring how language models could solve math and science problems.
- 🎯 Combining Monte Carlo tree search with backpropagation through the tree, Llama 3 recorded a high score on the GSM8K math benchmark.
- 🚀 Q* reportedly demonstrated the ability to solve previously unseen math problems, alarming some AI safety researchers.
- 🔑 As Andrej Karpathy has argued, techniques like Monte Carlo tree search may be essential for future models to keep improving.
- 📚 AlphaGo surpassed humans by learning through self-improvement, an important lesson for LLMs as well.
- 🛠️ Compute cost is the main obstacle to widespread adoption of LLM-plus-search methods, highlighting the need for optimization.
- 🌐 ARC-AGI has been proposed as a new benchmark, an important yardstick for judging whether a system has reached AGI (artificial general intelligence).
- 🔮 Future models that improve vision, coding, and long-context understanding may be able to beat the ARC-AGI benchmark.
Q & A
What is Q*, and why has it captured the AI community's attention?
-Q* was reported to solve previously unseen math problems, an important technical milestone and one reason the AI community took notice. It was the subject of an article in The Information about a breakthrough at OpenAI just before Sam Altman's brief ouster.
According to recent research, how are large language models becoming useful for math?
-Recent research shows that a large language model using Monte Carlo tree search with backpropagation through the tree can solve math problems: the Llama 3 8-billion-parameter model recorded 96.7% accuracy on GSM8K. That means it outperforms GPT-4, Claude, and Gemini despite having roughly 200 times fewer parameters.
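As a quick sanity check of the parameter comparison, here is the arithmetic behind the "roughly 200 times fewer parameters" claim, using the 1.8-trillion figure for GPT-4 repeated in the transcript (an unconfirmed community estimate, not an official number):

```python
# Sanity check of the parameter-count comparison: an 8B-parameter
# Llama 3 versus the rumored 1.8T-parameter GPT-4. The 1.8T figure is
# an unconfirmed estimate repeated in the transcript, not an official one.

llama_params = 8e9
gpt4_params = 1.8e12
ratio = gpt4_params / llama_params
print(ratio)  # 225.0 — roughly the "200x fewer parameters" claim
```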
What was the GPT Zero project, and why was it kept secret?
-GPT Zero was a secret project OpenAI launched in 2021 to enable language models to solve reasoning tasks, including math and science problems. It was named as a nod to DeepMind's AlphaZero, the program that learned to play chess, Go, and shogi.
What is Monte Carlo tree search, and how might it apply to AI?
-Monte Carlo tree search is the algorithm used in AlphaGo: the AI system searches over possible future configurations before deciding on the best next move. The same approach can be applied to language models, where it helps with solving math problems.
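The core of that search is the selection rule that decides which branch to explore next. A minimal sketch of the standard UCT rule (the node representation and constants here are illustrative assumptions, not from any specific paper):

```python
import math

# A minimal sketch of the UCT selection rule at the heart of Monte Carlo
# tree search: pick the child that balances exploitation (average reward
# so far) against exploration (how rarely it has been visited).

def uct_score(child_value, child_visits, parent_visits, c=1.41):
    """Upper Confidence bound applied to Trees (UCT)."""
    if child_visits == 0:
        return float("inf")  # always try unvisited children first
    exploit = child_value / child_visits  # average reward so far
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore

def select_child(children, parent_visits):
    """children: list of (value_sum, visit_count) pairs; returns an index."""
    scores = [uct_score(v, n, parent_visits) for v, n in children]
    return scores.index(max(scores))

# An unvisited child is always selected before well-explored ones:
print(select_child([(5.0, 10), (0.0, 0), (3.0, 4)], 14))  # 1
```

After a simulation from the selected leaf, the result is backed up along the path (the "backpropagation" step the research uses), updating each node's value sum and visit count.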
What does Andrej Karpathy say about the future of language models?
-Andrej Karpathy says that techniques like Monte Carlo tree search will be needed for the future of language models. In his one-hour talk he describes them as something these models need in order to keep improving.
What was the key to AlphaGo's success?
-The keys to AlphaGo's success were self-improvement and Monte Carlo tree search. AlphaGo started by imitating expert human players, but later reached superhuman level through self-improvement against a cheap, automatic reward: whether it won the game.
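That step-two loop, learning from an automatic win/loss reward rather than human data, can be sketched with a toy example. The two-armed "game" and the greedy update below are deliberately simplistic stand-ins for Go and for AlphaGo's actual training procedure:

```python
import random

# A toy sketch of the self-improvement loop: no human data, just a
# cheap automatic reward function (did we win?) queried millions of
# times. Here "the game" is a two-armed bandit where move 0 is better.

def play(action, rng):
    """Action 0 wins 80% of the time, action 1 only 20%."""
    return 1 if rng.random() < (0.8 if action == 0 else 0.2) else 0

def self_improve(games=2000, eps=0.1, seed=0):
    rng = random.Random(seed)
    wins, plays = [0, 0], [0, 0]
    for _ in range(games):
        if rng.random() < eps or 0 in plays:
            action = rng.randrange(2)  # explore occasionally
        else:
            # exploit: pick the action with the best observed win rate
            action = 0 if wins[0] / plays[0] >= wins[1] / plays[1] else 1
        plays[action] += 1
        wins[action] += play(action, rng)  # query the reward function
    return 0 if wins[0] / plays[0] >= wins[1] / plays[1] else 1

print(self_improve())  # converges on the stronger move, action 0
```

The point of the sketch is the shape of the loop: because the reward is automatic and cheap, the system can play itself without imitating anyone, which is exactly what open-ended language tasks currently lack.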
What challenges do current language models face?
-Current language models struggle with basic vision, coding ability, and handling long context. Overcoming these limitations is an important step toward improving AI performance.
What is ARC-AGI, and why does it matter?
-ARC-AGI is a visual reasoning benchmark that requires inferring a new rule from a few examples. It matters because it tests whether an AI can adapt to novel situations and solve problems using only core knowledge.
Where are GPT-4o's current limits?
-GPT-4o is limited in its visual understanding of grids, makes basic coding mistakes, and struggles to use long context. Moving past these limits is expected to contribute to better AI performance.
What would AI need to beat the ARC-AGI benchmark?
-To beat ARC-AGI, AI would need better basic visual understanding, stronger coding ability, and improved handling of long context. Learning from a larger number of samples is also an important factor.
Outlines
🧠 Advances in AI's math ability
Recent research shows that large language models can be remarkably good at solving math problems. Using Monte Carlo tree search with backpropagation through the tree, the same technique Google's DeepMind used in AlphaGo, an 8-billion-parameter Llama 3 model recorded 96.7% accuracy on the GSM8K math benchmark. That is a striking result: it beats GPT-4, Claude, and Gemini despite having roughly 200 times fewer parameters.
🤖 Lessons from AlphaGo
AlphaGo started by imitating human players and then acquired superhuman skill through self-improvement. The same recipe could apply to large language models, which today remain at the imitate-humans stage. However, because language models lack a general reward criterion, the step-two self-improvement phase is considered hard.
🔍 Combining LLMs with search
Combining LLMs with search algorithms has been shown to boost AI capability substantially. In the AlphaCode 2 example, combining search with a reranking mechanism improved competitive programming performance. This hints at how future AI systems may exploit search.
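The sample-then-rerank pattern this outline describes can be sketched in a few lines. The candidate generator below is a stand-in for an LLM, and the toy "programs" are linear functions; AlphaCode 2's actual pipeline is far larger:

```python
import random

# Sketch of sample-then-rerank: draw many diverse candidates, keep the
# ones that pass the known examples, then rank survivors. Each toy
# "candidate program" is a linear rule y = a*x + b, a stand-in for
# LLM-generated code.

def generate_candidates(rng, n):
    return [(rng.randint(-5, 5), rng.randint(-5, 5)) for _ in range(n)]

def passes_examples(candidate, examples):
    a, b = candidate
    return all(a * x + b == y for x, y in examples)

def solve(examples, seed=0, n=500):
    rng = random.Random(seed)
    survivors = [c for c in generate_candidates(rng, n)
                 if passes_examples(c, examples)]
    if not survivors:
        return None
    # rerank: prefer the simplest surviving program (smallest |a| + |b|)
    return min(survivors, key=lambda c: abs(c[0]) + abs(c[1]))

# target rule: y = 2x + 1
print(solve([(0, 1), (1, 3), (2, 5)]))  # (2, 1) if sampling drew it, else None
```

The design choice mirrors the source's description: diversity at the sampling stage maximizes the chance that at least some candidates are correct, and the known examples act as a cheap filter before reranking.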
🚀 The road to AGI
For AI to achieve general intelligence, current limitations must be overcome. On the ARC-AGI benchmark, GPT-4o's limited vision and coding skills are the bottleneck. If those limits are overcome, AI may gain the ability to handle far more advanced tasks.
🌐 OpenAI's next steps
Attention is turning to what OpenAI does next. They are expected to pursue search-based improvements to responses and better handling of long context. New technical breakthroughs are anticipated that could raise AI capabilities further.
📈 Growth forecasts for AI
AI capability is advancing steadily, and new benchmarks are expected to fall. Next-generation models such as GPT-5 are expected to improve basic visual understanding and potentially surpass the ARC-AGI benchmark, pointing to further progress.
🛠️ Outlook for AI
The outlook for AI is bright. Combining search algorithms with LLMs could raise AI capability further, and new technical breakthroughs suggest AI will become better at solving more complex problems.
Mindmap
Keywords
💡QAR
💡Large Language Models (LLMs)
💡Monte Carlo Tree Search
💡GPT
💡Self-improvement
💡Neural Networks
💡Search Algorithms
💡Benchmarks
💡AGI (Artificial General Intelligence)
💡Neurosymbolic AI
Highlights
Q* may not be dead; recent signs suggest it could be real.
Q* once captivated the AI community for its reported ability to solve unseen math problems.
Research shows large language models excel at math when using Monte Carlo tree search.
An 8-billion-parameter Llama model outperformed GPT-4 and Claude on the GSM8K math benchmark.
Applying search to large language models (LLMs) is an emerging research direction.
GPT Zero was a secret OpenAI project exploring the use of LLMs to solve math and science problems.
Researchers hypothesized that giving LLMs more time and compute could enable new academic breakthroughs.
A paper demonstrates GPT-4-level Mathematical Olympiad solutions reached via Monte Carlo tree self-refine.
Andrej Karpathy discussed the role of Monte Carlo tree search in AlphaGo.
AlphaGo surpassed human Go players through self-improvement.
Large language models are currently trained mainly by imitating human responses.
The lack of a general reward criterion is the main challenge for open-ended language modeling.
AlphaCode 2 combined a search algorithm with a reranking mechanism, dramatically improving coding ability.
Compute cost is the main factor limiting the application of search algorithms to LLMs.
Noam Brown discussed the importance of search in building superhuman AI systems.
Sam Altman hinted that future systems may incorporate search to improve reliability.
ARC-AGI proposes a new way to evaluate whether a system has reached artificial general intelligence.
An approach using GPT-4o with few-shot prompting achieved striking results on the ARC-AGI benchmark.
Although GPT-4o is limited in vision and coding, future models are expected to improve these weaknesses.
Forecasts give next-generation multimodal models a high probability of surpassing average human performance on ARC-AGI.
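The transcript reports a scaling trend for the program-sampling approach: roughly 3 percentage points of ARC-AGI accuracy per doubling of sampled Python programs, from a 71% anchor at 5,000 samples. A rough back-of-envelope extrapolation (treating the trend as indefinitely linear in log2 of samples is a strong assumption, used here only for illustration):

```python
import math

# Back-of-envelope extrapolation of the reported scaling trend:
# ~3 percentage points per doubling of sampled programs, anchored at
# 71% accuracy with 5,000 samples (figures from the source). Assuming
# the trend stays linear in log2(samples) is a strong simplification.

def projected_accuracy(samples, base_acc=71.0, base_samples=5000,
                       pct_per_doubling=3.0):
    doublings = math.log2(samples / base_samples)
    return base_acc + pct_per_doubling * doublings

# How many samples would reach the 85% human baseline, if the trend held?
doublings_needed = (85.0 - 71.0) / 3.0        # about 4.7 doublings
samples_needed = 5000 * 2 ** doublings_needed  # on the order of 127,000
print(round(projected_accuracy(10000), 1))     # 74.0
```

Notably, that projected sample count is still well below the millions of samples per problem AlphaCode used, which is the comparison the source itself draws.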
Transcripts
so if you thought Q* was dead or that we weren't going to get any more updates think again because recently we actually got a very small inkling that Q* might just be true now Q* did actually
Captivate the AI Community for several
reasons because there were so many
things going on but recently I saw this
tweet on my timeline which was a
reference to a research paper that
actually showed how tiny large language
models are actually good at maths as a
Frontier Model so by using the same
techniques that Google actually used in AlphaGo which was Monte Carlo tree search and of course backpropagation the llama 3 8 billion parameter model gets 96.7% on the math benchmark GSM8K and that is better than GPT-4 Claude and Gemini with 200 times less parameters so
this is a pretty pretty crazy Revelation
because it was something that I didn't
see coming just yet because the entire
framework of applying search to llms
is something that I wouldn't say it's
still in its early stage but it isn't
something that we didn't really think
about initially when we went from GPT
3.5 to GPT 4 and there was a lot of talk
about this being explored and clearly
we're now seeing that this is being
explored and the results of that are
truly truly impressive now I am going to
dive into a little bit of this research
paper and if you don't remember Q* Q* was basically some kind of you know crazy thing because it was around the time that Sam Altman got fired there was this article in The Information that you know spoke about how OpenAI made a breakthrough before Sam Altman got fired which was stoking excitement and concern and of course the crazy thing about Q* that you need to know is that Q* was able to solve math problems that it hadn't seen before which was an important technical milestone and apparently a demo of the model circulated within OpenAI in the weeks before and the pace of development alarmed some researchers focused on safety
now what is crazy about all of this is
that you know the team working on this led by Sutskever were actually working on ways you know to allow llms like GPT-4 to solve tasks that involved reasoning like math or science problems and in 2021 they actually launched a secret project called GPT Zero which was a nod to DeepMind's AlphaZero program that could play chess go and shogi so they were already kind of working on this in those early early days where they launched a project called GPT Zero so this is not
something that is from a fundamental
like thought point of view something
that people haven't really thought about
before this is something that you can
see some of the top researchers have
thought before and they initially
hypothesized that by giving large
language models more time and more
computing power to generate responses to
questions they could allow them to
develop new academic breakthroughs so of
course that is pretty crazy because what
we see in this paper here is that they
actually did this basically they
actually used it the paper is accessing GPT-4 level mathematical Olympiad solutions via Monte Carlo tree self-refine with llama 3 8 billion parameters a technical report and you can see right here they state that the Monte Carlo tree self-refine algorithm represents an integration of Monte Carlo tree search with large language models abstracting the iterative refinement process of mathematical problem solutions into a search tree structure nodes on this tree represent different versions of answers while edges denote attempts at improvement this algorithm's operational workflow adheres to the Monte Carlo tree search algorithm's general patterns so
basically saying that they're basically
doing the same thing that they did with
Alpha go they're using that same kind of
structure in llms and the craziest thing
about this was that you know Andrej Karpathy actually spoke about you know the fact that if you actually use Monte Carlo tree search which is something that they used in AlphaGo and basically
how this kind of works you know cuz I
know I'm saying this and I know that a
lot of people might not be familiar with
all of the news from the past couple of
months but basically what they're doing
is they're basically having an AI system
search over all the possible
configurations before making that move
and then once they can make that move
they can then of course on the next move
search again of all the possible board
configurations and then go ahead and
make the move so that is essentially how
that works and it's something that you
know Andrej Karpathy describes in his 1-hour talk about language models where he
basically says that you know um this is
something that we kind of need for
future models if we want things to
improve and it's pretty interesting that
now we're seeing a paper that you know
is is so small and its size be able to
you know if we look at like the
benchmarks and stuff you can see that on
the GSM8K on the eight rollouts like on
the eight rollouts that it did it
achieved 96.66% call it 97% and if you
actually do check some of the frontier
models it literally surpasses them not
by a large amount but you got to
remember guys this is 8 billion
parameters compared to a model that is
1.8 trillion parameters and of course
Gemini Ultras I'm not even sure how many
parameters that is not publicly
available but I know it is a pretty
pretty large language model so basically
this is Andrej Karpathy stating this and I think his 1-hour talk is
definitely fascinating I really do
believe that it's something that
everyone should watch because kind of
gives a lot of insight into what's going
to be coming in the future um but take a
look at this cuz it's kind of important
a lot of people are broadly inspired by
what happened with AlphaGo so in AlphaGo
um this was a go playing program
developed by deepmind and alphago
actually had two major stages uh the
first release of it did in the first
stage you learn by imitating human
expert players so you take lots of games
that were played by humans uh you kind
of like just filter to the games played
by really good humans and you learn by
imitation you're getting the neural
network to just imitate really good
players and this works and this this
gives you a pretty good um go playing
program but it can't surpass human it's
it's only as good as the best human that
gives you the training data so DeepMind
figured out a way to actually surpass
humans and the way this was done is by
self-improvement now in the case of go
this is a simple closed sandbox
environment you have a game and you can
play lots of games in the sandbox and
you can have a very simple reward
function which is just a winning the
game so you can query this reward
function that tells you if whatever
you've done was good or bad did you win
yes or no this is something that is
available very cheap to evaluate and
automatic and so because of that you can
play millions and millions of games and
Kind of Perfect the system just based on
the probability of winning so there's no
need to imitate you can go beyond human
and that's in fact what the system ended
up doing so here on the right we have
the ELO rating and alphago took 40 days
uh in this case uh to overcome some of
the best human players by
self-improvement so I think a lot of
people are kind of interested in what is
the equivalent of this step number two
for large language models because today
we're only doing step one we are
imitating humans there are as I
mentioned there are human labelers
writing out these answers and we're
imitating their responses and we can
have very good human labelers but
fundamentally it would be hard to go
above sort of human response accuracy if
we only train on the humans so that's
the big question what is the step two
equivalent in the domain of open
language modeling um and the the main
challenge here is that there's a lack of
reward Criterion in the general case so
because we are in a space of language
everything is a lot more open and
there's all these different types of
tasks and fundamentally there's no like
simple reward function you can access
that just tells you if whatever you did
whatever you sampled was good or bad
there's no easy to evaluate fast
Criterion or reward function so yeah
that makes a lot of sense cuz in math versus language it's actually you know I mean I
guess in math it's obviously a bit
easier but of course in language things
are pretty much open to interpretation
like how do you judge whether or not
advice is good or bad you know it's
pretty subjective but within this it's
just like if you win a game you win a
game and then of course you can easily
train on that like of course that's an
easy reward function to use but the
point here is that um you know when we
look at this we can see that you know
this kind of architecture where you have
you know learning that's not just based
on human inputs is truly different
because it actually allows the system to
become you know super intelligent so the
one that they actually built that was
you know the basically you know the
biggest and most superhuman actually
didn't train on human data which is
pretty crazy so we had this AI system
that was literally able to you know
search over multiple different moves um
and that that was a key factor in its
success so basically right here in the
alpha go documentary they actually talk
about how you know Alpha go was being
able to search over you know 50 or 60
moves ahead and that's how it was able
to get you know this remarkable level of
accuracy here
search 50 or 60 moves that's the maximum number of moves ahead AlphaGo is looking from the current game position it's typically over 50 it's often over 60 in the games we see often 150 AlphaGo goes for the kill we're 115 now so
getting to that so yeah it's a pretty
interesting documentary I'm sure I'm
sure many of you guys have probably seen
it but I think this finding is
definitely really intriguing because it
does show us that a lot of the work and a lot of the ideas that we've seen you know
some of them are kind of you know
gaining more validity because it shows
us that you know when people were
talking about how llms plus search could
be a very very fascinating thing to
discover and of course it could lead to
you know potentially maybe you know
superhuman capabilities or maybe even
the capabilities that just vastly exceed
human capabilities I mean right now
we're seeing that with this paper um the
initial results are truly truly
surprising like an 8 billion parameter model outpacing you know GPT-4 on the GSM8K is pretty pretty impressive which shows us that I mean like people have said like Leopold Aschenbrenner has said you know
what kind of growth are we going to be
be experiencing over the next couple of
years when people actually develop you
know literal ways to just improve the
models that we do have now with
different prompting techniques and
different ways to actually use the based
models that we have now so this is
definitely something that is very very
fascinating combining llms with search is you're building a true true expansion in
terms of the capabilities now what's
crazy about this as well uh something
that I think that most people did also
miss was that combining llms with search
is going to be a big thing in the future
but the one major thing that is actually
stopping it is compute because basically
Alpha go was uh pretty pretty uh compute
intensive because you're searching over
so many different methods but
essentially one thing that you need to
know is that Google actually did an
alpha code 2 paper and I did a video on
this but it didn't get that many views
compared to the Gemini news cuz the
Gemini news was basically just stealing
the spotlight and the thing about that
was the fact that like a lot of people
overlooked what Alpha code 2 was because
it was a very fascinating Insight with
as to what is going to come for the
future so basically Alpha code 2 and
this relates back to this before cuz
basically it also uses a search
algorithm um and a reranking mechanism
but this isn't you know Monte Carlo tree
search but the point is is that when
they combined you know language models
and a bespoke search and reranking
algorithm they were able to perform you know better than 85% of competition participants which is a huge huge huge
Improvement and basically they
essentially used a sampling mechanism
that encourage generating a wide
diversity of code samples to search over
the space of possible programs and
basically what they did was they built a model like a large language model kind
of thing and basically they combined
this with advanced search you can see
right here and reranking mechanism
tailored for competitive programming and
this thing was really really good at
competitive programming guys like this
was really really good okay and how did
it get really good they combined it with
search and they were able to get these
you know really insane capabilities on
coding which is why you know I'm
bringing this back to Q* because it
shows us that these capabilities you
know when you combine them with search
it seems that there are some very very intriguing initial findings here now if
we look at the alpha code paper you can see here as well that it says our
sampling approach is close to that of
alpha code we generate up to a million
code samples per problem using a randomized temperature parameter for
each sample to encourage diversity we
also randomize targeted metadata
including the prompt such as problem
difficulty rating and its categorical
tags and it says massive sampling allows
us to search the model distribution
thoroughly and generate a large
diversity of code samples maximizing the
likelihood of generating at least some
correct samples so overall you can see
here as well this is what I
wanted to show you guys is that despite
Alpha code 2's impressive results a lot
more remains to be done before we see
systems that can reliably reach the
performance of the best human coders our
system requires a lot of trial and error I wouldn't say they've solved the problem of programming but the point is that
like with search they've had some very
very impressive results the only problem
here is that it's too costly to operate
at scale which means they can't scale
this thing um and that's of course a
kind of like an issue if you of course do want to use this in your products
or whatever so of course there are some
things to work on in terms of
optimization and how you kind of get
that down but overall when we look at
this you can see here that it's really
really incredible because you can see
that you know they demonstrate a clear
Trend where increased rollouts correlate
with higher success rates highlighting
the algorithm's potential to improve
performance through iterative refinement
and they also say that these findings
affirm the Monte Carlo tree self-refine algorithm's robustness and its utility in
tackling complex unseen mathematical
problems I am wondering if this is very
similar to one of the you know pieces of
the article that said Q* was able to solve math problems that it hadn't
seen before an important technical
Milestone a demo of the model circulated
within open AI in recent weeks and the
pace of development alarmed some
researchers focused on AI safety now
what's crazy about this as well is that
around that time I do remember that they hired Noam Brown and basically what he did was he spoke on the Lex Fridman podcast about how you know in order to make superhuman systems stuff like Monte Carlo tree search and you know being able to search over multiple different things is really important in a superhuman AI system it was very heavily focused on
search um looking many many moves ahead
farther than any human could and that was key for why it won and then even with something like AlphaGo I mean AlphaGo is
commonly hailed as a landmark
achievement for neural nets and it is but there's also this huge component of search Monte Carlo tree search to AlphaGo that was key absolutely
essential for the AI to be able to beat
top
humans um I think a good example of this
is you look at the latest versions of AlphaGo like it was called AlphaZero um and there's this metric
called ELO rating where you can compare
different humans and you can compare
Bots to humans now a top human player is
around 3,600 ELO maybe a little bit
higher now um Alpha zero the strongest
version is around 5200 ELO but if you
take out the search that's being done at
test time and but by the way what I mean
by search is the planning ahead the
thinking of like oh if I move my if I
place this Stone here and then he does
this and then you look like five moves
ahead and you see like what the board
state looks like um that's what I mean
by search if you take out the search
that's done during the game the ELO
rating drops to around 3,000 so even today what 7 years after AlphaGo if you take out the Monte Carlo tree search that's being done when playing against a human the bots are not superhuman nobody has made a raw neural net that is superhuman at go so yeah I found that clip to be
pretty fascinating the entire interview
is on Lex Fridman but um I think one of
the tweets and I'm sure I've referenced
this before but this is why this tweet
is so important and this is why I
included the clip because I think it
goes to show that this person working at OpenAI Noam Brown you know he was
tweeting about these kinds of things and
he was stating okay that you know all
these prior methods are specific to the
game but if we can discover a general
version the benefits could be huge yeah
yes inference may be slower like a
thousand times slower and of course be
more costly but what inference cost
would we pay for a new cancer drug or
proof of the Riemann hypothesis so
basically what he's saying here is that
look if we could get a system that could
really you know truly understand certain
you know areas and truly truly truly
truly like you know give us answers that
are worthwhile even if the speed of those
answers is like a thousand times lower
and they cost maybe not a thousand times
more but like 500 times more if those
answers are things that fundamentally change our level of understanding on
certain topics then entire new paradigms
are going to be built off the back of
that so I truly believe that this is uh
you know a step in the right direction
because many people in the you know AI community even skeptics like Yann LeCun have actually spoken about this uh and
I've even seen you know Gary Marcus
comment even you know under a few of
these tweets and state that you know
this is some good stuff so it seems that
this might be an area of further
exploration for OpenAI although we haven't really seen OpenAI state
anything just yet because as you know
research from open AI is pretty much
closed off because they are a private
company now I do think interestingly
enough Sam Altman may have actually hinted at this because you remember how you know Monte Carlo tree search you're basically searching you know to be able
to see what kind of you know solution
you can get but Sam Altman in a Bill
Gates interview in a very very very
short clip he actually says you know
something about that along the lines
that I'm going to play it once for you
guys here you know if you ask GPT-4 most questions 10,000 times one of
those 10,000 is probably pretty good but
it doesn't always know which one and
you'd like to get the best response of
10,000 each time so that'll be that increase in reliability will be so
one of the things I'm wondering is since
he said that you know if you ask GPT 4 a
question 10,000 times one or two of them
is going to be absolutely amazing so
what if what if like you know this is
what they're working on for future
systems maybe they're working on this for GPT-6 or GPT-5 I don't know maybe
that's why they need all these data
centers because a large majority of what
they're doing is probably going to be
you know maybe search based so that they
can truly get you know reasoning that
gives you the best answer maybe it
literally just generates a bunch of
answers you know whichever kind of
search algorithm they do use but um
these kinds of ideas are definitely a
step in the right direction because I
think this is something that everyone is
truly agreeing on so so far what we have
here is a truly truly fascinating paper
you know some of it might have got cut out but I do
think that this entire topic is very
fascinating I do really really wonder
what open AI are going to come out with
next I do believe that you know if
they've got someone like Noam Brown on the team they've already had you know whatever breakthroughs they've had with Ilya Sutskever and we're starting to see
slowly that you know a lot of these
other you know labs are starting to
catch up so as I was rendering this
video I actually saw a tweet on my
timeline that kind of uh gave me a
decent amount of information on
something that was going on inside the
AI community and it was something that I
watched recently in an interview with Dwarkesh Patel he has honestly some of the
most insightful interviews with some of
the brightest Minds in the artificial
intelligence space but um yeah it's
pretty crazy so basically all right the
the news is this okay so there was this
and this actually does link back to the
original Q* you know thing that we just saw so just bear with me a second basically right there is this ARC-AGI thing that basically they're stating that this is the new benchmark that actually tries to prove whether or not you know a system is AGI or not and you
cannot achieve AGI without surpassing
this Benchmark and that's the only kind
of Benchmark that people are going to
consider I'll show you guys a short clip
of that interview now one Arc puzzle it
looks kind of like an IQ test puzzle
you've got a number of demonstration
input output pairs so a uh one pair is
made of two grids so one grid shows you
an input and the second grid shows you
uh what you should produce as a response
to that input and you get uh a couple uh
pairs like this to demonstrate the
nature of the task to demonstrate what
you're supposed to do with your inputs
and then you get
uh a new test input and your job is to
produce the corresponding uh test output
you look at the demonstration Pairs and
from that you figure out what you're
supposed to do and you show that you've
understood it on this new test
pair and um importantly the sort of knowledge basis that you need in order to approach these challenges is you just need core
knowledge and core knowledge is uh it's basically the knowledge of what makes an object uh basic counting basic geometry
topology symmetries and that sort of
thing so extremely basic knowledge llms
for sure possess such knowledge any
child possesses uh such knowledge um and
what's really interesting is that each
puzzle is new so it's not something that
you're going to find uh uh Elsewhere on
the internet for instance uh and that
means that whether it's as a human or as
a machine every puzzle you have to
approach it from scratch you have to
actually reason your way through it you
cannot just fetch the response from your
memory so the core knowledge so now
after that interview came out okay you
can see that um you know he tells his
coworker Ryan and within six days they beat the state-of-the-art on ARC and are on the heels of average human
performance which would mean that
they're teetering on the edge of
something that some would consider
artificial general intelligence he says
on a held-out subset of the train set where humans get 85% accuracy my solutions get 72% accuracy so this is
pretty crazy but what's interesting is
that he said I started on this project a
few days before Dwarkesh Patel recorded the recent podcast with Chollet this was inspired by Dwarkesh talking to my coworker Buck about ARC-AGI and being like come on surely you can do better than the current state of the art using llms and
basically right here this is the main
argument okay the main argument is that
you know um llms are just mimicking
patterns they aren't true AI systems
it's just you know not possible to get
to AGI with llms and what we're doing
however you can see right here there's a
meme um and you're seeing that you know
someone's just saying why don't we just
draw more samples and then we can just
get you know infinitely better so
they're basically stating that we might
be able to get to AGI by just providing
more samples now so for context right
here you can see that ARC-AGI is a visual reasoning benchmark that requires guessing a rule from a few examples and its creator François Chollet claims that
and then this is the crazy thing okay so
it says Ryan's approach involves
carefully crafted few-shot prompts that he
uses to generate many possible python
programs to implement the
Transformations he generates 5,000
guesses selects the best ones using the
examples and then has a debugging step
and the results are incredible they get
71% versus a human Baseline of 85% and
then that compares to the prior state of the art of 51% and you can see here he
says scaling the number of sampled
python rules reliably increased
performance up to 3% accuracy for every
doubling and we are still quite far from
the millions of samples that Alpha code
uses basically what he's stating here is
that they did really really well and
they didn't even need to sample millions
of you know samples like Alpha code did
which we just talked about and François Chollet actually responded stating that
this has been the most promising branch
of approaches so far leveraging an llm
to help with discrete program search by
using the llm as a way to sample
programs or branching decisions this is
exactly what neurosymbolic AI is for
the record so this is pretty crazy
because even he the guy who created The
Benchmark is stating that this is the
right path to go down and even some of
you know the craziest critics like Gary
Marcus who state that there will be no
AGI without neurosymbolic AI and he's
got this entire talk like I was
literally listening to this and
basically it's a 30-minute talk in which
he discusses you know it's you know
everyone's got the current approach
wrong and it's a really really
fascinating you know uh you know piece
of where he talks about you know this
this entire you know Paradigm that we
are on is probably probably wrong but
I'm guessing that maybe now with this
new sort of approach that we're doing
things actually might start to get
towards human level in nearly every
aspect now what's actually crazy about
this entire thing here is that you can
see that in his qualitative analysis
there are actually some key things where
GPT-4o is actually pretty limited you can see GPT-4o is limited by failures other than reasoning which we know is you know I guess you could say pretty limited in the fact that GPT-4o's vision is terrible on grids when asked to describe what is in a somewhat large grid it often fails to see the input correctly and states wrong facts about what colors are in some location or what's present in particular it totally fails to extract the colors of the cells from an image for images 12 by 12 and is quite bad at 8 by 8 if humans had visual abilities as poor as GPT-4o's it would often take them quite a bit of effort to even solve
simple ARC-AGI problems if you want a frustrating time try solving some ARC-AGI problems without using vision other
than reading that is try to do them
without ever drawing out the grids in 2D
forcing yourself instead just to
interact with a textual representation
of the data for Hard Mode you could try
doing this blindfolded with a friend
allowing you to dictate python lines of
code to run on the image and I think
that this will be quite hard so
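To get a feel for the textual-representation handicap he's describing, here is a minimal sketch of how a grid ends up looking when it's fed to a model as text rather than as an image. The color mapping below is illustrative only, not the official ARC palette:

```python
# Minimal sketch: render an ARC-style grid of color indices as plain text.
# The COLORS mapping here is illustrative, not the official ARC palette.

COLORS = {0: "black", 1: "blue", 2: "red", 3: "green", 4: "yellow"}

def grid_to_text(grid):
    """Return one line per row, with cells rendered as color names
    separated by '|' -- the kind of flat representation an LLM sees."""
    return "\n".join(
        "|".join(COLORS.get(cell, str(cell)) for cell in row)
        for row in grid
    )

print(grid_to_text([[0, 1], [2, 3]]))
```

Even for a tiny 2x2 grid this produces two terse lines of color names; for a 12x12 grid the model has to track 144 cells purely from text, which is exactly why working without real vision is so frustrating.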
So basically here he's trying to state that, look, this system he built, which is able to get a state-of-the-art score on the ARC-AGI evaluation benchmark, is very limited by the fact that GPT-4o's vision is just innately not that good for this specific task, whereas human vision is really good. So this is going to be one of the things where, on future models with much better vision systems, we're going to be seeing whether the interpretation gets a lot better as well. Now, of course, he says GPT-4o isn't
that good at coding and makes simple mistakes, like off-by-one errors, extremely often, and they don't do multi-round debugging because it's probably cheaper and more effective just to get more samples in the current regime. Of course, GPT-4o sometimes hallucinates, which could reduce the reliability of the results. And he says GPT-4o is worse at using long context than other models: he thinks the long context for GPT-4o is quite bad and starts taking a big hit after about 32,000 to 40,000 tokens, based on his qualitative impression, which is limited by his ability to use longer prompts with more examples and detailed representations. He also says that it doesn't seem to respect his few-shot prompt and often does somewhat worse than it should based on the few-shot examples; for instance, it systematically responds with much shorter completions than it is supposed to, even if he gives it very specific instructions to do otherwise. So, of
course, you can see here that GPT-4o's context length is something that hasn't really increased that much, and this is because OpenAI, I wouldn't say has stopped, but they haven't been under the same pressure as the other AI labs to roll out systems that can take in, and of course output, huge context lengths. He also states that not having flexible prefix caching substantially limits approaches, which is of course something that is going to limit the system. And then he says that removing these non-reasoning weaknesses would improve the performance of his solution by a significant amount; vision is especially a large weakness. So the point
here, guys, and this is not so much a revelation but it should be an eye-opener for you, is that this new benchmark wasn't solved immediately, but the guy was able to get around 50% using the current state-of-the-art technology, and that's even with these very obvious limitations to the systems. Which means for us that there is still very large room for AI to grow, because we haven't hit a wall where we have no idea how to improve the vision, the coding, or the long context; these are things that, while I wouldn't say they have defined solutions, people are actively working on, and we can predict with a decent amount of confidence that they are going to get better. And once these get better, once you combine them into certain frameworks and architectures, like using GPT-4o in this context, I think this is probably going to surpass the ARC-AGI benchmark, probably by the time GPT-5 is released, which would be pretty fascinating. It doesn't mean it's going to be AGI, but maybe it's going to be better than the average human. So I'll leave a link to
this in the description, but of course there are some predictions in here, and it's pretty interesting. He says there is a 70% probability that a team of three top research machine learning engineers, with fine-tuning access to GPT-4o, $10 million in compute, and one year of time, could use GPT-4o to surpass typical naive MTurk performance at ARC-AGI on the test set while using less than $100 per problem at runtime. And he says there is a 60% probability that if a next-generation frontier model like GPT-5 were much better at basic visual understanding, for example above 85% accuracy on Vibe-Eval Hard, then using this same exact method, with minor adaptation and tweaks as needed, that LLM would surpass typical naive MTurk performance. Basically, what he's stating here is that there's an 80% probability that the next generation of multimodal models, future systems like GPT-5, will be able to substantially improve the performance on ARC-AGI, which is pretty incredible. It means that when the next systems do release, it is very likely that we're going to see a whole lot of new benchmarks broken, especially given all of these sorts of frameworks that are going to be surrounding this technology.
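To tie the whole thing together: the approach Chollet praised, using the LLM only to sample candidate programs and then verifying them by execution, can be sketched in a few lines of Python. The LLM sampling step is stubbed out here with hand-written guesses, so this is an illustration of the search loop, not the actual pipeline:

```python
# Sketch of LLM-guided program search for ARC-style grid tasks.
# The LLM sampling step is stubbed with hand-written candidates;
# in the real approach, thousands of Python programs are sampled
# from a model like GPT-4o and filtered by execution.

from typing import Callable, List, Tuple

Grid = List[List[int]]

def sample_candidate_programs(n: int) -> List[str]:
    """Stand-in for an LLM call that would return n candidate programs."""
    return [
        # guess 1: mirror each row left-to-right
        "def transform(g):\n    return [row[::-1] for row in g]",
        # guess 2: transpose the grid
        "def transform(g):\n    return [list(r) for r in zip(*g)]",
    ]

def compile_program(src: str) -> Callable[[Grid], Grid]:
    namespace: dict = {}
    exec(src, namespace)  # in a real system this must be sandboxed
    return namespace["transform"]

def search(train_pairs: List[Tuple[Grid, Grid]]) -> List[str]:
    """Keep only the candidates that reproduce every training pair."""
    survivors = []
    for src in sample_candidate_programs(n=2):
        try:
            fn = compile_program(src)
            if all(fn(inp) == out for inp, out in train_pairs):
                survivors.append(src)
        except Exception:
            continue  # crashing or ill-formed candidates are discarded
    return survivors

# Toy task: the output grid is the input mirrored left-to-right.
good = search([([[1, 2], [3, 4]], [[2, 1], [4, 3]])])
```

Only the mirror program survives the check here; scaling the same sample-then-verify loop to thousands of candidates per task, with better prompts and grid representations, is the essence of how the roughly 50% score discussed above was reached.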
So this is just a quick reminder for those of you who like to support the channel: I recently launched a Skool community, a private community where we focus on things like the post-AGI framework that I developed. You get instant download access to a framework that easily allows you to navigate the post-AGI economy with ease, plus my personal strategy for making money with AI, exclusive tutorials on how to actually use no-code AI agent frameworks, and the AGI-proof investment deck that's helped me make very sizable returns. If that interests you, don't forget to check it out; if not, just enjoy the rest of the videos on the channel.