OpenAI's Q* Is BACK! - Was AGI Just Solved?

TheAIGRID
23 Jun 2024 · 30:54

Summary

TLDR: This video covers some intriguing recent developments in the AI community, in particular the ability of large language models (LLMs) to solve math problems. According to a recent paper, an LLM using the Monte Carlo tree search technique Google used for AlphaGo recorded a very high accuracy on the GSM-8K math benchmark. The video also revisits Q*, the OpenAI model rumored to solve math problems it has never seen, and discusses the evolution of AI and where it may be heading.

Takeaways

  • 🧠 Large language models (LLMs) are showing the ability to solve math problems, which is drawing attention as a technical milestone.
  • 📈 Research using the Monte Carlo tree search technique Google used for AlphaGo shows that an LLM with 200 times fewer parameters can still achieve a high math problem-solving rate.
  • 🔍 Q* stirred up the AI community and became a talking point right around the time Sam Altman was fired.
  • 🤖 GPT-0 was a secret project, named as a nod to DeepMind's AlphaZero program, exploring ways for language models to solve math and science problems.
  • 🎯 Using Monte Carlo tree search combined with backpropagation, Llama 3 recorded a high score on the GSM-8K math benchmark.
  • 🚀 Q* reportedly demonstrated the ability to solve math problems it had never seen, raising AI safety concerns.
  • 🔑 As Andrej Karpathy has argued, techniques like Monte Carlo tree search will be essential for future models and will help AI improve.
  • 📚 AlphaGo learned through self-improvement to surpass humans, an important template for LLMs as well.
  • 🛠️ Compute cost is the main obstacle to the spread of techniques combining LLMs with search, pointing to the need for optimization.
  • 🌐 ARC-AGI is being put forward as a new benchmark, an important yardstick for evaluating whether a system has achieved AGI (artificial general intelligence).
  • 🔮 Future models are expected to have a shot at beating the ARC-AGI benchmark by improving vision, coding, and long-context understanding.

Q & A

  • What is Q*, and why has it captivated the AI community?

    - Q* reportedly solves math problems it has never seen before, an important technical milestone, which is one reason the AI community is watching it. Q* was the subject of an article in The Information describing a breakthrough at OpenAI shortly before Sam Altman was fired.

  • How does recent research show large language models helping with math?

    - Recent research shows that a large language model using Monte Carlo tree search and backpropagation can solve math problems: Llama 3 at 8 billion parameters recorded 96.7% accuracy, beating GPT-4, Claude, and Gemini despite having 200 times fewer parameters.

  • What was the GPT-0 project, and why was it run in secret?

    - GPT-0 was a secret project OpenAI launched in 2021 to let language models solve reasoning tasks, including math and science problems. Its name was a nod to DeepMind's AlphaZero, the program that could play chess, Go, and shogi.

  • What is Monte Carlo tree search, and how does the video say it could be applied to AI?

    - Monte Carlo tree search is the algorithm used in AlphaGo, in which an AI system searches over the possible configurations before deciding on the next move. The video says this technique can be applied to language models to help solve math problems.
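
    For reference, most MCTS variants pick which node to try next using the standard UCT selection rule; this is textbook background, not a formula quoted in the video:

    \[
    a^{*} = \arg\max_{a}\left(\bar{Q}(a) + c\sqrt{\frac{\ln N}{n_a}}\right)
    \]

    where \(\bar{Q}(a)\) is the average reward observed for move \(a\), \(n_a\) is its visit count, \(N\) is the parent's visit count, and \(c\) is an exploration constant.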

  • What does Andrej Karpathy say about the future of language models?

    - Andrej Karpathy says that techniques like Monte Carlo tree search will be needed for future language models. In his one-hour talk, he describes them as what these models need in order to keep improving.

  • What was the key to AlphaGo's success?

    - The keys to AlphaGo's success were self-improvement and Monte Carlo tree search. AlphaGo started by imitating expert human players, but later reached a superhuman level through self-improvement.

  • What challenges do current language models face?

    - Current language models face challenges in basic vision, coding ability, and handling long context. Overcoming these limitations is an important step toward improving AI performance.

  • What is ARC-AGI, and why does it matter?

    - ARC-AGI is a visual reasoning benchmark that requires inferring a new rule for each puzzle. It matters because it can show whether an AI can adapt to novel situations and solve problems using only core knowledge.

  • Where are GPT-4o's current limits?

    - GPT-4o has limits in visually understanding grids, in coding, where it makes basic mistakes, and in using long context. Moving past these limits is expected to improve AI performance.

  • What does AI need in order to beat the ARC-AGI benchmark?

    - To beat the ARC-AGI benchmark, AI needs better basic visual understanding, better coding ability, and better handling of long context. Learning from more samples is also an important ingredient.

Outlines

00:00

🧠 Advances in AI math ability

Recent research shows that large language models can be very good at solving math problems. Using the Monte Carlo tree search and backpropagation techniques Google used for AlphaGo, the Llama 3 8-billion-parameter model recorded 96.7% on the GSM-8K math benchmark, a surprising result that beats GPT-4, Claude, and Gemini despite 200 times fewer parameters.

05:02

🤖 Lessons from AlphaGo

AlphaGo started by imitating human players and acquired superhuman skill through self-improvement. The same recipe may apply to large language models, which today only do the imitation stage; however, because language models lack a general reward criterion, step two, self-improvement, is considered hard.

10:05

🔍 Combining LLMs with search

Combining LLMs with search algorithms has been shown to greatly improve AI capability. In the AlphaCode 2 example, combining search with a reranking mechanism boosted competitive-programming performance, which hints at how future AI systems may exploit search.

15:06

🚀 The road to AGI

For AI to achieve general intelligence, current limitations must be overcome. On the ARC-AGI benchmark, GPT-4o's limited vision and coding skills are the bottleneck, but if these limitations are overcome, AI could gain the ability to handle more advanced tasks.

20:07

🌐 OpenAI's next steps

Attention is focused on OpenAI's future moves. They are expected to aim at improving responses using search algorithms and at handling longer context, and new technical breakthroughs are anticipated that could raise AI capabilities further.

25:08

📈 Growth forecasts for AI

AI capability is growing steadily, and new benchmarks are expected to fall. Next-generation models such as GPT-5 are expected to have much better basic visual understanding and to beat the ARC-AGI benchmark, suggesting further development of AI technology.

30:10

🛠️ Outlook for AI

The outlook for AI technology is bright. Combining search algorithms with LLMs may raise AI capabilities further, and new technical breakthroughs suggest AI will get better at solving more complex problems.

Keywords

💡Q*

Q* refers to a rumored OpenAI model said to have hit a technical milestone: solving math problems it had never seen. The story energized the AI community and drew a lot of attention. The video presents Q* as one of the key technical breakthroughs pointing at AI's math ability and suggests it may indicate the direction of future AI development.

💡Large Language Models (LLMs)

Large language models are advanced AI models trained on huge amounts of data to handle natural-language tasks. The video shows LLMs solving math problems, an important expansion of what AI can understand and do; in particular, it notes that the Llama 3 model recorded a high score on the GSM-8K math benchmark.

💡Monte Carlo Tree Search

Monte Carlo tree search is an algorithm made famous by its use in Google DeepMind's AlphaGo. The video explains how this technique, combined with LLMs, is being used to solve math problems, and treats it as a key ingredient for improving AI's ability to self-improve.

💡GPT

GPT refers to the series of language models developed by OpenAI, with versions such as GPT-3 and GPT-4 drawing particular attention in the AI field. The video uses GPT-4 as the comparison point for models attempting math problem solving.

💡Self-improvement

Self-improvement is the process by which an AI system raises its own performance through self-directed learning. Using the example of AlphaGo surpassing human Go players via self-improvement, the video discusses the possibility of AI achieving new breakthroughs the same way.

💡Neural Networks

Neural networks are AI algorithms loosely modeled on the structure of the human brain; DeepMind's AlphaGo is one application. The video stresses that neural networks drive AI capability gains and were essential to AlphaGo's success.

💡Search Algorithms

Search algorithms are AI techniques for exploring candidate solutions to a problem. The video argues that search contributes to AI's math problem-solving ability, with Monte Carlo tree search presented as one example.

💡Benchmarks

Benchmarks are standardized tests for evaluating AI performance. The video cites the GSM-8K math benchmark and shows that Llama 3 recorded a high score with far fewer parameters than other models.

💡AGI (Artificial General Intelligence)

Artificial general intelligence refers to AI with the ability to handle diverse tasks the way human intelligence does. The video suggests that for AI to reach AGI, improvements in basic vision and coding ability are needed.

💡Neurosymbolic AI

Neurosymbolic AI is an approach that combines neural networks with symbolic processing. The video presents it as playing an important role on the way to AGI, citing in particular its success on the ARC-AGI benchmark.

Highlights

Q* may not be dead; there have recently been signs that it is real.

Q* once drew the AI community's attention for its ability to solve math problems it had never seen.

Research shows large language models excelling at math using Monte Carlo tree search.

The Llama 3 8B model beats GPT-4 and Claude on the GSM-8K math benchmark.

Applying search to large language models (LLMs) is a new research direction.

The GPT-0 project was a secret OpenAI project exploring the use of LLMs to solve math and science problems.

Researchers hypothesized that giving LLMs more time and compute could enable new academic breakthroughs.

The paper demonstrates accessing GPT-4-level Mathematical Olympiad solutions via Monte Carlo Tree Self-refine with Llama 3 8B.

Andrej Karpathy discussed the use of Monte Carlo tree search in AlphaGo.

AlphaGo surpassed human Go players through self-improvement.

Large language models are currently trained mainly by imitating human responses.

The lack of a general reward criterion is the main challenge facing open-ended language modeling.

AlphaCode 2 combined a search algorithm with a reranking mechanism, significantly improving programming ability.

Compute cost is the main factor limiting the application of search algorithms to LLMs.

Noam Brown discussed the importance of search in building superhuman AI systems.

Sam Altman hinted that future systems may incorporate search algorithms to improve reliability.

ARC-AGI proposes a new way of evaluating whether a system has reached artificial general intelligence.

A method using GPT-4o with few-shot prompting achieved remarkable results on the ARC-AGI benchmark.

Although GPT-4o has limitations in vision and coding, future models are expected to fix these weaknesses.

Predictions suggest the next generation of multimodal models has a high probability of surpassing average human performance on ARC-AGI.

Transcripts

00:00

So if you thought Q* was dead, or that we weren't going to get any more updates, think again, because recently we actually got a very small inkling that Q* might just be real. Q* captivated the AI community for several reasons, because there were so many things going on, but recently I saw a tweet on my timeline referencing a research paper that showed how tiny large language models are actually as good at maths as a frontier model. By using the same techniques that Google used to crack Go with AlphaGo, which were Monte Carlo tree search and of course backpropagation, Llama 3 at 8 billion parameters gets 96.7% on the math benchmark GSM-8K, and that is better than GPT-4, Claude and Gemini with 200 times fewer parameters. This is a pretty crazy revelation, because it was something I didn't see coming just yet: the entire framework of applying search to LLMs is, I wouldn't say still in its early stage, but it is something people were already thinking about when we went from GPT-3.5 to GPT-4. There was a lot of talk about this being explored, and clearly we're now seeing it explored, and the results are truly impressive.

01:24

Now I am going to dive into a little bit of this research paper. If you don't remember Q*, it was basically this crazy story, because around the time that Sam Altman got fired there was an article in The Information about how OpenAI had made a breakthrough before Sam Altman was fired, which was stoking excitement and concern. The crazy thing you need to know about Q* is that it was able to solve math problems it hadn't seen before, which is an important technical milestone, and apparently a demo of the model circulated within OpenAI in the weeks before, and the pace of development alarmed some researchers focused on safety.

02:06

What is crazy about all of this is that the team working on this, which Sutskever oversaw, was looking for ways to allow LLMs like GPT-4 to solve tasks that involve reasoning, like math or science problems, and in 2021 they actually launched a secret project called GPT-Zero, which was a nod to DeepMind's AlphaZero program that could play chess, Go and shogi. So they were already working on this in those early days when they launched GPT-Zero. From a fundamental point of view, this is not something people hadn't thought about before; you can see some of the top researchers had thought about it, and they initially hypothesized that by giving large language models more time and more computing power to generate responses to questions, they could allow them to develop new academic breakthroughs.

02:58

Of course, that is pretty crazy, because what we see in this paper is that they actually did this. The paper is "Accessing GPT-4 Level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with Llama 3 8B", a technical report, and you can see right here they state that the Monte Carlo Tree Self-refine algorithm represents an integration of Monte Carlo tree search with large language models, abstracting the iterative refinement process of mathematical problem solutions into a search tree structure. Nodes on this tree represent different versions of answers, while edges denote attempts at improvement, and the algorithm's operational workflow adheres to the Monte Carlo tree search algorithm's general pattern. So they're basically saying they're doing the same thing they did with AlphaGo; they're using that same kind of structure with LLMs.

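To make that tree structure concrete, here is a minimal sketch of an MCTSr-style loop. The llm_answer, llm_refine, and llm_score helpers are hypothetical stand-ins (toy stubs here so the sketch runs); the paper's actual prompting, self-reward, and selection details differ.

import math
import random

def llm_answer(problem):            # stand-in: initial draft answer
    return f"draft answer to {problem}"

def llm_refine(problem, answer):    # stand-in: critique-and-rewrite step
    return answer + " (refined)"

def llm_score(problem, answer):     # stand-in: self-reward in [0, 1]
    return random.random()

class Node:
    def __init__(self, answer, parent=None):
        self.answer = answer        # one candidate version of the solution
        self.parent = parent        # the edge from parent = a refinement attempt
        self.children = []
        self.visits = 0
        self.value = 0.0            # running mean of observed rewards

def ucb(node, c=1.4):
    # UCT selection: prefer high-value answers, but explore rarely-tried ones.
    if node.visits == 0:
        return float("inf")
    return node.value + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def all_nodes(node):
    yield node
    for child in node.children:
        yield from all_nodes(child)

def mctsr(problem, rollouts=8):
    root = Node(llm_answer(problem))
    root.visits = 1
    for _ in range(rollouts):
        # 1. Selection: descend by UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # 2. Expansion: one refinement attempt creates a child answer.
        child = Node(llm_refine(problem, node.answer), parent=node)
        node.children.append(child)
        # 3. Evaluation: score the refined answer.
        reward = llm_score(problem, child.answer)
        # 4. Backpropagation: update value estimates up to the root.
        while child is not None:
            child.visits += 1
            child.value += (reward - child.value) / child.visits
            child = child.parent
    # Return the best-scoring answer found anywhere in the tree.
    return max(all_nodes(root), key=lambda n: n.value).answer

print(mctsr("integrate x^2 from 0 to 1"))

The "eight rollouts" figure quoted below corresponds to the rollouts parameter here: each rollout is one more pass of select, refine, score, backpropagate.
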
03:49

The craziest thing about this is that Andrej Karpathy actually spoke about what you get if you use Monte Carlo tree search, which is something they used in AlphaGo. Basically how this works, because I know a lot of people might not be familiar with all the news from the past couple of months, is that the AI system searches over all the possible configurations before making a move; once it makes that move, it can then, on the next move, search again over all the possible board configurations and then make the next move. That is essentially how it works, and it's something Andrej Karpathy describes in his one-hour talk about language models, where he basically says this is something we need for future models if we want things to improve.

04:41

It's pretty interesting that we're now seeing a paper where a model that is so small is able to do this. If we look at the benchmarks, you can see that on GSM-8K, with the eight rollouts that it did, it achieved 96.66%, call it 97%, and if you check some of the frontier models it literally surpasses them, not by a large amount, but you've got to remember, guys, this is 8 billion parameters compared to a model that is 1.8 trillion parameters; and as for Gemini Ultra, I'm not even sure how many parameters that is, it's not publicly available, but I know it is a pretty large language model.

05:17

So this is Andrej Karpathy stating this, and I think his one-hour talk is definitely fascinating. I really do believe it's something everyone should watch, because it gives a lot of insight into what's coming in the future. But take a look at this, because it's kind of important:

05:31

A lot of people are broadly inspired by what happened with AlphaGo. AlphaGo was a Go-playing program developed by DeepMind, and it actually had two major stages. In the first stage you learn by imitating human expert players: you take lots of games that were played by humans, you filter to the games played by really good humans, and you learn by imitation; you're getting the neural network to just imitate really good players. This works, and it gives you a pretty good Go-playing program, but it can't surpass humans; it's only as good as the best human that gives you the training data. So DeepMind figured out a way to actually surpass humans, and the way this was done is by self-improvement. Now, in the case of Go, this is a simple, closed, sandboxed environment: you have a game, you can play lots of games in the sandbox, and you can have a very simple reward function, which is just winning the game. You can query this reward function, which tells you if whatever you've done was good or bad: did you win, yes or no? This is something that is available, very cheap to evaluate, and automatic, so because of that you can play millions and millions of games and kind of perfect the system just based on the probability of winning. There's no need to imitate; you can go beyond human, and that's in fact what the system ended up doing. Here on the right we have the Elo rating, and AlphaGo took 40 days, in this case, to overcome some of the best human players by self-improvement.

07:00

So I think a lot of people are interested in what the equivalent of this step number two is for large language models, because today we're only doing step one: we are imitating humans. As I mentioned, there are human labelers writing out these answers, and we're imitating their responses. We can have very good human labelers, but fundamentally it would be hard to go above human response accuracy if we only train on the humans. So that's the big question: what is the step-two equivalent in the domain of open language modeling? The main challenge is that there's a lack of a reward criterion in the general case. Because we are in a space of language, everything is a lot more open and there are all these different types of tasks, and fundamentally there's no simple reward function you can access that just tells you if whatever you did, whatever you sampled, was good or bad; there's no easy-to-evaluate, fast criterion or reward function.

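A toy illustration of Karpathy's point about reward criteria; winner() is a made-up stand-in for a game engine's terminal rule, and nothing here comes from the talk itself:

def winner(final_state):
    return final_state["winner"]            # stand-in for real game logic

def game_reward(final_state, player):
    # Cheap to evaluate, automatic, unambiguous: did you win, yes or no?
    return 1.0 if winner(final_state) == player else 0.0

def language_reward(question, answer):
    # There is no simple function to put here; in practice people approximate
    # it with human preference labels or learned reward models.
    raise NotImplementedError("no general, cheap reward criterion for language")

print(game_reward({"winner": "black"}, "black"))   # 1.0
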
07:49

So yeah, that makes a lot of sense, because in math versus language... I mean, in math it's obviously a bit easier, but in language things are pretty much open to interpretation: how do you judge whether advice is good or bad? It's pretty subjective. But within a game, if you win, you win, and you can easily train on that; that's an easy reward function to use. The point is that when we look at this, we can see that this kind of architecture, where you have learning that's not just based on human inputs, is truly different, because it actually allows the system to become superintelligent. The version they built that was the biggest and most superhuman actually didn't train on human data, which is pretty crazy. So we had this AI system that was literally able to search over multiple different moves, and that was a key factor in its success. Right here in the AlphaGo documentary they talk about how AlphaGo was able to search 50 or 60 moves ahead, and that's how it was able to get this remarkable level of accuracy:

08:53

Search 50 or 60 moves: that's the maximum number of moves ahead AlphaGo is looking from the current game position. It's typically over 50, it's often over 60. In the games we see, often 150... AlphaGo goes for the kill. We're at 115 now, so getting to that.

09:18

So yeah, it's a pretty interesting documentary; I'm sure many of you guys have probably seen it. But I think this finding is definitely really intriguing, because it shows us that a lot of the work we've seen is gaining more validity. When people were talking about how LLMs plus search could be a very fascinating thing to explore, and how it could potentially lead to superhuman capabilities, or even capabilities that vastly exceed human capabilities... I mean, right now we're seeing that with this paper. The initial results are truly surprising: an 8-billion-parameter model outpacing GPT-4 on GSM-8K is pretty impressive, which raises the question people like Leopold Aschenbrenner have asked: what kind of growth are we going to experience over the next couple of years, when people develop ways to improve the models we already have with different prompting techniques and different ways to use the base models? So this is definitely something that is very fascinating: combining LLMs with search is building a true expansion in terms of capabilities.

10:34

Now, what's crazy about this as well, something I think most people also missed, is that combining LLMs with search is going to be a big thing in the future, but the one major thing that is stopping it is compute, because AlphaGo was pretty compute-intensive: you're searching over so many different branches. Essentially, one thing you need to know is that Google actually did an AlphaCode 2 paper, and I did a video on it, but it didn't get that many views compared to the Gemini news, because the Gemini news was basically stealing the spotlight, and a lot of people overlooked what AlphaCode 2 was, because it was a very fascinating insight into what is going to come in the future.

11:18

So, AlphaCode 2, and this relates back to what came before, because it also uses a search algorithm and a reranking mechanism. It isn't Monte Carlo tree search, but the point is that when they combined language models with a bespoke search and reranking algorithm, they were able to perform better than 85% of competition participants, which is a huge improvement. They essentially used a sampling mechanism that encouraged generating a wide diversity of code samples to search over the space of possible programs: they built a large language model and combined it with an advanced search and reranking mechanism tailored for competitive programming, and this thing was really good at competitive programming, guys. How did it get really good? They combined it with search, and they got these really insane capabilities on coding, which is why I'm bringing this back to Q*: it shows that when you combine these capabilities with search, you get some very intriguing initial findings.

12:27

If we look at the AlphaCode 2 search, you can see it says: our sampling approach is close to that of AlphaCode; we generate up to a million code samples per problem, using a randomized temperature parameter for each sample to encourage diversity; we also randomize targeted metadata included in the prompt, such as the problem difficulty rating and its categorical tags. Massive sampling allows us to search the model distribution thoroughly and generate a large diversity of code samples, maximizing the likelihood of generating at least some correct samples. Overall, this is what I wanted to show you guys: despite AlphaCode 2's impressive results, a lot more remains to be done before we see systems that can reliably reach the performance of the best human coders. Their system requires a lot of trial and error; I wouldn't say they have solved the problem of programming, but the point is that with search they've had some very impressive results. The only problem is that it's too costly to operate at scale, which means they can't scale this thing, and that's of course an issue if you do want to use this in your products. So there are some things to work on in terms of optimization and how you get that cost down.

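Here is a minimal sketch of that sample-then-filter idea, as the passage above describes it at a high level. generate_code and run_program are hypothetical stand-ins (toy stubs so the sketch runs); AlphaCode 2 additionally clusters the surviving programs and reranks them with a scoring model.

import random

def generate_code(problem, temperature):
    # Stand-in for an LLM call; higher temperature = more diverse samples.
    body = random.choice(["x + x", "x * 2", "x ** 2"])
    return f"def solve(x):\n    return {body}"

def run_program(code, x):
    namespace = {}
    exec(code, namespace)
    return namespace["solve"](x)

def sample_and_filter(problem, examples, n_samples=1000):
    survivors = []
    for _ in range(n_samples):
        # Randomized temperature (and, in AlphaCode 2, randomized prompt
        # metadata such as difficulty tags) encourages diversity.
        code = generate_code(problem, temperature=random.uniform(0.1, 1.0))
        # Keep only programs that pass the public example tests.
        if all(run_program(code, x) == y for x, y in examples):
            survivors.append(code)
    return survivors                # a reranker would pick among these

# Problem: "double the input"; two of the three toy candidates pass.
print(len(sample_and_filter("double x", [(2, 4), (3, 6)])) > 0)

The compute objection in the passage above is visible right in this sketch: the cost scales linearly with n_samples, and AlphaCode-scale runs use up to a million samples per problem.
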
13:37

Overall, when we look at this paper, you can see it's really incredible, because the authors demonstrate a clear trend where increased rollouts correlate with higher success rates, highlighting the algorithm's potential to improve performance through iterative refinement, and they also say that these findings affirm the Monte Carlo Tree Self-refine algorithm's robustness and its utility in tackling complex, unseen mathematical problems. I am wondering if this is very similar to one of the pieces of the article that said Q* was able to solve math problems it hadn't seen before, an important technical milestone; a demo of the model circulated within OpenAI in recent weeks, and the pace of development alarmed some researchers focused on AI safety.

14:19

Now, what's crazy about this as well is that around that time, I remember, they hired Noam Brown, and he spoke on the Lex Fridman podcast about how, in order to make superhuman systems, things like Monte Carlo tree search, being able to search over multiple different possibilities, are really important. The superhuman AI systems were very heavily focused on search, looking many, many moves ahead, farther than any human could, and that was key to why they won. And then even with something like AlphaGo... I mean, AlphaGo is commonly hailed as a landmark achievement for neural nets, and it is, but there's also this huge component of search, Monte Carlo tree search, in AlphaGo, and that was key, absolutely essential, for the AI to be able to beat top humans.

15:05

I think a good example of this is to look at the latest versions of AlphaGo; it was called AlphaZero. There's this metric called Elo rating, where you can compare different humans, and you can compare bots to humans. A top human player is around 3,600 Elo, maybe a little bit higher. AlphaZero, the strongest version, is around 5,200 Elo, but if you take out the search that's being done at test time (by the way, what I mean by search is the planning ahead, the thinking of, oh, if I place this stone here, and then he does this, and then you look five moves ahead and you see what the board state looks like; that's what I mean by search), if you take out the search that's done during the game, the Elo rating drops to around 3,000. So even today, what, seven years after AlphaGo, if you take out the Monte Carlo tree search that's being done when playing against a human, the bots are not superhuman; nobody has made a raw neural net that is superhuman at Go.

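As an aside, the Elo gaps quoted here translate into win probabilities through the standard Elo expectancy formula; this is textbook background, not a calculation from the interview:

def elo_expected(rating_a, rating_b):
    # Expected score of player A against player B under the Elo model.
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

# AlphaZero (~5200) vs. a top human (~3600): essentially certain wins,
print(round(elo_expected(5200, 3600), 6))   # ~0.9999
# while the search-free net (~3000) is a clear underdog against that human.
print(round(elo_expected(3000, 3600), 3))   # ~0.031
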
16:03

So yeah, I found that clip to be pretty fascinating; the entire interview is on Lex Fridman's channel. I think one of the tweets, and I'm sure I've referenced this before, but this is why this tweet is so important and why I included the clip, is that this person working at OpenAI, Noam Brown, was tweeting about these kinds of things, and he was stating that all these prior methods are specific to the game, but if we can discover a general version, the benefits could be huge. Yes, inference may be slower, like a thousand times slower, and of course more costly, but what inference cost would we pay for a new cancer drug, or a proof of the Riemann hypothesis? Basically, what he's saying is: if we could get a system that could truly understand certain areas and give us answers that are worthwhile, even if those answers come a thousand times slower and cost maybe not a thousand but five hundred times more, if those answers fundamentally change our level of understanding on certain topics, then entire new paradigms are going to be built off the back of that.

17:08

So I truly believe this is a step in the right direction, because many people in the AI community, even skeptics like Yann LeCun, have spoken about this, and I've even seen Gary Marcus comment under a few of these tweets and state that this is some good stuff. So it seems this might be an area of further exploration for OpenAI, although we haven't really seen OpenAI state anything just yet, because, as you know, research from OpenAI is pretty much closed off; they are a private company. Interestingly enough, Sam Altman may have actually hinted at this, because, remember, with Monte Carlo tree search you're basically searching to see what kind of solution you can get, and Sam Altman, in a Bill Gates interview, in a very short clip, says something along those lines. I'm going to play it once for you guys here:

17:58

If you ask GPT-4 most questions 10,000 times, one of those 10,000 is probably pretty good, but it doesn't always know which one, and you'd like to get the best response of the 10,000 each time. So that increase in reliability will be...

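What Altman describes is essentially best-of-n sampling. A minimal sketch, with ask_model and score_answer as hypothetical stand-ins (toy stubs) for an LLM call and a reward model; nothing here is a confirmed OpenAI technique:

import random

def ask_model(question):
    return f"answer #{random.randint(1, 10_000)} to: {question}"

def score_answer(question, answer):
    return random.random()          # a real system would use a learned scorer

def best_of_n(question, n=10_000):
    # Sample many candidates and keep the one the scorer prefers.
    candidates = (ask_model(question) for _ in range(n))
    return max(candidates, key=lambda a: score_answer(question, a))

print(best_of_n("What is the capital of France?", n=100))

The hard part Altman points at is score_answer: sampling 10,000 answers is easy, while reliably knowing which one is best is the open problem.
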
18:13

So one of the things I'm wondering, since he said that if you ask GPT-4 a question 10,000 times, one or two of the answers are going to be absolutely amazing: what if this is what they're working on for future systems? Maybe they're working on this for GPT-5 or GPT-6, I don't know. Maybe that's why they need all these data centers, because a large majority of what they're doing is probably going to be search-based, so that they can truly get reasoning that gives you the best answer; maybe it literally just generates a bunch of answers with whichever search algorithm they use. These kinds of ideas are definitely a step in the right direction, because I think this is something everyone is agreeing on. So far, what we have here is a truly fascinating paper; some of it might have got cut out, but I do think this entire topic is very fascinating, and I really wonder what OpenAI is going to come out with next. If they've got someone like Noam Brown on the team, plus whatever breakthroughs they've had with Ilya Sutskever, we're slowly starting to see that a lot of these other labs are starting to catch up.

19:17

As I was rendering this video, I actually saw a tweet on my timeline that gave me a decent amount of information on something going on inside the AI community, and it relates to something I watched recently, an interview by Dwarkesh Patel; he honestly has some of the most insightful interviews with some of the brightest minds in the artificial intelligence space. So basically, the news is this, and it actually does link back to the original Q* thing we just saw, so just bear with me a second. There is this ARC-AGI benchmark, and they're stating that this is the new benchmark that tries to prove whether or not a system is AGI, that you cannot achieve AGI without surpassing this benchmark, and that it's the only kind of benchmark people are going to consider. I'll show you guys a short clip of that interview now.

20:00

One ARC puzzle looks kind of like an IQ-test puzzle. You've got a number of demonstration input/output pairs, and one pair is made of two grids: one grid shows you an input, and the second grid shows you what you should produce as a response to that input. You get a couple of pairs like this to demonstrate the nature of the task, to demonstrate what you're supposed to do with your inputs, and then you get a new test input, and your job is to produce the corresponding test output. You look at the demonstration pairs, and from that you figure out what you're supposed to do, and you show that you've understood it on this new test pair. Importantly, the sort of knowledge basis that you need in order to approach these challenges is just core knowledge, and core knowledge is basically the knowledge of what makes an object, basic counting, basic geometry, topology, symmetries, and that sort of thing, so extremely basic knowledge. LLMs for sure possess such knowledge; any child possesses such knowledge. And what's really interesting is that each puzzle is new, so it's not something you're going to find elsewhere on the internet, for instance, and that means that, whether as a human or as a machine, every puzzle you have to approach from scratch; you have to actually reason your way through it, you cannot just fetch the response from your memory.

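To make the task format concrete, here is a toy sketch of an ARC-style task; the grids and the "swap colors 1 and 2" rule are invented for illustration, not an actual ARC puzzle:

task = {
    # Demonstration input/output grid pairs; grids are small 2D arrays of
    # color indices (0-9).
    "train": [
        {"input": [[1, 0], [0, 2]], "output": [[2, 0], [0, 1]]},
        {"input": [[1, 1], [2, 2]], "output": [[2, 2], [1, 1]]},
    ],
    # The test input, whose output the solver must produce.
    "test": [{"input": [[2, 1], [1, 2]]}],
}

def transform(grid):
    # The rule a solver has to infer from the demonstrations alone.
    swap = {1: 2, 2: 1}
    return [[swap.get(cell, cell) for cell in row] for row in grid]

# Verify the inferred rule against the demonstrations, then apply to the test.
assert all(transform(p["input"]) == p["output"] for p in task["train"])
print(transform(task["test"][0]["input"]))   # [[1, 2], [2, 1]]
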
21:38

Now, after that interview came out, you can see that he tells his coworker Ryan, and within six days they beat the state of the art on ARC and are on the heels of average human performance, which would mean they're teetering on the edge of something some would consider artificial general intelligence. He says: on a held-out subset of the train set, where humans get 85% accuracy, my solutions get 72% accuracy. So this is pretty crazy, but what's interesting is that he said: I started on this project a few days before Dwarkesh Patel recorded the recent podcast with Chollet; this was inspired by Dwarkesh talking to my coworker Buck about ARC-AGI and being like, come on, surely you can do better than the current state of the art using LLMs. And right here is the main argument: the main argument is that LLMs are just mimicking patterns, they aren't true AI systems, and it's just not possible to get to AGI with LLMs and what we're doing. However, you can see right here there's a meme where someone's saying, why don't we just draw more samples, and then we can just get infinitely better; so they're basically stating that we might be able to get to AGI by just providing more samples.

22:49

For context, you can see that ARC-AGI is a visual reasoning benchmark that requires guessing a rule from a few examples, and its creator, François Chollet, claims that... and then this is the crazy thing: Ryan's approach involves carefully crafted few-shot prompts that he uses to generate many possible Python programs to implement the transformations. He generates around 5,000 guesses, selects the best ones using the examples, and then has a debugging step, and the results are incredible: they get 71% versus a human baseline of 85%, and 51% versus the prior state of the art. And you can see he says that scaling the number of sampled Python rules reliably increased performance, up to 3% accuracy for every doubling, and we are still quite far from the millions of samples that AlphaCode uses. Basically, what he's stating is that they did really well, and they didn't even need to sample millions of samples like AlphaCode did, which we just talked about. And François Chollet actually responded, stating that this has been the most promising branch of approaches so far, leveraging an LLM to help with discrete program search, by using the LLM as a way to sample programs or branching decisions; this is exactly what neurosymbolic AI is, for the record.

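A minimal sketch of that LLM-guided program search, with propose_program as a hypothetical stand-in (a toy stub) for the few-shot LLM call; the real pipeline used carefully crafted prompts, roughly 5,000 samples, and a debugging step:

import random

def propose_program(train_pairs):
    # Stand-in for the few-shot LLM call returning Python source.
    rules = [
        "def transform(g):\n    return [row[::-1] for row in g]",   # mirror
        "def transform(g):\n    return g[::-1]",                    # flip
        "def transform(g):\n    return [[c + 1 for c in r] for r in g]",
    ]
    return random.choice(rules)

def search_programs(task, n_guesses=5000):
    for _ in range(n_guesses):
        src = propose_program(task["train"])
        namespace = {}
        try:
            exec(src, namespace)
            fn = namespace["transform"]
            # Keep only programs consistent with every demonstration pair.
            if all(fn(p["input"]) == p["output"] for p in task["train"]):
                return [fn(t["input"]) for t in task["test"]]
        except Exception:
            continue                        # broken guesses are just skipped
    return None

toy = {"train": [{"input": [[1, 2]], "output": [[2, 1]]}],
       "test": [{"input": [[3, 4]]}]}
print(search_programs(toy))                 # [[4, 3]] via the mirror rule

The "3% per doubling" scaling claim quoted above maps directly onto n_guesses here: more samples mean more chances that one candidate program survives the consistency check.
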
24:01

So this is pretty crazy, because even he, the guy who created the benchmark, is stating that this is the right path to go down, and even some of the craziest critics, like Gary Marcus, state that there will be no AGI without neurosymbolic AI. He's got this entire talk, I was literally listening to it, basically a 30-minute talk in which he discusses how everyone's got the current approach wrong, a really fascinating piece where he argues that this entire paradigm we are on is probably wrong. But I'm guessing that maybe now, with this new sort of approach, things actually might start to get towards human level in nearly every aspect.

24:41

Now, what's actually crazy about this entire thing is that in his qualitative analysis there are some key areas where GPT-4o is pretty limited. You can see GPT-4o is limited by failures other than reasoning, which we know is, I guess you could say, pretty limited already: GPT-4o's vision is terrible on grids. When asked to describe what is in a somewhat large grid, it often fails to see the input correctly and states wrong facts about what colors are in some location or what's present; in particular, it totally fails to extract the colors of the cells from an image for 12-by-12 images, and is quite bad at 8-by-8. With visual abilities as poor as GPT-4o's, it would often take humans quite a bit of effort to even solve simple ARC-AGI problems. If you want a frustrating time, try solving some ARC-AGI problems without using vision other than reading; that is, try to do them without ever drawing out the grids in 2D, forcing yourself instead to interact only with a textual representation of the data. For hard mode, you could try doing this blindfolded, with a friend allowing you to dictate Python lines of code to run on the image, and I think that this will be quite hard.

25:46

So basically, here he's trying to state: look, this system I built, which is able to get a state of the art on this ARC-AGI evaluation benchmark, is very limited by the fact that GPT-4o's vision just is innately not that good for this specific task, whereas human vision is really good. So this is going to be one of the things where, on future models, as the vision systems get a lot better, we're going to be seeing whether the interpretation gets a lot better as well.

26:20

Now, of course, he says GPT-4o isn't that good at coding and makes simple mistakes, like off-by-one errors, extremely often, and that they don't do multi-round debugging because it's probably cheaper and more effective just to get more samples in the current regime. Of course, GPT-4o sometimes hallucinates, which could reduce the reliability of the results. And he says GPT-4o is worse at using long context than other models: I think that the long context for GPT-4o is quite bad and starts taking a big hit after about 32,000 to 40,000 tokens, based on my qualitative impression, which is limited by my ability to use longer prompts with more examples and detailed representations. He also says it doesn't seem to respect his few-shot prompt and often does somewhat worse than it should based on the few-shot examples; for instance, it systematically responds with much shorter completions than it is supposed to, even when given very specific instructions to do otherwise. So you can see that GPT-4o's context length is something that hasn't really increased that much, and this is because OpenAI have, I wouldn't say stopped, but they've not been under the same pressure as the other AI labs to roll out systems that can take in huge context lengths and output huge context lengths. They also state that not having flexible prefix caching substantially limits approaches, which is of course something that is going to limit the system. And then he says that removing these non-reasoning weaknesses would improve the performance of his solution by a significant amount; vision is especially a large weakness.

27:56

So the point here, guys, and this may not be so much of a revelation, but it should be an eye-opener for you: this new benchmark wasn't solved immediately, but this guy was able to get around 50% using the current state-of-the-art technology, and that's even with these very obvious limitations of the systems. That means there is still very large room for AI to grow, because these aren't places where we've hit a wall, where we're like, okay, we have no idea how we're going to improve the vision, no idea how we're going to improve the coding, no idea how we're going to improve the long context. These are things that, I wouldn't say have defined solutions, but people are actively working on them, and we can say with a decent amount of confidence that this is going to get better. And once these get better, once you combine them into certain frameworks and certain architectures, like using GPT-4o in this context, I think this is probably going to surpass the ARC-AGI benchmark, probably by the time GPT-5 is released, which would be pretty fascinating. It doesn't mean it's going to be AGI, but maybe it's going to be better than the average human.

29:05

I'll leave a link to this in the description, but of course there are some predictions in here, and it's pretty interesting. He says there is a 70% probability that a team of three top research machine-learning engineers, with fine-tuning access to GPT-4o, $10 million in compute, and one year of time, could use GPT-4o to surpass typical naive MTurk performance on ARC-AGI's test set while using less than $100 per problem at runtime. And he says there is a 60% probability that if a next-generation frontier model like GPT-5 were much better at basic visual understanding, for example above 85% accuracy on Vibe-Eval Hard, then using this same exact method, with minor adaptation and tweaks as needed, that LLM would surpass typical naive MTurk performance. Basically, what he's stating is that with future systems like GPT-5... and there's an 80% probability that the next generation of multimodal models will be able to substantially improve performance on ARC-AGI, which is pretty incredible. That means that when the next systems do release, it is very likely we're going to see a whole lot of new benchmarks broken, especially given all of these sorts of frameworks that are going to surround this technology.

30:21

So this is just a quick reminder for those of you who like to support the channel: I recently launched a Skool community. This is a private community where we focus on things like the post-AGI framework I developed; you get instant download access to a framework that lets you navigate the post-AGI economy with ease, plus my personal strategy for making money with AI, exclusive tutorials on how to use no-code AI agent frameworks, and the AGI-proof investment deck that's helped me make very sizable returns. If that interests you, don't forget to check it out; if not, just enjoy the rest of the videos on the channel.


Related Tags
Artificial intelligence, math ability, research, AlphaGo, Monte Carlo, LLM, parameters, AI development, technical breakthrough, future predictions