Claude 3.5 struggles too?! The $1 million challenge

AI Jason
26 Jun 2024 · 23:31

Summary

TL;DR: The script discusses the challenges AI faces in learning new tasks not present in its training data, contrasting this with human adaptability. It introduces the ARC challenge, a benchmark for measuring AI's ability to learn from limited examples. The speaker explores various approaches to solving ARC tasks, including using large language models, multi-agent systems, and active inference. The goal is to develop AI that can match human-like learning and adaptability.

Takeaways

  • 🧠 Large language models like GPT-4o struggle with tasks not present in their training data, highlighting their reliance on memorization rather than true reasoning or intelligence.
  • 👶 Humans can adapt to new situations with very little data, unlike current AI systems, demonstrating a fundamental difference in learning capabilities.
  • 📊 The ARC benchmark, introduced by François Chollet in 2019, measures AI's ability to learn and adapt to new tasks from minimal examples, aiming to assess general intelligence.
  • 💡 The ARC challenge presents a collection of unique tasks where AI must identify patterns from input and output examples to predict correct outcomes.
  • 🌟 As of June 2024, the best-performing AI systems achieve only around 39% correctness on the ARC benchmark, indicating significant room for improvement.
  • 🚀 A global competition with a $1 million prize pool incentivizes the development of AI systems that can achieve superhuman performance on the ARC test set.
  • 🔍 HubSpot's research provides insights into integrating AI into data analysis workflows, offering best practices and a checklist for companies to leverage AI effectively.
  • 🛠️ Participants in the ARC competition can access training and evaluation datasets to build and test AI systems, with the goal of generating accurate outputs based on given inputs.
  • 🤖 Different approaches to solving ARC tasks include using large language models, prompt engineering, multi-agent systems, and discrete program search.
  • 📈 Active inference, fine-tuning a model at evaluation time on artificially expanded versions of a task's few examples, has shown promise in improving performance on ARC tasks.

Q & A

  • What is the main challenge presented by the script?

    -The main challenge is to identify patterns in matrix transformations from minimal examples and generate the corresponding outputs, a task that large language models like GPT-4o struggle with due to their reliance on their training data sets.

  • Why are large language models poor at handling new things they weren't trained on?

    -Large language models are poor at handling new things because they predict the next word based on probability within their training data set. They don't truly understand or think through problems but rather memorize and spit out answers based on past data.

  • What does the script suggest as the definition of true intelligence?

    -The script suggests that true intelligence is the ability to adapt and learn new things, as opposed to just relying on past experiences and knowledge.

  • What is the ARC benchmark mentioned in the script?

    -The ARC benchmark is a collection of unique training and evaluation tasks designed to measure the efficiency of AI skill acquisition on unknown tasks. ARC stands for Abstraction and Reasoning Corpus, and it is used to test AI systems' ability to learn and adapt to new scenarios.

  • How does the ARC benchmark work?

    -Each ARC task presents grids where each square can be one of 10 colors, along with multiple input-output examples that showcase a pattern. The goal is to build an AI system that can predict the exact output for a new input.

  • What is the current performance of AI systems on the ARC benchmark as of June 2024?

    -As of June 2024, the best-performing AI system answers 39% of ARC tasks correctly.

  • What is the goal of the ARC challenge competition?

    -The goal of the ARC Prize competition is to build an AI system that achieves superhuman-level performance, defined as 85% correctness on the ARC test data set.

  • What is the prize for winning the ARC challenge competition?

    -The total prize pool for the winning teams of the ARC Prize competition is $1 million.

  • How can one participate in the ARC challenge?

    -One can participate by going to Kaggle and searching for 'ARC Prize 2024', where they can join and submit predictions.

  • What are some of the methods explored in the script to solve the ARC challenges?

    -The script explores methods such as using large language models, breaking problems down into multiple steps, using multi-agent systems, and leveraging discrete program search with a huge amount of code generation and verification.

Outlines

00:00

🧠 Understanding Matrix Patterns and AI's Learning Limitations

The paragraph discusses the challenge of identifying patterns in matrix transformations with minimal examples. It highlights the difficulty faced by AI models like GPT-4o in handling tasks not present in their training data. The text contrasts AI's pattern recognition and memorization capabilities with human adaptability and the concept of true intelligence, which involves learning new things with limited data. It references François Chollet's paper "On the Measure of Intelligence" and introduces the ARC benchmark for testing AI's ability to learn from a few examples, comparing the progress of AI with human performance on these tasks.

05:00

💡 The Potential of Solving ARC and Its Impact on Programming

This section explores the potential of solving the ARC challenge and its implications for a new programming paradigm. It suggests that a solution to ARC could revolutionize programming by letting people describe a problem with a few examples and having AI generate a program that generalizes to new data. The paragraph also discusses the excitement around the project and its potential contribution to progress towards Artificial General Intelligence (AGI). It mentions the different approaches people have tried for building AI systems capable of adapting and learning new things, and how one can participate in the ARC challenge by accessing the data sets and submitting predictions.

10:02

🔧 Setting Up the ARC Challenge and Initial Attempts with AI Models

The paragraph outlines the process of setting up an environment to participate in the ARC challenge, including loading the data sets and creating functions to validate AI-generated answers. It describes the structure of the data sets and the goal of building an AI system that can accurately predict outputs for new inputs. The speaker shares an initial attempt at using a large language model, GPT-4o, to solve one of the challenges, noting the success and limitations of this approach.
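
A minimal sketch of that setup step, assuming the Kaggle input path for the competition data (the exact file name follows the public ARC Prize 2024 dataset and is an assumption here):

```python
import json
from pathlib import Path

# Kaggle mounts competition inputs under /kaggle/input; the file name below
# follows the public ARC Prize 2024 dataset and should be verified.
DATA_DIR = Path("/kaggle/input/arc-prize-2024")

def load_tasks(file_name: str = "arc-agi_evaluation_challenges.json") -> dict:
    """Load every task in one of the competition JSON files, keyed by task id."""
    with open(DATA_DIR / file_name) as f:
        return json.load(f)

tasks = load_tasks()
first_id = next(iter(tasks))
print(first_id, len(tasks[first_id]["train"]), "train pairs")
```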

15:03

🤖 Exploring Advanced AI Techniques for ARC Challenge

This section delves into more advanced techniques for tackling the ARC challenge, such as using multiple large language model chains, agents, or a multi-agent system. The speaker experiments with breaking down the problem-solving process into two steps, using one model to identify patterns and another to apply those patterns. They also discuss the concept of using a 'coder agent' to write code that transforms inputs into outputs, and a 'program verifier agent' to test the code. The paragraph explores the idea of discrete program search and the challenges of combinatorial explosion in program synthesis.
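
A compressed sketch of how such a three-agent loop might be wired together, assuming the pyautogen 0.2-style API (the agent names and system messages here are my own abbreviations, not the video's exact setup):

```python
import autogen

# llm_config shape follows the pyautogen 0.2-style API; inside the offline
# competition you would point this at a local open-source model instead.
llm_config = {"config_list": [{"model": "gpt-4o"}]}

pattern_identifier = autogen.AssistantAgent(
    name="pattern_identifier",
    system_message="Study the train pairs, state the transformation rule, "
                   "and list the requirements the program must meet.",
    llm_config=llm_config,
)
coder = autogen.AssistantAgent(
    name="coder",
    system_message="Write a Python function transform(grid) implementing the rule.",
    llm_config=llm_config,
)
verifier = autogen.UserProxyAgent(
    name="program_verifier",
    human_input_mode="NEVER",
    code_execution_config={"work_dir": "arc_programs"},  # executes the coder's code
)

group_chat = autogen.GroupChat(
    agents=[pattern_identifier, coder, verifier], messages=[], max_round=12
)
manager = autogen.GroupChatManager(groupchat=group_chat, llm_config=llm_config)
verifier.initiate_chat(manager, message="Solve this ARC task: <task JSON here>")
```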

20:04

📊 Active Inference and the Future of Solving ARC

The final paragraph discusses active inference as a method for improving AI performance on the ARC challenge. It explains how fine-tuning a large language model on a task's few examples, artificially expanded, can lead to better performance. The speaker highlights the use of synthetic data to fine-tune the model during the evaluation stage, a novel approach not commonly seen with LLMs. The paragraph concludes with a call to action for participants to explore various methods, including fine-tuning with synthetic data, and to share their findings and progress in solving the ARC challenge.
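
Concretely, the kind of artificial expansion described here can be sketched as follows: take the handful of demonstration pairs and multiply them with rule-preserving transformations. This is a minimal sketch, assuming rotations and color permutations are safe for the task at hand (they are not for every ARC task):

```python
import random

def augment_pair(pair: dict, num_colors: int = 10) -> dict:
    """One synthetic variant of a demonstration pair: apply the same random
    rotation and color permutation to both the input and the output grid."""
    def rotate(grid):
        # 90-degree clockwise rotation
        return [list(row) for row in zip(*grid[::-1])]

    def recolor(grid, mapping):
        return [[mapping[v] for v in row] for row in grid]

    quarter_turns = random.randrange(4)
    # Random permutation of colors 1-9; color 0 (the background) stays fixed.
    perm = [0] + random.sample(range(1, num_colors), num_colors - 1)

    def transform(grid):
        for _ in range(quarter_turns):
            grid = rotate(grid)
        return recolor(grid, perm)

    return {"input": transform(pair["input"]), "output": transform(pair["output"])}

# Expand a few demonstrations into enough data points to fine-tune on, e.g.:
# synthetic = [augment_pair(p) for p in task["train"] for _ in range(200)]
```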

Keywords

💡Pattern Recognition

Pattern recognition refers to the ability of a system to identify and generalize rules from given examples. In the context of the video, it is crucial for AI systems to recognize patterns in matrices and apply these patterns to new, unseen data. The script discusses how AI models are challenged to identify patterns with minimal examples, which is a fundamental aspect of intelligence.

💡Large Language Models (LLMs)

Large Language Models, such as GPT-4o mentioned in the script, are AI models trained on vast amounts of text data to understand and generate human-like text. The video discusses the limitations of LLMs in handling tasks that they were not trained on, highlighting the difference between memorization and true intelligence.

💡Abstraction and Reasoning Corpus (ARC)

ARC is a benchmark introduced in the script for measuring AI's ability to learn and solve problems that it hasn't been trained on. It consists of tasks that require abstract reasoning to identify patterns from a few examples and apply them to new situations. ARC is central to the video's discussion on advancing AI towards general intelligence.

💡General Intelligence

General intelligence, as discussed in the video, refers to an AI system's ability to understand, learn, and apply knowledge across a wide range of tasks, not just those it has been explicitly trained on. It is the overarching goal of the AI research highlighted in the script.

💡Adaptability

Adaptability is the capacity to adjust to new environments or conditions. In the script, it is used to contrast human intelligence with current AI capabilities, emphasizing the human ability to quickly adapt to new situations with limited data, which current AI struggles to replicate.

💡Training Data Set

A training data set is a collection of data used to train machine learning models. The video script discusses the limitations of relying solely on large training data sets for AI learning, suggesting that true intelligence requires the ability to learn from minimal examples.

💡Memory Ability

Memory ability, as mentioned in the video, refers to the capacity of AI systems to store and retrieve information. It is contrasted with intelligence, as AI's reliance on memory to answer questions is seen as a limitation compared to the adaptive learning exhibited by humans.

💡Benchmark

A benchmark, in the context of the video, is a standard or point of reference used to evaluate the performance of AI systems. ARC serves as a benchmark for testing AI's ability to learn and reason, independent of the size of its training data.

💡Programming Paradigm

A programming paradigm, as discussed in the video, is a style or approach to programming. The script suggests that solving ARC could lead to a new programming paradigm where programs are created based on a few input-output demonstrations rather than extensive coding.

💡Multi-Agent System

A multi-agent system, highlighted in the script, is a system composed of multiple interacting intelligent agents. The video discusses using such systems to improve AI reasoning by distributing tasks like identifying patterns and verifying outputs across different agents.

💡Discrete Program Search

Discrete program search, as mentioned in the script, is a method of exploring many different program options to find a solution that fits certain criteria. It is compared to exploring all possible paths in a search for the best solution, emphasizing efficiency and effectiveness in AI problem-solving.

Highlights

AI's challenge in identifying patterns from limited examples, a task that large language models like GPT-4o struggle with.

The importance of adaptability and learning new things as a measure of true intelligence, contrasting with memorization.

ARC, a benchmark for measuring AI's skill acquisition on unknown tasks, introduced by François Chollet in 2019.

The ARC challenge presents a grid where each square can be one of 10 colors, requiring AI to predict outputs based on input patterns.

By June 2024, the best performing AI system achieved only 39% correctness on the ARC test set.

A study shows that an average human can answer 84% of ARC puzzles correctly, highlighting the gap between AI and human intelligence.

The ARC Prize competition aims to build an AI system that can achieve superhuman performance on the ARC test set.

The potential of solving ARC as a new programming paradigm where describing input and output examples is enough to generate a program.

Different approaches to building AI systems that can adapt and learn new things, including prompt engineering and multi-agent systems.

A method involving a large language model to generate code to answer ARC challenges, with a success rate of 50% on public test data.

The concept of discrete program search, exploring a vast amount of code possibilities to find a solution.

Active inference as a method to fine-tune large language models on a small number of examples, improving performance on ARC.

The potential of synthetic data in fine-tuning models to adapt to new tasks, a method that has not been commonly explored with LLMs.

The importance of active inference in human intelligence versus the static nature of LLMs at inference time.

The current state of AI's ability to solve problems it hasn't been trained on and the potential for future breakthroughs.

An invitation for the audience to participate in the ARC Prize competition and explore innovative solutions to ARC challenges.

Transcripts

00:03

If I ask you to identify the pattern of how the matrix on the left side transforms into the matrix on the right, with as little as just one example, you'll probably be able to figure out the pattern and generate the output on the right side, which looks something like this. Even for a more complicated example like this one, with as little as two or three different examples, you'll probably be able to identify the pattern: it takes the smallest rectangle shape as the output, which means you are now able to answer this new question. These seemingly intuitive and simple tasks that you can answer are something a state-of-the-art large language model like GPT-4o will really struggle with, because that information is not part of its training data set. Even though large language models have shown an impressive ability to solve problems, especially with recent agentic behavior where they are capable of generating complex code or explaining deep concepts, they are fundamentally very poor at handling new things they weren't trained on. The way a large language model works is basically by predicting the next word based on probability within its massive training data set, and the reason it can answer some basic math questions, or things that require some logic and reasoning, is not necessarily that it actually thinks things through; it memorizes and spits out the answer based on its memory. Some might argue that this is not a showcase of true intelligence but rather of strong memory ability, and that's a big difference, because many believe the definition of true intelligence is the ability to adapt and learn new things. As the Swiss psychologist Jean Piaget put it, "intelligence is what you use when you don't know what to do." If we always just rely on past experience and knowledge, then no matter how big the amount of data, we will always be limited by that past experience; as humans, we would never have breakthroughs or learn to use new skills and tools never seen before, because they were not part of the training data. But if you put a baby or a kid in a new neighborhood, even though they never went through any training, they are able to adapt to the new environment or language without pouring in millions of training data points. That is the ability to solve problems never seen before with very little training data.

02:16

This exact problem was described by François Chollet back in 2019, when he published a paper called "On the Measure of Intelligence", where he introduced a benchmark we can use to measure the efficiency of AI skill acquisition on unknown tasks, called ARC, which stands for Abstraction and Reasoning Corpus. This benchmark is basically a collection of unique training and evaluation tasks. Each task contains multiple input and output examples to showcase a pattern, and the puzzle-like inputs and outputs present a grid where each square can be one of 10 colors. The goal is to build an AI system that can predict the exact output based on a new input. Some of the tasks might feel simple and straightforward, but many of them are actually not, and can be quite complex; they also cover a wide variety of different types of tasks that are fairly unique compared with each other. Those tasks don't require any prior knowledge; it's a pure test of intelligence.

Ever since this benchmark was introduced back in 2019, many people have tried to build AI systems that can complete those puzzles. Back in 2020, the best-performing AI system was able to answer 20% of those tasks correctly, and the latest as of June 2024 is 39%; we will dive a bit deeper into their methods. Meanwhile, back in 2021, New York University actually did a study to get a human benchmark: according to that study, an average human is able to answer 84% of all the puzzles. That means if we have an AI system that can achieve a similar level of performance as a human, we have actually built a system that can learn and adapt to all sorts of new scenarios, just like humans do. That is basically the goal of the ARC Prize competition happening right now. It is a global competition everyone can join, and whoever builds an AI system that achieves superhuman-level performance, which is 85% correctness on the ARC test data set, will win the competition. There is a total prize pool of $1 million for the winning teams.

04:12

And what does this actually mean? Let's say by the end of this year someone builds an AI system that can actually achieve this. Here is what François Chollet said: "Given that ARC is a minimum reproduction of general intelligence, how important is it once we discover a solution? At the very least, I think a reliable solution to ARC would amount to a new programming paradigm. If it works on more domains than just ARC, then it means you've found a way to, given just a handful of input-output pair demonstrations, produce a program that matches what you described with your examples. And because you only need a couple of examples, it means that anyone is now able to program computers just by describing: here's my input, here's my output, doing this twice or three times, and now they have a program they can run, one that will generalize to new data in very much the same way that a human who has seen the same examples would generalize. Hopefully that's AGI, but if it's not, I think it's at the very least a revolutionary new programming paradigm that will make everyone a programmer, in a much truer way, I think, than LLMs that basically spit back code snippets similar to things they were trained on from GitHub or Stack Overflow." So this is a really exciting project, and it is going to help the progress towards AGI massively. But what are all the different approaches and methods people have tried so far to build such an AI system that can adapt and learn new things? I'm going to share a few examples of how different teams are trying to tackle this problem, as well as how you can participate.

05:57

But before we dive into that: I know many of you, or your companies, actually have a huge amount of data that could be analyzed or leveraged by AI to extract additional insights, but you are not exactly sure how to do it properly, because there are just so many unknowns, and it's not very clear what best practice is in terms of choosing the right data set to use, which projects to prioritize, or how to pitch those projects internally. That's why I want to introduce you to a research report that HubSpot did, where they interviewed lots of different people from top companies about best practices for integrating AI into their data analysis workflows. It showcases common challenges and pitfalls that you might experience when adopting AI into your data analysis process, as well as best practices for how to start and plan such projects within your company. It even includes a comprehensive checklist for specific things like how to ensure data privacy and security while sending data to different large language models and AI systems. So if you're planning to leverage AI to analyze data in your company, I definitely recommend you go download this free research report and get more prepared for the things you need to do to launch such a project. You can click on the link in the description below to download this report for free.

07:09

Now let's get back to the ARC Prize project. How does it actually work, and how can you participate? If you go to Kaggle and search for "ARC Prize 2024", you'll find the page where you can join and submit predictions. The project has a data set; if you look into it, it has evaluation, training, and test files. Evaluation and training are basically two different data sets you can use to build the AI system, where the training tasks are a bit easier and the evaluation tasks are harder and more difficult. Each challenge in those files has both "test" and "train" sections, and if you open them, they are basically inputs and outputs, where each one is a matrix of numbers representing the chart: each visual chart is represented as a list of arrays, so that you can feed it to systems like a large language model. Each task looks something like this, where "train" holds just three to five examples provided to help you identify the pattern, and "test" is the actual challenge. So you can use all those training examples plus the input from "test" as the input for the system, and the goal of the system is to generate an output that is an exact, 100% match to the correct answer provided there. There are loads of different challenges you can use to train and test the system you're building, and that's pretty much it: you can start trying different approaches and methods to tackle those test data sets.
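
For reference, each ARC task is a small JSON document; loaded into Python it looks roughly like this (the grids shown are illustrative, not from a real task; values 0-9 encode the ten colors):

```python
# Rough shape of one ARC task after json.load() (grids truncated for brevity).
# "train" holds the demonstration pairs; "test" holds the puzzle to solve.
task = {
    "train": [
        {"input":  [[0, 7, 7], [7, 7, 0]],
         "output": [[0, 7, 7, 0, 7, 7], [7, 7, 0, 7, 7, 0]]},
        # ...usually two to five demonstration pairs in total
    ],
    "test": [
        {"input":  [[4, 0, 4], [0, 0, 0]],
         "output": [[4, 0, 4, 4, 0, 4], [0, 0, 0, 0, 0, 0]]},  # present in the public sets, used to verify
    ],
}
```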

08:26

I'm going to show you very quickly how to set up a basic environment to start building the system. If you go to Kaggle and click "Create new notebook", you can click "Add Input" and search for "ARC Prize 2024"; it will be one of the results, and you can click on it to add it to your notebook. If you close that panel, you will see this one input called "ARC Prize 2024", which has all the JSON files we will need. What I'm quite curious about is testing the raw ability of a large language model to solve this problem: even though a large language model by itself is System 1, fast thinking with just pre-trained data, there are a lot of techniques that could be used to increase its reasoning ability, so I'm quite curious to try it out and see what the raw output looks like. That's what I'm going to do here.

First we want to load the data, so I import a few libraries and create a function to split tasks that have multiple test input-output pairs, which makes handling a bit easier. Then I run a quick function to load the evaluation data set, which is the harder one, and at the end I create a pandas DataFrame. Last, we can try it out to see what a task looks like: if I access dataset zero, you will see that each task looks something like this; it has training input-output examples as well as test data, including the output we can use to verify. That verification is what we're going to do next: I want to create a quick function to validate whether the generated answer is correct, as well as some helper functions to extract the answers, and for that I install a few libraries. I'm going to use OpenAI for this test, but just be aware that when you actually participate in the competition you are not allowed to use OpenAI models, because there won't be internet access, so you have to use an open-source model. I create a few functions to extract the final output generated by the system, because if I'm using a large language model, the answer might include all sorts of different things, like the reasoning itself, so I need one function to do exactly that. Then I create a function to compare the result generated by the AI system against the correct output from the data set and report whether it is a 100% match. And that's pretty much it.
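
A minimal sketch of those two helpers, with hypothetical names (an answer extractor that pulls the last grid out of a free-form reply, and an exact-match checker):

```python
import json
import re

def extract_grid(llm_answer: str):
    """Pull the last JSON-style list-of-lists out of a free-form LLM reply.
    LLM answers often mix reasoning text with the grid, so we grab the last
    bracketed block that parses as a list of lists."""
    candidates = re.findall(r"\[\s*\[.*?\]\s*\]", llm_answer, flags=re.DOTALL)
    for cand in reversed(candidates):
        try:
            grid = json.loads(cand)
            if isinstance(grid, list) and all(isinstance(row, list) for row in grid):
                return grid
        except json.JSONDecodeError:
            continue
    return None

def is_correct(predicted, solution) -> bool:
    """ARC scoring is all-or-nothing: every cell must match exactly."""
    return predicted == solution
```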

10:38

Now we can start building functions that take in the test data, generate an answer, and compare the result. First, I'm just going to try a basic vanilla LLM call with GPT-4o. I create a function to solve with a single LLM call, which takes in the task plus few-shot examples if I have any. I create a system prompt that includes the few-shot examples, if there are any, then shows the training data, and at the end the input data. Here I didn't even ask it to do chain-of-thought, so it's a very basic large language model call. Then I just take one challenge, get the true solution, generate an answer from the large language model, compare the results, and at the end output the final result.
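
A minimal sketch of what this vanilla single-call approach can look like with the OpenAI Python SDK (the function name and prompt wording are mine, not the video's exact code):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def solve_with_single_call(task: dict, model: str = "gpt-4o") -> str:
    """One-shot attempt: show the train pairs, ask for the test output grid."""
    demos = "\n".join(
        f"Input: {pair['input']}\nOutput: {pair['output']}"
        for pair in task["train"]
    )
    prompt = (
        "You are solving an ARC puzzle. Infer the transformation from the "
        "examples, then apply it to the test input.\n\n"
        f"{demos}\n\nTest input: {task['test'][0]['input']}\n"
        "Reply with the output grid as a JSON list of lists only."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```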

11:21

I run it, and okay, we got an answer. This is the thought process from GPT-4o, and the final output is this one, and we can see that it actually answered correctly: the correct percentage is 100%. If we check this file name, which is 0576 224, this is the actual visualized challenge; you can see it basically just repeats the pattern multiple times, and the model actually answered correctly. That's a pretty good start, but let's try the next one to see how well it performs. This is the second one: from what I can see, it basically just takes the biggest shape and colors it differently; I assume the color is based on the actual number, with some transformation. If we go back to the notebook, you can see that the answer here is incorrect: GPT-4o with a vanilla call got both the shape as well as the color wrong. So obviously there are some limitations with a direct large language model call. My question now is whether some traditional prompt-engineering tactics, like multiple LLM chains, or agents, or even multi-agents, are going to help in this situation. I'm pretty keen to just try it out.

12:28

So the second thing I'm going to try: the idea here is, instead of just going through one large language model call, can I increase its intelligence by breaking the problem down into two steps? Step one is just looking at the examples and trying to identify the pattern, and step two is trying to solve the task. So I have a first large language model call to identify the patterns and explain the transformation it observes from input to output, and then I put those rules into the prompt for a second large language model call. Let's see if it actually improves the reasoning. I basically do the same thing as before but swap in this new function I created. Okay, now we get the answer, and if I look at it, the reasoning actually improved compared with the previous one, because if you look at the result, it does remove all the small symbols from the output, a rule the previous attempt totally ignored. But the part it got wrong is the color of the actual shape, and this, to be honest, is a bit challenging; I don't even know what the rule here is, because it looks like there are different colors based on different shapes. If you look at the reasoning step, it did actually capture a few different rules: one is that the number 8 represents some particular structure that needs to be transformed, but it just wasn't sure what the rule was; another is that each input grid contains one smaller shape that should be removed from the grid afterwards.
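
A minimal sketch of this two-step chain, under the same assumptions as the single-call sketch above (prompt wording is mine):

```python
from openai import OpenAI

client = OpenAI()

def solve_with_two_steps(task: dict, model: str = "gpt-4o") -> str:
    """Step 1: have the model describe the transformation rule in words.
    Step 2: feed that rule back in and have it apply the rule to the test input."""
    demos = "\n".join(
        f"Input: {pair['input']}\nOutput: {pair['output']}"
        for pair in task["train"]
    )
    rule = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": "Describe, step by step, the transformation that maps "
                       f"each input grid to its output grid:\n\n{demos}",
        }],
    ).choices[0].message.content

    return client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Transformation rule:\n{rule}\n\n"
                       "Apply this rule to the test input and reply with the "
                       "output grid as a JSON list of lists only.\n"
                       f"Test input: {task['test'][0]['input']}",
        }],
    ).choices[0].message.content
```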

13:51

What I want to try next is to see whether a multi-agent system can actually help in this situation. The idea in my mind is to build a multi-agent system to improve the reasoning here, and one key concept is that instead of getting the large language model to generate the answer directly, we can get a coder agent to write a piece of code that performs the transformation. The benefit is that we can then get a program-verifier agent to run this code against the example input-output pairs to check whether the program actually delivers the result we want; if so, use that program to produce the result, and if not, give feedback to the coder to iterate on the code. That's the concept, and I'm pretty curious to see whether it works.

For this method I'm going to use AutoGen to set up a multi-agent system, so I install the autogen package and set it up. I create a temporary folder to store the programs generated by the coder, and I first create an agent called "pattern identifier" whose role is specifically to identify the pattern and explain the requirements the program should meet. Next, I create a coder who will actually generate the code; I want this code to take two inputs, an input grid and an output grid, where the output grid is an optional parameter, and the program should run the transformation and report whether it produced a 100% match or is incorrect. Finally, I write a program verifier who plays the QA role and runs the code against all the example input-output pairs: if any example pair comes back incorrect, it gives feedback to the coder to iterate on the code; if the results are correct, it runs the test input, returns the final result, and terminates. This agent should be able to execute code as well, so I give it a code-execution config, and in the end I add a user proxy agent, create a group chat, and define a group chat manager. That's pretty much it.

I run this the same way, and you can see the task passes to the pattern identifier, which successfully identifies a requirement for the code: it should map the value 8 to a new consistent number and remove all the other values that are not 8. Then the coder starts generating code to meet these criteria: if an output grid is provided and the result is a 100% match, it returns that the program ran correctly; otherwise it returns that the program is incorrect. The program verifier then runs this code multiple times against the examples, and the output successfully returned results in the end; even though the program verifier errored out for some reason, the answer is correct this time. But if you look deep into the program it has written, you realize it is probably not correct, because it always maps 8 to 7, a specific number, which is probably not the true rule. So let me just try to run it again. Okay, in the second attempt it still doesn't seem very good; it just keeps failing to generate a complete program for some reason. But I can clearly see that this third method, with internal reasoning steps, seems to be more reliable than the other two methods.
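
The heart of this coder/verifier loop is checking a candidate program against the train pairs before trusting it on the test input. A minimal sketch, assuming the coder was instructed to define a transform(grid) function:

```python
def load_transform(program_source: str):
    """Execute LLM-generated source and pull out its `transform` function.
    NOTE: exec on model output is unsafe outside a sandbox; a real verifier
    agent would run it in an isolated process."""
    namespace: dict = {}
    exec(program_source, namespace)
    return namespace["transform"]  # assumes the coder defined transform(grid)

def passes_all_train_pairs(transform, task: dict) -> bool:
    """A program is accepted only if it reproduces every train output exactly."""
    for pair in task["train"]:
        try:
            if transform(pair["input"]) != pair["output"]:
                return False
        except Exception:
            return False
    return True
```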

17:06

Later, I read an article from Ryan Greenblatt, who achieved a 50% score on the public test data set, and the method he used is very interesting. His method also gets the large language model to generate code to answer the question, but it is a lot more sophisticated, with a lot of optimization. It gives the large language model, GPT-4o, both the image representation of the inputs and outputs as well as the text representation, and it first asks the model to reason and plan what kind of code should be written. But in the next step, instead of writing code once, or using an agent to iterate multiple times on one piece of code, it gets the model to generate 3,000 to 5,000 different attempts at the code. That is a crazy amount of exploration and generation; from those 3,000 to 5,000 samples it picks the 12 best programs, runs them against the example inputs and outputs to verify, and if there's any error it iterates. So the high-level idea is actually similar, but the crazy part is that it generates a huge amount of code, and it also looks like multimodal input can lead to better reasoning in this example. I took a look at his GitHub: it actually has few-shot prompt examples to teach the large language model how to interpret the image and text representations of the challenge, as well as Python code examples.

This method of exploring a huge number of possibilities is something we normally call discrete program search: the process of getting the model to search and explore a massive number of different options, then checking which option actually leads to the right path, like running the code to verify the output, and in the end selecting the ones that actually worked. It's a similar concept to the one AlphaGo adopted, where it explores a huge number of different paths and possibilities. This seems to be a really effective method, except it is going to burn a huge amount of cash to run such a program. François Chollet also posted a tweet about this. He said the main issue with program synthesis, which is basically this program search, is combinatorial explosion: the space of all programs grows combinatorially with the number of available operators, and the best way to fight this explosion of possible paths is to leverage intuition over the structure of program space. That means you can probably find a specific model to sample programs and suggest the right branching decisions, so instead of exploring all the possible paths, you can use a model to help decide which path is more worth exploring than others; even if that intuition or judgment is not always correct, it will still guide the search toward more promising paths with a much more efficient structure. He actually has a Google Slides deck that explains a few different concepts in detail; I put a link in the description below where you can check out more details.
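
A toy sketch of that sample-then-verify idea: draw many candidate programs, score each by how many train pairs it reproduces, and keep the best few (the candidate sources would come from an LLM sampling pipeline not shown here):

```python
def search_programs(task: dict, candidate_sources: list[str], keep: int = 12):
    """Best-of-N discrete program search: score each candidate program by how
    many train pairs it reproduces exactly, and keep the top `keep`."""
    scored = []
    for source in candidate_sources:  # e.g. 3,000-5,000 LLM samples
        namespace: dict = {}
        try:
            exec(source, namespace)  # unsafe outside a sandbox
            transform = namespace["transform"]
            score = sum(
                transform(p["input"]) == p["output"] for p in task["train"]
            )
        except Exception:
            continue  # discard candidates that don't even run
        scored.append((score, source))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [src for _, src in scored[:keep]]
```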

20:00

On the other hand, I also noticed something called active inference that seems to deliver really good results, and the concept is basically using synthetic ARC-AGI-like data to fine-tune your language model during the evaluation stage. Here's a quick clip where François Chollet talks about this specific method:

"The trick to making this LLM-based solution work: the thing is, if you take a state-of-the-art LLM and you additionally pre-train it on millions of synthetically generated ARC-like tasks, that still doesn't work very well at all; it's maybe on the order of 10%, so not very good, in fact much worse than relatively basic discrete program search solutions. The trick to actually making this approach work well is active inference, which is this idea that when you're presented with one of the test tasks that you're supposed to solve, you're presented with a small number of input-output demonstration examples, and the idea is going to be to fine-tune the LLM on these examples. Of course, because you only have a couple of them, that's not really enough statistically to get this to work, so you're going to want to expand them artificially, using a DSL that applies transformations to them, trying to make them more diverse, to have basically enough data points to fit a curve, while still trying to make them match the original task. This trick of doing active inference tuning is actually what unlocks the performance of this approach, and I don't think this is something I've seen with LLMs before; actually, I don't think anyone else is doing that, and the fact that it has this outsized impact on the solution, I think, is really interesting. It feels intuitive that LLMs are not, on their own, the solution: LLMs, things like Gemini or GPT and so on, are basically frozen at inference time. They're not actually learning anything when you show them a new task, but that's not how humans operate. Obviously, when you look at a task, you're forced to adapt to it; you're not just fetching something from your memory that matches this task. If it's a task you've already seen before, then sure, maybe that's what you're doing, but the idea is that you can be exposed to tasks you've never seen before, and you need to make sense of them, so you need to basically learn from them. You need this active inference step, which vanilla LLMs aren't doing, and I think that's really one of the big blocks on their performance, especially on ARC."

22:35

So those are a few example implementations of how you can attempt to solve the ARC challenges, and the key thing here is that the real breakthrough hasn't even shown up yet, so there are loads of methods you can try. I definitely recommend you go ahead and play: you can go to Kaggle, search "ARC Prize 2024", and just join and submit your solution. If you want to find some examples, you can go to the Code tab; there are some examples already, like this Llama 3 8B example, that you can use as a reference to see how others approached the solution. I'm really keen to see what kind of interesting solutions people start exploring. This is definitely a project I'm going to continue monitoring, and I'm quite interested in trying the method of fine-tuning with synthetic data as well, to see how well it can work. If you do want to keep updated, please comment below about interesting methods you've heard of or tried; I'm really keen to discuss more on this. And if you want to keep updated, please like and subscribe. Thank you, and I'll see you next time.


Related Tags
Artificial Intelligence, Machine Learning, Adaptive Learning, AI Competition, Programming Paradigm, Data Analysis, Innovation, Tech Trends, Research, Intelligence