Build Computing Olympiad Agents with LangGraph

LangChain
25 Apr 2024 · 30:43

Summary

TLDR: This video walks through techniques for improving language models' ability to solve Olympiad programming problems. First, GPT-4 is run against problems from the USA Computing Olympiad and shows a low pass rate; a prompting technique called self-reflection is then used to improve on it. Next, a few-shot retrieval technique called episodic memory is introduced, which pulls high-quality examples from past problem-solution pairs to lift performance further. Finally, a human-in-the-loop interface is added so that a person can step in and offer guidance toward a solution. Together, these techniques may let language models solve quite difficult problems, though the video suggests they are still far from perfect.

Takeaways

  • 🤖 On GPT-4's first attempt, the pass rate on the programming problems was 8.7%.
  • 📈 Inference optimization techniques raised the pass rate to an average of 20.2%.
  • 📚 The dataset consists of 307 problems from the USA Computing Olympiad.
  • 🧠 Pushing LLMs (large language models) to their limits reveals where their reasoning breaks down.
  • 🔍 Self-reflection and episodic memory retrieval contributed most to the model's performance gains.
  • 📝 The tutorial builds, in three steps, agents that can solve increasingly difficult problems.
  • 🔄 A loop structure runs the agent repeatedly until the problem is solved.
  • 🧑‍🤝‍🧑 Ultimately, human-agent collaboration (a copilot-style setup) is shown to produce the best results.
  • 🛠️ LangGraph can be used to control and guide the agent within a task domain, improving the overall outcome.
  • 🔬 The tutorial emphasizes human-in-the-loop as an important approach when building real applications.
  • 📈 With self-reflection, episodic memory, and human-in-the-loop, the difficulty of problems the system can solve rises step by step.
  • 🌟 Future advances in LLMs, and systems with stronger problem-solving ability, are something to look forward to.

Q & A

  • What organization is Will from?

    - Will is from LangChain.

  • What is the title of the paper from the Princeton research team?

    - The paper is titled "Can Language Models Solve Olympiad Programming?"

  • What pass rate did GPT-4 achieve on the USA Computing Olympiad problems?

    - GPT-4's pass rate on the USA Computing Olympiad problems was 8.7%.

  • What gets built across the three parts of Will's tutorial?

    - The three parts of the tutorial build agents for solving competitive programming problems.

  • What is the reflection agent implemented in the first part?

    - The reflection agent is a zero-shot agent that performs self-reflection, running in a loop until it solves the problem.

  • What is the episodic memory implemented in the second part of the tutorial?

    - Episodic memory is a technique that retrieves high-quality examples from past problem-solving episodes and includes them in the prompt to improve the model's reasoning performance.

  • What is the human interrupt added in the third part of the tutorial?

    - The human interrupt is a feature that lets the user intervene in the agent's answer and help guide it to the correct one.

  • What does using LangGraph make possible?

    - With LangGraph you can build a state machine that controls and guides the agent within a task domain, improving the overall outcome.

  • Which model powers the agent in the tutorial?

    - The agent in the tutorial is powered by Anthropic's Claude model.

  • What is the ultimate goal of Will's tutorial?

    - The goal is to build an agent capable of solving competitive programming problems, and to show that the best results come not from the autonomous agent alone but from a collaborative system that includes a human.

  • What does Will mean by "until next time" at the end of the tutorial?

    - It means Will plans to make more tutorials like this and hopes viewers will look forward to future content.

Outlines

00:00

😀 Solving Computing Olympiad Problems with Language Models

Will Fu-Hinthorn discusses how well language models can solve Computing Olympiad programming problems. He introduces a paper from a Princeton research team, covering GPT-4's performance and how it was improved, and touches on the difficulty of the problems and the limits of language models.

05:00

🤖 A Zero-Shot Agent with Self-Reflection and a Loop

Using LangGraph, we build a zero-shot agent that performs self-reflection. The agent has a simple loop structure: understand the problem, generate a response, and run the test cases. Through self-reflection it tries to improve the generated code until the problem is solved.

10:02

🧠 Few-Shot Retrieval with Episodic Memory

We implement the few-shot retrieval technique proposed in the paper, based on episodic memory: recalling outputs from past problem-solution pairs to help with later problems. A BM25 retriever fetches high-quality examples, which are folded into the prompt; this is expected to improve the model's reasoning performance.

15:04

🔄 Optimizing the Graph and Adding Human Intervention

To take on harder problems, we add human intervention to the graph. After the agent fails the test cases, a human can give it feedback and steer it toward a solution. Using LangGraph's features, we can resume from a checkpoint and reach a good answer while avoiding repeated loops.

20:05

🚀 Solving Problems through Human-Agent Collaboration

In the final part of the tutorial, a demonstration shows how a human stepping into the loop lets the agent solve harder problems. The agent takes the human's feedback and uses it to work through the problem. This illustrates both the current limits of language models and the possibilities opened up by collaborating with humans.

25:08

📚 Wrap-Up and Outlook

Will closes the tutorial by summing up what language models can do on Olympiad programming problems. A language model alone has clear limits, but prompting techniques and better system design improve performance, and collaborating with a human produces even better results. He expects better systems to appear and performance to keep improving.

Keywords

💡Language Model

A language model is a computational model for processing and generating natural language; in this video the focus is GPT-4. Language models power the agents built here to solve Computing Olympiad problems, but because they lack a genuine reasoning process of their own, they hit limits on complex problems. The video explains how language models attempt these problems and why they fall short without human intervention.

💡Computing Olympiad

The Computing Olympiad is a contest of competitive programming ability. In the video, its problems serve as a test case that exposes the limits of language models: they demand advanced data structures and algorithms, combined creatively, to reach a correct solution.

💡Zero-Shot Learning

Zero-shot learning means applying a model directly to a new task without training it on task-specific data. The video shows examples where the language model's zero-shot attempts at previously unseen problems fail.

💡Self-Reflection

Self-reflection is the process by which an agent analyzes its own output and generates feedback on it. The video shows how self-reflection improves the quality of the code the language model produces; it plays an important role in the problem-solving loop, helping the agent find better solutions.

💡Episodic Memory

Episodic memory is the ability to remember past experiences and apply them to future decisions and actions. The video proposes a technique in which the language model imitates episodic memory, drawing on past problem-solving episodes to tackle new problems. This raises the chance that the model finds a better solution to a given problem.

💡Hyperparameters

Hyperparameters are parameters that control a machine learning model's behavior and are usually set in advance. The video includes examples of tuning such settings to optimize the language model's performance, which plays a role in raising the problem-solving success rate.

💡Graph Recursion Error

A graph recursion error is raised when a graph exceeds its configured maximum number of steps, typically because the agent keeps looping without making progress. In the video, this error occurs while the language model repeatedly retries the same problem, showing that it cannot solve it; the error symbolizes the model's limits on complex problems.

💡Human Intervention

Human intervention means a person participating directly in an automated system's decisions and actions to improve its performance. The video shows that with human intervention the language model can solve harder problems; it is one strategy for overcoming the model's limits by drawing on human knowledge and experience.

💡State Machine

A state machine is a model from computer science that defines a system's states and manages the transitions between them. The video explains how a state machine is used to control a language model more effectively and guide it through a specific task. State machines break complex tasks into simple steps and make a system's behavior predictable.

💡Hybrid System

A hybrid system combines different kinds of systems or algorithms so that each contributes its strengths. The video shows that a hybrid system pairing a language model with a human collaborator has stronger problem-solving ability; such systems help overcome the language model's limits and produce better results.

Highlights

Will from LangChain introduces a project to build computing Olympiad agents, referencing a paper by a team from Princeton.

The paper presents a challenge benchmark of 307 competitive programming problems from the USA Computing Olympiad.

GPT-4's initial pass rate on the benchmark is only about 8.7%, which is low compared to other benchmarks like HumanEval and MMLU.

Inference optimizations are discussed, which improve the performance from 8.7% to an average of 20.2%.

The problems in the benchmark are complex word problems requiring advanced data structures and algorithms.

The tutorial aims to push the limits of Large Language Models (LLMs) to see where they break down in reasoning.

Different techniques such as self-reflection and retrieval from semantic or episodic knowledge are explored to improve model performance.

The best-performing retrieval type, episodic knowledge, is chosen for implementation due to its complementarity with self-reflection.

A three-part tutorial structure is outlined, starting with a zero-shot reflection agent, then adding episodic memory, and finally introducing a human interrupt.

The reflection agent corresponds to a 12.38% pass rate, which is better than the base zero-shot agent.

The addition of episodic memory as a form of retrieval improves logical performance in the model, achieving about 22.2% on the benchmark.

The human interrupt allows the user to guide the agent towards the correct answer, which is not benchmarked but is pragmatic for application designs.

The tutorial demonstrates the creation of a state machine using LangGraph to control and guide the agent within a task domain.

LangSmith is used for tracing and debugging each step of the agent's operation to catch mistakes and understand the process.

The state graph in LangGraph is defined by nodes as units of work and edges as control flow, with the state holding the agent's messages list and test cases.

The solve node composes a prompt with an LLM to generate a candidate solution, while the evaluate node tests the solution against test cases.

The tutorial shows how to implement a human-in-the-loop interface to allow for user guidance and potentially achieve a better outcome.

The final success of the agent in solving a silver level question demonstrates the potential of combining autonomous agents with human input.

The tutorial concludes by emphasizing the current limitations of LLMs and the potential of hybrid systems for improving performance on challenging problems.

Transcripts

play00:01

Will Fu-Hinthorn: Hi all, this is Will from LangChain.

play00:02

Let's build computing Olympiad agents together.

play00:06

So about a week ago, this team out of Princeton came out with a

play00:08

paper called Can Language Models Solve Olympiad Programming?

play00:11

It was done by the folks such as Chen Xie and Shunyu Yao, who you might

play00:14

recognize from the React paper, Tree of Thoughts paper, and Reflexion papers.

play00:19

This paper really has two interesting components to it.

play00:21

on one hand, it's a data set paper.

play00:23

They come out with this challenge benchmark of 307 competitive programming

play00:27

problems from the USA Computing Olympiad.

play00:30

And they showed that GPT 4 only has about an 8.7

play00:32

percent pass rate when trying to do this with a simple zero

play00:36

shot React agent framework.

play00:38

This is in contrast to some of the existing benchmarks, like HumanEval,

play00:41

MMLU, that are mostly saturated by this crop of language models.

play00:46

They then show some inference optimizations, basically prompting

play00:50

techniques or systems engineering types of approaches to improve the

play00:55

performance on average from that 8.7

play00:57

percent up to 20.2

play00:59

percent.

play01:00

And we'll go more in detail on each of those techniques as we

play01:04

go throughout this tutorial.

play01:06

Let's get a sense for the difficulty of the types of problems in this benchmark.

play01:10

Alright, so here's an example question.

play01:12

You can see they're mostly word problems that require you to identify the

play01:16

underlying mathematical problem you need to solve, use advanced data structures and

play01:21

algorithms, and compose them in a creative way to come up with the correct solution.

play01:26

It also has to be done and implemented in a way that solves

play01:29

it within a given time limit.

play01:32

As you can see from this diagram, there's a lot of different usage of

play01:34

sets and other types of things there.

play01:37

The questions are challenging.

play01:38

There's a reason why it's called an Olympiad.

play01:40

What I find interesting about this benchmark is that it really

play01:43

pushes LLMs to the limit, so you can see where it breaks down.

play01:46

I think when we're using them in normal life, you often start

play01:48

to anthropomorphize them and start to think that they're reasoning.

play01:52

Whenever you get to this extent, you can see when they start doing things that

play01:56

look like, interesting, close to correct, but don't have this logical property.

play02:01

So as I mentioned before, when they first run GPT 4 on each of these

play02:05

problems, it gets a really low pass rate.

play02:08

They later show a number of inference techniques that

play02:10

they could do to improve it.

play02:11

Some of these include self reflection and some of these involve

play02:15

retrieval, either from semantic knowledge or episodic knowledge.

play02:18

They do a lot of experiments to show what types of retrieval actually

play02:22

improve the performance of the model.

play02:24

We're going to implement the best performing retrieval type, this episodic

play02:28

knowledge type, since it seems to be very complementary to self reflection and

play02:32

works well within our tutorial structure.

play02:34

Let's check out our LangGraph docs to see how we're going to

play02:37

implement this in the tutorial.

play02:39

Alright, so for the remainder of the video, we're going to be creating

play02:45

an agent to solve these types of competitive programming problems.

play02:48

We'll break it down into three parts, following the paper's structure and making

play02:53

agents that are increasingly capable of solving these advanced questions.

play02:58

In the first step, we'll implement the reflection agent, the zero shot

play03:01

agent that does self reflection.

play03:04

This corresponds to here, this block in our graph that we're going to be creating.

play03:08

It's going to have two simple nodes and just run in a loop until it can solve

play03:11

the problem correctly, or runs out of time.

play03:14

This is roughly analogous to this reflexion agent they

play03:18

created, that gets about a 12.38

play03:20

percent.

play03:21

So again, better than the base zero shot agent, not as good

play03:24

as what they can get overall.

play03:26

In part two of the tutorial, we're going to implement retrieval as a form of

play03:31

episodic memory, what the paper calls it.

play03:35

This is part two here, where we'll be then retrieving some high quality

play03:39

examples to include within the prompt and hopefully induce better

play03:43

logical performance in the model.

play03:46

This is analogous to this section here that gets about the 22.2

play03:49

percent on the benchmark overall.

play03:52

In part three, we're going to add a human interrupt.

play03:56

So we're going to let us, the user, actually weigh in on the

play03:59

answer and help guide the agent to come to the correct answer.

play04:03

You may consider this cheating, and in fact we aren't actually going to

play04:07

benchmark this on the whole dataset, and the authors don't as well.

play04:11

But, in a lot of application designs, what you actually

play04:15

want is the best outcome overall.

play04:17

And so copilot-type setups are a great pragmatic approach to getting

play04:22

a better overall outcome where either perhaps the human alone or the agent

play04:26

alone couldn't actually get there.

play04:28

A big theme throughout this is that autonomous agents really aren't quite

play04:31

there yet especially when you're just prompting in a simple zero shot manner.

play04:36

But you can create state machines using frameworks like LangGraph to

play04:40

really control and guide it within your task domain and hopefully

play04:44

get better outcomes overall.

play04:47

Let's download this notebook and then run through it together.

play04:52

All right, now we're ready to start running through the tutorial.

play04:55

We've opened this up in Jupyter.

play04:57

And we're going to install some prerequisites here.

play05:00

It's basically LangGraph.

play05:01

We're going to add some tracing here with LangSmith.

play05:03

The agent is going to be powered by Anthropic's Claude model, in our case.

play05:07

And we've got a couple of other things to pull things from the hub.

play05:10

We'll set our environment.

play05:12

In our case, it'll be Anthropic's API key to connect to their API.

play05:15

And then we'll also configure tracing.

play05:17

There's a lot of steps that go on here, and each of the programs can

play05:20

be quite long, so visualizing it in a notebook can be pretty cluttered.

play05:24

I'd recommend using LangSmith just so you can debug each step

play05:27

and see exactly what's happening.

play05:28

It's easier to catch mistakes, easier to see what's going on.
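
For reference, the tracing setup boils down to a few environment variables. A minimal sketch, assuming the standard LangSmith variables; the project name is a made-up example:

```python
import getpass
import os

# Enable LangSmith tracing (the project name here is just an example).
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangSmith API key: ")
os.environ["LANGCHAIN_PROJECT"] = "olympiad-agents"

# The agent itself talks to Anthropic's API.
os.environ["ANTHROPIC_API_KEY"] = getpass.getpass("Anthropic API key: ")
```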

play05:33

We'll then fetch the data we've stored in Google Cloud Bucket

play05:36

and we'll load it into memory.

play05:37

And then finally the utility here.

play05:40

is to run test cases.

play05:42

I'll note that this is executing code locally, so proceed with caution.

play05:45

There's an inherent risk if the LLM does generate malicious code or something

play05:49

that will have an OOM or something.

play05:52

Here's an example run of the test case runner on this print hello world.

play05:56

If it passes, it would return passed.

play05:58

If not, it would return wrong answer along with the expected output.
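
The notebook's runner isn't reproduced here, but a minimal stand-in with the behavior described (run the program locally, compare stdout, report passed / wrong answer / timed out) might look like the following; the function name and return strings are assumptions:

```python
import subprocess
import sys

def check_correctness(program: str, input_data: str,
                      expected_output: str, timeout: float) -> str:
    """Run a candidate program locally and grade its output.

    This executes untrusted, LLM-generated code -- proceed with caution.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            input=input_data,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timed out"
    if result.returncode != 0:
        return f"failed: {result.stderr}"
    if result.stdout.strip() == expected_output.strip():
        return "passed"
    return f"wrong answer. Expected:\n{expected_output}\nGot:\n{result.stdout}"
```

For example, check_correctness("print('hello world')", "", "hello world", timeout=2.0) would return "passed".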

play06:01

All right, so now I can finally start defining a graph.

play06:05

In part one again, we'll be doing this simple zero shot agent with reflection.

play06:10

Note in the paper they used an explicit Reflexion prompt.

play06:13

We've adapted it a bit and then prompted our agent here to be reflecting on its

play06:18

output here when it's doing tool calling.

play06:20

This one's going to be relatively rudimentary.

play06:22

It corresponds to that agent that gets about a 12 percent

play06:25

pass rate on the benchmark.

play06:26

And so for our first question, we kind of expect it not to

play06:30

pass, but I think that's okay.

play06:31

We've got some other tricks up our sleeve.

play06:34

Now the next part may be review if you've already built with LangGraph,

play06:37

but I want to review it anyways, because I think it is important.

play06:40

The main primitive in LangGraph is a state graph.

play06:43

It's how you define a state machine.

play06:45

Basically the nodes define the units of work.

play06:48

And the edges define the control flow.

play06:50

Once a node completes, the edge defines which node to pass to next

play06:55

in order to continue operation.

play06:57

In its most simple case, you essentially have two nodes, as in our case, and it loops

play07:01

back and forth, and then will output once it has reached some end state.

play07:08

The final piece of this that's really important is the state.

play07:11

Of course, this defines the interface for all the nodes, so each node

play07:15

receives the state as an input and then returns an update to the state.

play07:19

That could be an entire state itself, or just some subset

play07:23

that then the graph is able to incorporate into the previous state.

play07:27

In our case, the main aspect of this state is going to be this messages

play07:31

list that we're annotating as being append only using this function,

play07:35

using Python's annotated syntax.

play07:37

This basically keeps the agent scratch pad as it generates a

play07:41

candidate and then receives a tool response, and then continues to

play07:44

iterate and try to improve the solution.

play07:47

The rest of the state here holds the test cases essentially, and the runtime limit

play07:51

to configure how the evaluate node runs.

play07:54

The agent itself is gonna ignore this, but the test runner will incorporate it.
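
Putting the narration together, the state is roughly the following TypedDict; the exact field names here are assumptions:

```python
import operator
from typing import Annotated, TypedDict

from langchain_core.messages import AnyMessage

class TestCase(TypedDict):
    inputs: str
    outputs: str

class State(TypedDict):
    # Annotated with operator.add, this list is append-only: each node's
    # returned messages are concatenated onto the existing scratchpad
    # rather than overwriting it.
    messages: Annotated[list[AnyMessage], operator.add]
    # Read by the evaluate node's test runner; the agent itself ignores these.
    test_cases: list[TestCase]
    runtime_limit: int
    status: str
```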

play07:58

Now that we've defined the state, we'll update our data set to be in the right

play08:04

format essentially to be able to pass it in and we can

play08:07

start defining the good parts.

play08:09

Remember, this has two nodes.

play08:11

We've got the solve node and the evaluate node.

play08:12

So first the solver here is pretty simple.

play08:16

We're basically taking a prompt and we're going to compose it with an LLM.

play08:20

So this formats into a prompt and then passes it to the LLM.

play08:24

This bind tools operation is just configuring a schema for the

play08:28

LLM, so it knows the structure that it should respond with.

play08:31

In our case, we're going to use this write Python tool.

play08:34

This writePython schema.

play08:35

We're basically going to ask it for some reasoning and pseudocode to induce some

play08:38

chain of thought reasoning beforehand.

play08:41

And then we'll finally have it write all of the Python 3 code

play08:44

into the code string here.

play08:46

This just makes it a lot easier to parse on the tail end, so you don't

play08:49

have to be parsing out raw strings.
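
In code, the solver is roughly a prompt piped into a tool-bound LLM. The actual notebook pulls its prompt from the hub, so the inline prompt text, model name, and field descriptions below are stand-ins:

```python
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

class writePython(BaseModel):
    """Write the solution to the problem as working Python 3 code."""
    reasoning: str = Field(..., description="Conceptual solution.")
    pseudocode: str = Field(..., description="Detailed pseudocode.")
    code: str = Field(..., description="Valid Python 3 solution.")

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a world-class competitive programmer. "
               "Solve the user's problem.\n{examples}"),
    ("placeholder", "{messages}"),
])
llm = ChatAnthropic(model="claude-3-opus-20240229")

# Reasoning and pseudocode come first to induce chain of thought; the final
# code lands in a dedicated string field, so no raw-string parsing is needed.
# `examples` stays empty in part one and gets populated in part two.
solver = prompt.partial(examples="") | llm.bind_tools([writePython])

def solve(state: State) -> dict:
    """Solve node: generate (or refine) a candidate solution."""
    return {"messages": [solver.invoke({"messages": state["messages"]})]}
```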

play08:52

I'll run this here and then we'll define a solver.

play08:55

So we pulled this, this solver prompt from the hub.

play08:58

It's pretty simple.

play08:59

I didn't do too much prompt engineering here.

play09:01

Note that we do have this variable examples that we'll use later in part two.

play09:05

Right now we're just going to fill this in through partialing with an empty string.

play09:08

That'll be the placeholder that we then populate with

play09:11

additional information that we retrieve, but more on that later.

play09:17

Here's an example run.

play09:19

We're just going to ask it, how do I get a perfectly random

play09:21

sample from an infinite stream?

play09:23

All right, we've gotten the response.

play09:25

So you can see it generates this thinking tag, and then eventually

play09:29

outputs things with the reasoning.

play09:31

You might see the pseudocode, and then eventually the code, and it is doing

play09:34

reservoir sampling as we'd expect.

play09:36

It at least has studied some code.

play09:38

That's a good thing.
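
For context, a single-element reservoir sample is only a few lines: each item from the stream replaces the current sample with probability 1/n, which leaves every item selected with equal probability. A reference version of the algorithm the model was reaching for:

```python
import random
from typing import Iterable, Optional, TypeVar

T = TypeVar("T")

def reservoir_sample(stream: Iterable[T]) -> Optional[T]:
    """Return one uniformly random element from a stream of unknown length."""
    sample = None
    for n, item in enumerate(stream, start=1):
        # Keep the new item with probability 1/n; by induction, every item
        # seen so far ends up as the sample with probability 1/n.
        if random.randrange(n) == 0:
            sample = item
    return sample
```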

play09:39

One thing that I like about Claude as opposed to GPT 4 is that it was trained

play09:45

to output this thinking or prelude text, you can think of it, before

play09:50

actually doing the tool invocation.

play09:51

I think there's often this question when using GPT 4 with tool calling of

play09:55

how to incorporate chain of thought.

play09:57

And we've found that it sometimes suffers with some of these

play10:00

more complex tasks as a result.

play10:01

I think it's quite nice that they've trained the model to output this before

play10:05

doing the tool invocation so you don't have to be making sacrifices there.

play10:09

So the second node that we'll define in this loop, this agent

play10:12

loop, is the evaluate node.

play10:14

And there's a good amount of error handling and stuff here, but really the

play10:16

key bit here is we're going to iterate through all the test cases in our state.

play10:20

Again, recall that each node is going to receive an instance of

play10:23

the state and return an output, so in this case an updated message.

play10:26

But it iterates through the test cases, runs them, and then if it succeeds,

play10:29

it'll update the status in our state.

play10:32

Otherwise, it's going to format each of these things individually,

play10:35

and then add the messages.
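
Stripped of that error handling, the evaluate node is roughly the following sketch, reusing the check_correctness stand-in from earlier; the exact feedback formatting is an assumption:

```python
from langchain_core.messages import AIMessage, HumanMessage, ToolMessage

def evaluate(state: State) -> dict:
    """Evaluate node: run the candidate program against every test case."""
    ai_message: AIMessage = state["messages"][-1]
    if not ai_message.tool_calls:
        return {"messages": [HumanMessage("No code submitted. Please retry.")]}
    tool_call = ai_message.tool_calls[0]
    code = tool_call["args"]["code"]
    results = [
        check_correctness(code, tc["inputs"], tc["outputs"],
                          state["runtime_limit"])
        for tc in state["test_cases"]
    ]
    if all(r == "passed" for r in results):
        return {"status": "success"}
    # Feed per-test failures back as the tool response so the next solve
    # step can reflect on them.
    feedback = "\n".join(
        f"<test {i}>\n{r}\n</test {i}>" for i, r in enumerate(results)
    )
    return {
        "messages": [ToolMessage(content="Incorrect submission:\n" + feedback,
                                 tool_call_id=tool_call["id"])],
        "status": "failed",
    }
```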

play10:39

From there, you can create the graph, and we'll visualize it.

play10:45

So, again, this corresponds to that graph we had above, it's a little more simple.

play10:49

We put the initial problem in here, it tries to generate a solution, it tests it

play10:54

out, and then we enter this control edge.

play10:56

If it's successful we end, otherwise we go back to the solver and we keep on looping.

play11:01

Here the control edge, you can see there, it just also receives the state and then

play11:06

checks that state to see if it succeeded.

play11:09

It then returns a string, end with double underscores, and

play11:13

this just tells the graph that it doesn't need to continue looping.

play11:16

Otherwise, it says go back to the solver node.

play11:19

That's how we would define it here with the conditional edges.
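
Wiring the two nodes together with that conditional edge looks roughly like this (END is LangGraph's double-underscore "__end__" sentinel mentioned above):

```python
from langgraph.graph import END, StateGraph

builder = StateGraph(State)
builder.add_node("solve", solve)
builder.add_node("evaluate", evaluate)
builder.set_entry_point("solve")
builder.add_edge("solve", "evaluate")

def control_edge(state: State) -> str:
    # Stop once the submission passed; otherwise loop back to the solver.
    return END if state.get("status") == "success" else "solve"

builder.add_conditional_edges("evaluate", control_edge,
                              {END: END, "solve": "solve"})
graph = builder.compile()
```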

play11:24

Let's see the first question.

play11:25

Question about Farmer John.

play11:27

Looking at, you know, he's got some productivity of the day.

play11:31

Bessie, you've got the cow.

play11:33

Seems pretty complicated.

play11:34

It's got some number of sample inputs, sample outputs, as a part of the question.

play11:40

Let's try running our agent here.

play11:42

Okay.

play11:43

So again, this is the simplest version of the agent that we're

play11:46

going to be building today.

play11:48

I fully expect it not to work, honestly.

play11:50

And the way LangGraph indicates this typically is through

play11:53

a graph recursion error.

play11:54

Basically by default, the graph has a limited number of steps.

play11:58

You can configure this.

play11:59

We show this in the docs elsewhere.

play12:01

I won't go into detail here, but if it surpasses the maximum number of configured

play12:05

steps, it'll raise a recursion error.

play12:08

Let's wait for this to happen.

play12:09

Continue to populate.

play12:11

All right.

play12:12

It looks like, as we said, it's going to hit the recursion limit.

play12:15

It wasn't able to actually solve it.

play12:16

You can see each of these steps below, and we'll actually jump over to

play12:20

LangSmith to check out the trajectory.

play12:23

One second.

play12:25

All right, so we've checked out this trajectory here.

play12:28

You can see it going through the loop as we defined it.

play12:31

You have the prompt and LLM and then you have the evaluate node and

play12:35

it goes in this loop and continues until it eventually has this error.

play12:39

I always like to jump into one of the later LLM calls because it does collect

play12:43

this full history of messages and you can see exactly what it's doing.

play12:46

So initially it says solve this problem.

play12:48

I'll do this.

play12:49

The key insight here and tries to write out the pseudocode and

play12:52

thinking and then generates it all.

play12:54

Then the second one is: the current solution times out in a

play12:58

larger test case, likely because it iterates through all possible lanes.

play13:01

So it's trying to actually self-correct here, but it's unable to; it

play13:06

doesn't have a lot of good examples of solving this type of problem,

play13:08

I guess, in its memory.

play13:10

And so, as you can see, it goes through quite a number of things.

play13:15

So it wasn't able to get it correct.

play13:17

That's all right.

play13:18

The paper presents at least one more automatic improvement that we

play13:22

can be incorporating into it and then also presents the idea of more

play13:26

of a copilot type of scenario, which we'll get to in part three.

play13:28

But in part two, let's jump into the memory and retrieval optimization.

play13:34

All right, so back in the notebook, in part two, we're going to be

play13:36

implementing this few shot retrieval optimization that the paper proposes.

play13:40

And the authors call this an episodic memory because it's retrieving

play13:44

these outputs from the other question answer pairs in the corpus.

play13:48

So if you pretend that the algorithm, or that the agent, has already

play13:51

solved all these other problems, it could then recall this and use

play13:55

this for solving later problems.

play13:57

It's kind of an interesting framing, and kind of in contrast to the rule of

play14:02

thumb, where people tend to talk about RAG and retrieval as a way to improve

play14:06

knowledge and update knowledge, but not as a mechanism for actually improving

play14:10

the reasoning capabilities of the agent.

play14:13

Though since these are extremely well selected, well crafted in domain examples,

play14:19

this does then align more with few shot instruction and optimizations like that.

play14:23

That's more of what it is.

play14:25

The paper also explores a semantic memory where it's retrieving over

play14:27

textbooks and things, and that does show a brief boost, but doesn't seem to hold

play14:32

whenever they incorporate that later with the reflection and other things.

play14:35

So, it seems to be a technique that doesn't really scale quite to the

play14:39

same extent as this high quality instruction type of data set.

play14:42

So, we'll skip that for here.

play14:44

Following the paper, we'll use as the retriever this BM25 retriever.

play14:50

It's essentially a more old-fashioned, non-vector-based, TF-IDF-based retrieval

play14:56

mechanism here that's high quality.

play14:59

And then to accommodate these steps, we're going to add two more keys

play15:03

to our state compared to part one.

play15:06

We're going to add this candidate message

play15:08

that we generate first, and it's going to be used in the retrieval step.

play15:11

And then we'll have the examples formatted in strings.

play15:14

If you remember the prompt at first, it had that examples template.

play15:18

We'll finally be populating that here.

play15:21

And then again, recall that this episodic memory happens before our agent loop.

play15:26

This part is going to be untouched here.

play15:28

But here we're going to be having the retrieval step.

play15:31

And then we'll still ignore this and save this for part three.

play15:34

So once we define our new state, we can define the solver.

play15:37

This is mostly repeat from before, except we're going to have a little if statement

play15:40

to populate the candidate step, if it's still in that first stage there.

play15:44

So we're going to make this draft solver and solver, they're pretty much

play15:46

identical, except the draft solver of course hasn't already had questions here.

play15:52

Well then, to just make sure that we avoid cheating by putting the

play15:56

actual answer to our question in the retriever, we're going to separate out

play16:00

this as train and test corpora, and then we'll create the retriever here.

play16:05

Finally, it's time to define the retrieve node.

play16:10

So as before, it receives the state here as all the nodes do, and

play16:14

then returns this updated state.

play16:16

So it's going to update the examples key in particular.

play16:19

And then within this, it calls the retriever here.

play16:22

So, Retriever invoke picks out the top k there, and then formats

play16:28

this in a string that we're going to be updating in the prompt.

play16:31

You'll notice that we add this runnable config here in this graph

play16:35

that we're defining, we're going to let you configure whenever you

play16:37

invoke the actual agent the number of retrieved examples that you'll have.

play16:42

One way to do that is through these configurable params in the config, which

play16:45

is always the second positional argument.

play16:48

One more thing to note about this retriever setup that I think is

play16:52

quite interesting, is that we're retrieving, and we're treating the

play16:55

candidate program as the query, rather than the initial question.

play17:00

This is kind of similar to techniques you might have come across, such as HyDE, or

play17:04

some of these others, like RAFT, or other types of indexing strategies for RAG.

play17:09

The observation here is that The distribution of queries is different

play17:14

than the distribution of documents that you're trying to retrieve from.

play17:17

And so you either want to be creating hypothetical queries from the documents

play17:22

that will better align with the type of queries that we're going to be

play17:24

putting into the system, or we want to map from the queries to what you'd

play17:28

expect the document to look like and then maybe retrieve from that.

play17:32

And then there's some other variants there, but basically you're saying

play17:35

the type of text and the type of words that we're going to put in these two

play17:37

things is not going to be the same.

play17:38

And so you get better results if you can try to translate them.
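
Under those assumptions, the retriever and retrieve node might look like the sketch below: the train split is indexed as documents, the query is the drafted candidate program rather than the question, and k comes from the run config. train_examples and the state's candidate/examples keys are names assumed for illustration:

```python
from langchain_community.retrievers import BM25Retriever
from langchain_core.documents import Document
from langchain_core.messages import AIMessage
from langchain_core.runnables import RunnableConfig

# Assumed: `train_examples` holds problem+solution texts from the train split.
retriever = BM25Retriever.from_documents(
    [Document(page_content=text) for text in train_examples]
)

def retrieve_examples(state: State, config: RunnableConfig) -> dict:
    """Retrieve node: query with the candidate *program*, not the question,
    so the query's vocabulary matches the solution corpus (HyDE-style)."""
    top_k = config["configurable"].get("k", 2)
    candidate: AIMessage = state["candidate"]  # set by the draft node
    if not candidate.tool_calls:
        raise ValueError("Draft agent did not produce a candidate program.")
    code = candidate.tool_calls[0]["args"]["code"]
    docs = retriever.invoke(code)[:top_k]
    examples = "\n".join(
        f"<problem>\n{doc.page_content}\n</problem>" for doc in docs
    )
    return {"examples": examples}
```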

play17:42

Finally, it's time to build the graph.

play17:43

So again, most of this is the same here.

play17:47

So you see solve, evaluate, and like all this stuff is untouched.

play17:51

The things that we've added here really are this draft node at the

play17:54

beginning, where we're putting in that solver, and the retrieve node here.

play17:58

So we're setting the draft node as the entry point, retrieving, and

play18:01

then we're going to always progress in a directed edge from draft to

play18:04

retrieve, and then retrieve to solve.

play18:06

And then, again, we'll create the loop of solve to evaluate, and then from evaluate,

play18:10

either go to the end or back to solve.

play18:13

So we're creating that.
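
The part-two wiring, then, just prepends the two new nodes to the part-one loop. A sketch, assuming a draft_solver that is the solve node minus the retrieved examples (it also records the candidate key), and using an in-memory SQLite checkpointer as discussed next:

```python
import sqlite3

from langgraph.checkpoint.sqlite import SqliteSaver
from langgraph.graph import END, StateGraph

builder = StateGraph(State)  # State now also carries `candidate` and `examples`
builder.add_node("draft", draft_solver)          # drafts and records the candidate
builder.add_node("retrieve", retrieve_examples)
builder.add_node("solve", solve)                 # solve with retrieved examples
builder.add_node("evaluate", evaluate)
builder.set_entry_point("draft")

# Always progress draft -> retrieve -> solve, then enter the old loop.
builder.add_edge("draft", "retrieve")
builder.add_edge("retrieve", "solve")
builder.add_edge("solve", "evaluate")
builder.add_conditional_edges("evaluate", control_edge,
                              {END: END, "solve": "solve"})

checkpointer = SqliteSaver(sqlite3.connect(":memory:", check_same_thread=False))
graph = builder.compile(checkpointer=checkpointer)
```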

play18:14

Here's the visualization, in case that's easier to see.

play18:17

So again, the rest of this is all the same, we've just added

play18:20

these two steps at the beginning.

play18:22

Let's try it out.

play18:23

We're gonna, since we added a checkpointer here, we're gonna add

play18:26

just ignore this here, but we're gonna say retrieve three of these examples

play18:30

from the corpus and pass those in.
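
As an invocation, that might look like the following; thread_id keys the checkpoint, k is the knob the retrieve node reads, and `row` stands in for one record of the dataset:

```python
from langchain_core.messages import HumanMessage

config = {"configurable": {"thread_id": "question-1", "k": 3}}
input_state = {
    "messages": [HumanMessage(row["description"])],  # assumed field names
    "test_cases": row["test_cases"],
    "runtime_limit": row["runtime_limit"],
    "status": "in_progress",
}
for event in graph.stream(input_state, config, stream_mode="updates"):
    print(event.keys())  # which node just ran
```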

play18:34

This is also going to take some time, so bear with us.

play18:37

All right, looks like our graph finished already.

play18:39

You can see this is a little bit truncated.

play18:40

Let's jump over to LangSmith to see what exactly was done.

play18:43

But you can see that from the state that we've got, you can see

play18:46

that it was a success this time.

play18:49

All right, now that we've jumped over to the LangSmith trace, you can see it's only

play18:52

a few steps this time, so that's great.

play18:53

And following our graph structure, we had the draft node here, which again,

play18:58

you put in the initial question, the system prompt and the question, and it

play19:01

outputs the initial answer to the problem.

play19:05

We retrieve some examples from it.

play19:06

So again, see, the query is this candidate program that we talked about, and

play19:09

it retrieves other examples from the corpus, and then we pass that in here.

play19:13

So you can see, Now those examples are formatted here in the system prompt

play19:17

for the solver, and then it has that.

play19:20

Note that we don't have the initial candidate program in here because we've

play19:23

saved it in a different key in the state, but then it tries to generate an answer.

play19:27

This time it's correct, and the test cases are successful.

play19:31

So, it's great.

play19:32

I'll jump back to part three because we see it solves this bronze level

play19:36

question, but how well does it solve some of the more difficult

play19:39

challenging ones in this benchmark?

play19:42

Alright, so Jumping back to the notebook, let's test it on a

play19:45

harder, silver level question.

play19:47

So we've got this from the test dataset.

play19:49

You can see the question here.

play19:53

It's a river cruise one.

play19:54

Basically you're trying to detect cycles and then simulate

play19:58

different steps of it as well.

play20:00

It gives a couple of sample inputs here, but it's a much more challenging

play20:03

question than the first one here.

play20:05

We'll format that and then run it.

play20:06

And we'll see how it does.

play20:08

I fully expect this not to work.

play20:10

It may.

play20:11

But I expect this to fail.

play20:13

Because it's just a more challenging problem.

play20:16

And these LLMs, while they have been trained on a lot of code, a

play20:19

lot of reasoning types of problems, whenever there's new ones, they

play20:22

just kind of struggle to be composing them in creative ways.

play20:25

So some of these techniques can help, but I think we're running into some of

play20:29

the limits of the reasoning capabilities of agents as they stand today.

play20:35

Alright, so our optimized graph has finished, and we got

play20:39

another graph recursion error.

play20:40

It wasn't able to correctly answer the problem in the allocated number of steps.

play20:46

That's okay.

play20:47

In fact, we expected it.

play20:48

These are extremely challenging problems, and they push LLMs, at least as they

play20:54

are trained and designed today, to the limits of their reasoning capabilities.

play20:58

It requires some novel combination of algorithms and data

play21:03

structures in challenging ways.

play21:06

The paper explores then one final inference time optimization, which really

play21:11

gets us from the realm of autonomous agents to the realm of copilots.

play21:15

So that they can benchmark it, they restricted human involvement to simple

play21:19

guidance and prodding without revealing any parts of the answer or anything.

play21:24

But When you're building an actual application, you often want to

play21:27

optimize really for the end user experience and maximizing the

play21:31

chance of accomplishing the goal.

play21:33

If you are going to be creating application where a user is in the

play21:37

loop, you want to give them a nice ability really to be providing guidance

play21:41

whenever and wherever they want.

play21:43

LangGraph makes this pretty easy.

play21:45

So for the sake of this tutorial, and in part 3, we're just going to add a generic

play21:50

human-in-the-loop interface to our agent.

play21:54

And we're going to insert it right here.

play21:56

So the agent graph is going to be structured exactly

play21:59

the same way as part two.

play22:01

We'll insert the problem, the agent will generate a candidate program,

play22:05

the program will retrieve similar high quality examples from this corpora of

play22:09

semantic memory, or episodic memory, and then the agent tries to solve

play22:14

it, it generates the program here, then runs the test cases on them.

play22:19

Then what we're changing in part three is we're going to interrupt here, and then

play22:23

we'll say, a human is allowed to then look at the keys and state of the graph

play22:28

at this point, perhaps optionally add a message saying to consider some alternate

play22:33

route, consider looking into a specific part of the of the generated program, etc.

play22:39

We can then resume execution at any time, since LangGraph lets you just persist

play22:43

this in a checkpoint, and continue trying.

play22:45

You can continue this loop and continue to intercede and provide feedback as it

play22:50

goes through and hopefully prevent it, prevent the LLM from falling into these

play22:54

local optima, these local pitfalls, where it's just looping through and unable

play22:58

to actually accomplish the real task.

play23:00

Once you can collaboratively come to the final answer, the agent can, and

play23:05

the graph can finally finish executing.

play23:08

In theory, this type of design is only restricted by the quality or

play23:13

the capability of the user involved.

play23:15

Since really we can be providing the, the correct answer or any type of feedback

play23:19

and the LLM will be able to synthesize this and incorporate it into it.

play23:23

So let's jump in here and create this human-in-the-loop agent for

play23:27

solving computing Olympiad problems.

play23:29

This code block here is exactly the same as in part two.

play23:32

We get our checkpoint right here, and we're just going to do that in memory.

play23:36

We've got our state graph, we're using the exact same state here.

play23:39

We've got the prompt in LLM.

play23:40

We create this draft solver, add it to the node.

play23:43

We set that as the entry point.

play23:45

We create the retrieve node.

play23:47

We create the solver node.

play23:49

And the evaluation node to run the test cases.

play23:51

Then we start connecting them.

play23:52

So we add an edge for draft to retrieve, retrieve to solve.

play23:55

Solve to evaluate and then we add these conditional edges to

play23:58

define the conditional looping.

play24:00

So we say, once you run the test cases we'll either go back to the

play24:04

solver or we'll finish if we succeed.

play24:06

And we'll create the check pointer as well.

play24:10

The one different thing in part 3 compared to part 2 is we're

play24:13

going to add this interrupt after the evaluation node command.

play24:16

So basically before it goes to a human step, we're going

play24:20

to tell the graph, Hey, stop.

play24:22

and allow the human or any other process to be modifying the state.
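
Concretely, that is one extra argument at compile time, after which you can peek at the paused state; a sketch, reusing the part-two builder, checkpointer, and config:

```python
graph = builder.compile(
    checkpointer=checkpointer,
    # Pause after the test cases run so a human (or any other process)
    # can inspect and edit the checkpointed state before the loop continues.
    interrupt_after=["evaluate"],
)

snapshot = graph.get_state(config)
print(snapshot.values["status"])  # e.g. "failed"
print(snapshot.next)              # node slated to run next, e.g. ("solve",)
```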

play24:28

Let's visualize this here.

play24:29

As you can see, the graph looks exactly the same as in part

play24:31

two, and we'll start running it.

play24:34

So again, this is just going to continue executing until it reaches that interrupt.

play24:41

All right, so our graph has stopped executing.

play24:45

You can see the current state by looking at this snapshot using the config, and

play24:50

you can see again the problem there.

play24:52

Note that it doesn't say a graph recursion error, but it still has gotten the

play24:57

incorrect submission as we expected.

play24:59

Since we added the interrupt, it actually will stop this loop and then

play25:03

we can resume it. So we can look again.

play25:07

Here's the silver problem that we were looking at before.

play25:09

It was unable to solve it so far.

play25:11

We're going to look at its current candidate solution.

play25:14

So this is again printed out from the agent right now.

play25:17

Looks okay.

play25:18

Maybe a little simple.

play25:19

Definitely doesn't handle all of the edge cases.

play25:22

And then we can look at this test here, because that's the last tool message here.

play25:25

Incorrect submission.

play25:26

It actually got 8 out of 10, so pretty close.

play25:29

Let's give it some recommendations here.

play25:32

Okay.

play25:33

as a human message.

play25:35

And then we'll check to make sure that that actually was reflected in the state.

play25:38

So you can see, we now have this human message we've done by

play25:41

calling graph dot update state and providing that config there.

play25:45

So it tells which snapshot to be updating.

play25:48

And then you have the human message here. The way we resume here

play25:55

is we pass in a null value, None.

play25:58

And then, since we're using the

play25:59

same config, it knows to load the current state from the check

play26:04

pointer that we've compiled into the graph. Again, we use that in-memory SQLite check

play26:09

pointer here for the sake of the tutorial. But there's a lot of implementations you

play26:12

can use to connect with your own storage architecture. This is going to take a

play26:17

little bit of time, so again, we'll resume whenever it's done executing.
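
In code, the feedback-and-resume step is roughly the following; the advice text is just an example:

```python
from langchain_core.messages import HumanMessage

# Append guidance to the checkpointed message list. Because `messages` is an
# append-only channel, this adds to the scratchpad instead of replacing it.
graph.update_state(config, {
    "messages": [HumanMessage(
        "Consider how your solution behaves once the route enters a cycle; "
        "you likely need to detect the cycle and fast-forward the simulation."
    )],
})

# Passing None as the input tells LangGraph to resume from the checkpoint,
# picking up at the next slated node, instead of starting a fresh run.
for event in graph.stream(None, config, stream_mode="updates"):
    print(event.keys())
```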

play26:21

Alright, so in our case, it actually was enough to get it to succeed.

play26:25

I had this other code here to try to prompt it into the right position, because

play26:30

occasionally it doesn't succeed even after that first bit of human feedback.

play26:33

But in our case, it does.

play26:35

As you can see, you can list through all of the states here, so I was just

play26:37

getting the most recent checkpoint from this graph's execution,

play26:41

and we see that it's a success.

play26:42

You can actually check out the Langsmith trace again here,

play26:46

I'll jump back to see it run.

play26:51

And we see that again, we passed in null.

play26:53

If you remember from the loop in the notebook, it goes, it loads.

play26:57

We go back to the solver because that's the next node slated

play27:00

to be executed in that graph.

play27:02

It already ran the draft and retrieve and all those steps.

play27:05

You can see the full list of messages to see that yes, indeed, it has been fetching

play27:09

these things from memory and includes this recommendation that we've added.

play27:14

We've put this all in a notebook, but again, you can put in any sort of UI

play27:18

above the LangGraph implementation and allow the user to be interacting

play27:21

with your copilot in arbitrary ways.

play27:24

So the AI is able to then incorporate this breakdown into an updated response

play27:28

that then passes all of the test cases.

play27:31

So I'd say it's a success.

play27:32

Thanks.

play27:44

Alright, so that brings us to the end of this tutorial.

play27:47

As we saw, LLMs, as they're trained today, alone aren't super great at

play27:52

solving the challenging type of reasoning problems posed

play27:55

by olympiad programming questions.

play27:58

However, through some prompting techniques and better systems

play28:02

design, you can improve the average performance of your agent

play28:05

greatly, from the low point of under 9 percent to above 20 percent.

play28:10

And when you're building real applications to solve challenging problems, you can

play28:14

create these sorts of human-in-the-loop interfaces easily with LangGraph to make

play28:18

it so that you can reach a better overall outcome compared to just the agent alone

play28:23

and compared to just the human alone.

play28:25

I think these general techniques are fairly, fairly expansive and

play28:29

can be applied in a lot of domains.

play28:31

So you'll recall, first we started with a zero shot agent with reflection.

play28:35

It basically prompted the agent to be looking at the test case results, looking

play28:38

at the current solution, and then try to incorporate those results and feedback

play28:42

into an updated candidate to eventually solve the problem correctly.

play28:46

We saw how even on a bronze level problem, this doesn't always

play28:49

work, and so we then added in an additional optimization for retrieval.

play28:55

This, which the authors pose as a sort of episodic memory, allowed the

play29:00

model to then fetch these really high quality examples from the corpus,

play29:04

and use that to try to trigger a little bit better of an output that

play29:08

follows a similar design and approach.

play29:12

This can induce better reasoning in this case because of

play29:14

these few shot instructions.

play29:17

We saw that that was able to solve the bronze level question, but it

play29:20

failed on the silver level question.

play29:22

So then we added in this human-in-the-loop interface with the interrupt-after that

play29:26

allowed us to then go in and modify the checkpointed state of the graph to then

play29:30

guide the agent to come to a proper solution to the problem that's hard.

play29:36

As you can see from all of these steps, autonomous agents are really cool.

play29:40

They're not quite there yet in all of these challenging problems, but

play29:44

through better engineering, through using state machines, and all of this,

play29:46

you can come to some better designed systems that are actually capable of

play29:50

doing some pretty impressive work.

play29:52

I'm excited by this new data set because it is a lot more challenging

play29:55

and shows the cracks in the abilities of our current types of language models

play29:59

as they are today, while also showing how these hybrid systems, these neuro

play30:03

symbolic approaches can be really powerful in improving performance.

play30:06

I'm excited to see other people come out with better systems that can surpass those

play30:11

presented by the authors and hopefully get to the point where they can solve all

play30:14

these types of questions, even without resorting to much, much larger models.

play30:21

That's all that we have for our tutorial today.

play30:23

If you have any questions or comments, feel free to leave

play30:25

them in the comments below.

play30:27

Check out the links in the description as well to see the

play30:29

code and run it for yourself.

play30:30

And let us know what other types of tutorials you'd like to see that would

play30:33

be helpful in implementing your own agents and chatbots and assistants.

play30:38

Until next time, this is Will, and hope you have a great day.

play30:41

Bye!


Related Tags
Programming Olympiad, AI Agents, Language Models, Zero-Shot, Hybrid Methods, Tutorial, Problem Solving, Reflection, Retrieval, Human-in-the-Loop, System Design, Coding