Build Computing Olympiad Agents with LangGraph
Summary
TLDR: This video script explains techniques for improving a language model's ability to solve Olympiad programming problems. First, GPT-4 attempts problems from the USA Computing Olympiad and shows a low success rate; a prompting technique called self-reflection is then used to try to improve it. Next, a few-shot retrieval technique called episodic memory is introduced, which retrieves high-quality examples from past problem-solution pairs to boost performance further. Finally, a human-in-the-loop interface is added so that a person can step in and provide advice toward solving the problem. These techniques suggest that language models may eventually solve difficult problems on their own, but the video also makes clear they are not there yet.
Takeaways
- 🤖 In GPT-4's first attempt, the pass rate on the programming problems was 8.7%.
- 📈 Inference-time optimization techniques raised the pass rate to an average of 20.2%.
- 📚 The dataset consists of 307 problems from the USA Computing Olympiad.
- 🧠 LLMs (large language models) are pushed to their limits, revealing where their logical reasoning breaks down.
- 🔍 Self-reflection and episodic-memory retrieval contributed significantly to improving the model's performance.
- 📝 The tutorial builds, in three steps, agents capable of solving increasingly difficult problems.
- 🔄 A loop structure runs the agent repeatedly until the problem is solved.
- 🧑🤝🧑 Ultimately, human-agent collaboration (a copilot-style setup) was suggested to produce the best results.
- 🛠️ LangGraph can be used to control and guide the agent within the task domain, improving overall outcomes.
- 🔬 The tutorial emphasizes that human-in-the-loop is an important approach when building real applications.
- 📈 Introducing self-reflection, episodic memory, and human-in-the-loop progressively raised the difficulty of problems the agent could solve.
- 🌟 Future advances in LLMs and the development of systems with stronger problem-solving abilities are anticipated.
Q & A
What organization is Will from?
-Will is from an organization called LangChain.
What is the title of the paper published by the Princeton research team?
-The paper published by the Princeton research team is titled "Can Language Models Solve Olympiad Programming?"
What pass rate did GPT-4 achieve when it attempted the USA Computing Olympiad problems?
-GPT-4's pass rate on the USA Computing Olympiad problems was 8.7%.
What is built across the three parts of Will's tutorial?
-Across the three parts of the tutorial, an agent for solving competitive programming problems is built.
What is the reflection agent implemented in the first part?
-The reflection agent is a zero-shot agent that performs self-reflection, running in a loop until the problem is solved.
What is the episodic memory implemented in the second part of the tutorial?
-Episodic memory is a technique that retrieves high-quality examples from past problem-solving episodes and includes them in the prompt to improve the model's reasoning performance.
What is the human interrupt added in the third part of the tutorial?
-The human interrupt is a feature that lets the user intervene in the agent's answers and help guide it to the correct one.
What does using LangGraph make possible?
-Using LangGraph, you can build a state machine that controls and guides the agent within the task domain, improving the overall outcome.
What model is used in the tutorial?
-The model used in the tutorial is provided by Anthropic.
What is the ultimate goal of the tutorial Will presents?
-The ultimate goal is to build an agent capable of solving competitive programming problems, and to achieve better results not only with an autonomous agent but through a collaborative system that includes a human.
What does Will mean by "until next time" at the end of the tutorial?
-It means Will plans to make more tutorials like this and hopes viewers will look forward to future content.
Outlines
😀 Solving Computing Olympiad Problems with Language Models
Will Fu-Hinthorn discusses language models' ability to solve Computing Olympiad programming problems. He introduces a paper from a Princeton research team, describes GPT-4's performance and how it was improved, and touches on the problems' difficulty and the limits of language models.
🤖 A Zero-Shot Agent with Self-Reflection and a Loop
Using LangGraph, a zero-shot agent that performs self-reflection is built. The agent has a simple loop structure: understand the problem, generate a response, and run the test cases. Through self-reflection it tries to improve the generated code and solve the problem.
🧠 Few-Shot Retrieval with Episodic Memory
The few-shot retrieval with episodic memory proposed in the paper is implemented. By recalling outputs from past problem-solution pairs, the agent can apply them to later problems. A BM25 retriever fetches high-quality examples, which are incorporated into the prompt, with the expectation that this improves the model's reasoning performance.
🔄 Graph Optimization and Human Intervention
To tackle harder problems, human intervention is added to the graph. After test cases fail, a human can give the agent feedback and steer it toward a solution. Using LangGraph's features, execution can resume from a checkpoint, avoiding repeated loops while converging on a good solution.
🚀 Human-Agent Collaboration for Problem Solving
In the tutorial's final part, a demonstration shows how a human in the loop enables solving harder problems. The agent uses feedback provided by the human as it works through the problem. This both exposes the current limits of language models and expands what can be achieved by collaborating with humans.
📚 Tutorial Wrap-Up and Outlook
Will wraps up the tutorial by reflecting on language models' ability to solve Olympiad programming problems. He concludes that while language models alone have limits, prompting techniques and better system design improve performance, and collaborating with humans yields even better results. He expects better systems to emerge and performance to keep improving.
Keywords
💡Language model
💡Computing Olympiad
💡Zero-shot learning
💡Self-reflection
💡Episodic memory
💡Hyperparameter
💡Graph recursion error
💡Human intervention
💡State machine
💡Hybrid system
Highlights
Will from LangChain introduces a project to build computing Olympiad agents, referencing a paper by a team from Princeton.
The paper presents a challenge benchmark of 307 competitive programming problems from the USA Computing Olympiad.
GPT 4's initial pass rate on the benchmark is only about 8.7%, which is low compared to other benchmarks like HumanEval and MMLU.
Inference optimizations are discussed, which improve the performance from 8.7% to an average of 20.2%.
The problems in the benchmark are complex word problems requiring advanced data structures and algorithms.
The tutorial aims to push the limits of Large Language Models (LLMs) to see where they break down in reasoning.
Different techniques such as self-reflection and retrieval from semantic or episodic knowledge are explored to improve model performance.
The best-performing retrieval type, episodic knowledge, is chosen for implementation due to its complementarity with self-reflection.
A three-part tutorial structure is outlined, starting with a zero-shot reflection agent, then adding episodic memory, and finally introducing a human interrupt.
The reflection agent corresponds to a 12.38% pass rate, which is better than the base zero-shot agent.
The addition of episodic memory as a form of retrieval improves logical performance in the model, achieving about 22.2% on the benchmark.
The human interrupt allows the user to guide the agent towards the correct answer, which is not benchmarked but is pragmatic for application designs.
The tutorial demonstrates the creation of a state machine using LangGraph to control and guide the agent within a task domain.
LangSmith is used for tracing and debugging each step of the agent's operation to catch mistakes and understand the process.
The state graph in LangGraph is defined by nodes as units of work and edges as control flow, with the state holding the agent's messages list and test cases.
The solve node composes a prompt with an LLM to generate a candidate solution, while the evaluate node tests the solution against test cases.
The tutorial shows how to implement a human-in-the-loop interface to allow for user guidance and potentially achieve a better outcome.
The final success of the agent in solving a silver level question demonstrates the potential of combining autonomous agents with human input.
The tutorial concludes by emphasizing the current limitations of LLMs and the potential of hybrid systems for improving performance on challenging problems.
Transcripts
Will Fu-Hinthorn: Hi all, this is Will from LangChain.
Let's build computing Olympiad agents together.
So about a week ago, this team out of Princeton came out with a
paper called Can Language Models Solve Olympiad Programming?
It was done by folks such as Chen Xie and Shunyu Yao, who you might
recognize from the ReAct paper, Tree of Thoughts paper, and Reflexion papers.
This paper really has two interesting components to it.
on one hand, it's a data set paper.
They come out with this challenge benchmark of 307 competitive programming
problems from the USA Computing Olympiad.
And they showed that GPT-4 only has about an 8.7 percent pass rate
when trying to do this with a simple zero-shot ReAct agent framework.
This is in contrast to some of the existing benchmarks, like HumanEval,
MMLU, that are mostly saturated by this crop of language models.
They then show some inference optimizations, basically prompting
techniques or systems-engineering types of approaches, to improve the
performance on average from that 8.7 percent up to 20.2%.
And we'll go more in detail on each of those techniques as we
go throughout this tutorial.
Let's get a sense for the difficulty of the types of problems in this benchmark.
Alright, so here's an example question.
You can see they're mostly word problems that require you to identify the
underlying mathematical problem you need to solve, use advanced data structures and
algorithms, and compose them in a creative way to come up with the correct solution.
It also has to be done and implemented in a way that solves
it within a given time limit.
As you can see from this diagram, there's a lot of different usage of
sets and other types of data structures there.
The questions are challenging.
There's a reason why it's called an Olympiad.
What I find interesting about this benchmark is that it really
pushes LLMs to the limit, so you can see where it breaks down.
I think when we're using them in normal life, you often start
to anthropomorphize them and think that they're reasoning.
When you push them to this extent, you can see when they start doing things
that look interesting, close to correct, but don't have this logical property.
So as I mentioned before, when they first run GPT 4 on each of these
problems, it gets a really low pass rate.
They later show a number of inference techniques that
they could do to improve it.
Some of these include self reflection and some of these involve
retrieval, either from semantic knowledge or episodic knowledge.
They do a lot of experiments to show what types of retrieval actually
improve the performance of the model.
We're going to implement the best performing retrieval type, this episodic
knowledge type, since it seems to be very complementary to self reflection and
works well within our tutorial structure.
Let's check out our LangGraph docs to see how we're going to
implement this in the tutorial.
Alright, so for the remainder of the video, we're going to be creating
an agent to solve these types of competitive programming problems.
We'll break it down into three parts, following the paper's structure and making
agents that are increasingly capable of solving these advanced questions.
In the first step, we'll implement the reflection agent, the zero shot
agent that does self reflection.
This corresponds to here, this block in our graph that we're going to be creating.
It's going to have two simple nodes and just run in a loop until it can
solve the problem correctly or runs out of time.
This is roughly analogous to the Reflexion agent they created,
which gets about 12.38%. So again, better than the base zero-shot
agent, not as good as what they can get overall.
In part two of the tutorial, we're going to implement retrieval as a form of
episodic memory, what the paper calls it.
This is part two here, where we'll be then retrieving some high quality
examples to include within the prompt and hopefully induce better
logical performance in the model.
This is analogous to this section here that gets about
22.2 percent on the benchmark overall.
In part three, we're going to add a human interrupt.
So we're going to let us, the user, actually weigh in on the
answer and help guide the agent to come to the correct answer.
You may consider this cheating, and in fact we aren't actually going to
benchmark this on the whole dataset, and neither do the authors.
But in a lot of application designs, what you actually
want is the best outcome overall.
And so copilot-type setups are a great pragmatic approach to getting
a better overall outcome that perhaps neither the human alone nor the
agent alone could reach.
A big theme throughout this is that autonomous agents really aren't quite
there yet especially when you're just prompting in a simple zero shot manner.
But you can create state machines using frameworks like LangGraph to
really control and guide it within your task domain and hopefully
get better outcomes overall.
Let's download this notebook and then run through it together.
All right, now we're ready to start running through the tutorial.
We've opened this up in Jupyter.
And we're going to install some prerequisites here.
It's basically LangGraph.
We're going to add some tracing here with LangSmith.
The agent is going to be powered by Anthropic's Claude model, in our case.
And we've got a couple of other things to pull things from the hub.
We'll set our environment.
In our case, it'll be Anthropic's API key to connect to their API.
And then we'll also configure tracing.
There's a lot of steps that go on here, and each of the programs can
be quite long, so visualizing it in a notebook can be pretty cluttered.
I'd recommend using LangSmith just so you can debug each step
and see exactly what's happening.
It's easier to catch mistakes, easier to see what's going on.
We'll then fetch the data we've stored in a Google Cloud bucket
and load it into memory.
And then finally, the utility here is to run test cases.
I'll note that this is executing code locally, so proceed with caution.
There's an inherent risk that the LLM generates malicious code, or
something that causes an out-of-memory error or the like.
Here's an example run of the test case runner on a "print hello world" program.
If it passes, it returns "passed".
If not, it returns "wrong answer" along with the expected output.
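The transcript only describes the runner at a high level, so here's a minimal sketch of what such a local test-case runner might look like (the function name `run_test_case` and the exact return strings are illustrative, not the tutorial's actual utility):

```python
import subprocess
import sys

def run_test_case(code: str, test_input: str, expected_output: str,
                  timeout_seconds: float = 4.0) -> str:
    """Run candidate code in a subprocess and compare stdout to the expected output.

    NOTE: a subprocess gives only weak isolation; untrusted, LLM-generated
    code should really run in a proper sandbox.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            input=test_input,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return "timed out"
    if result.returncode != 0:
        return f"error: {result.stderr.strip()}"
    if result.stdout.strip() == expected_output.strip():
        return "passed"
    return f"wrong answer. Expected: {expected_output!r}, got: {result.stdout.strip()!r}"

print(run_test_case('print("hello world")', "", "hello world"))  # → passed
```

The timeout matters here because, as the problems state, solutions must also finish within a given time limit.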
All right, so now I can finally start defining a graph.
In part one again, we'll be doing this simple zero shot agent with reflection.
Note in the paper they used an explicit Reflexion prompt.
We've adapted it a bit and then prompted our agent here to be reflecting on its
output here when it's doing tool calling.
This one's going to be relatively rudimentary.
It corresponds to that agent that gets about a 12 percent
pass rate on the benchmark.
And so for our first question, we kind of expect it not to
pass, but I think that's okay.
We've got some other tricks up our sleeve.
Now the next part may be review if you've already built with LangGraph,
but I want to review it anyways, because I think it is important.
The main primitive in LangGraph is a state graph.
It's how you define a state machine.
Basically the nodes define the units of work.
And the edges define the control flow.
Once a node completes, the edge defines which node to pass to next
in order to continue operation.
In the simplest case, which is ours, you essentially have two nodes, and it
loops back and forth until it reaches some end state and then outputs.
The final piece of this that's really important is the state.
Of course, this defines the interface for all the nodes, so each node
receives the state as an input and then returns an update to the state.
That could be an entire state itself, or just some subset
that then the graph is able to incorporate into the previous state.
In our case, the main aspect of this state is going to be this messages
list that we're annotating as being append only using this function,
using Python's annotated syntax.
This basically keeps the agent scratchpad as it generates a
candidate and then receives a tool response, and then continues to
iterate and try to improve the answer.
The rest of the state here holds the test cases essentially, and the runtime limit
to configure how the evaluate node runs.
The agent itself is gonna ignore this, but the test runner will incorporate it.
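The node/edge/state pattern described here can be sketched in plain Python. This is an illustrative stand-in for the looping structure, not LangGraph's actual `StateGraph` API, and the solve/evaluate bodies are stubs rather than real LLM and test-runner calls:

```python
from typing import TypedDict

class State(TypedDict):
    messages: list      # append-only scratchpad of the agent's attempts
    test_cases: list    # (input, expected_output) pairs for the evaluator
    status: str         # "in_progress" or "success"

def solve(state: State) -> dict:
    # Stub solver: a real node would format a prompt and call an LLM.
    attempt = f"candidate #{len(state['messages']) + 1}"
    return {"messages": [attempt]}

def evaluate(state: State) -> dict:
    # Stub evaluator: a real node would run the test cases.
    passed = len(state["messages"]) >= 2  # pretend the 2nd attempt passes
    return {"status": "success" if passed else "in_progress"}

def control_edge(state: State) -> str:
    # Mirrors the conditional edge: end on success, otherwise loop to solve.
    return "__end__" if state["status"] == "success" else "solve"

def run_graph(state: State, max_steps: int = 10) -> State:
    for _ in range(max_steps):
        # Append-only update of the messages key, then merge the evaluation.
        state["messages"] = state["messages"] + solve(state)["messages"]
        state.update(evaluate(state))
        if control_edge(state) == "__end__":
            return state
    raise RecursionError("step limit reached")  # LangGraph raises a similar error
```

Each node takes the state and returns a partial update, and the conditional edge decides whether to end or loop, just as described above.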
Now that we've defined the state, we'll update our dataset to be in the
right format so we can pass it in, and we can
start defining the good parts.
Remember, this has two nodes.
We've got the solve node and the evaluate node.
So first the solver here is pretty simple.
We're basically taking a prompt and we're going to compose it with an LLM.
So this format into a prompt and then pass it to the LLM.
This bind tools operation is just configuring a schema for the
LLM, so it knows the structure that it should respond with.
In our case, we're going to use this writePython tool,
this writePython schema.
We're basically going to ask it for some reasoning and pseudocode to induce
some chain-of-thought reasoning beforehand.
And then we'll finally have it write all of the Python 3 code
into the code string here.
This just makes it a lot easier to parse on the tail end, so you don't
have to be parsing out raw strings.
I'll run this here and then we'll define a solver.
So we pulled this, this solver prompt from the hub.
It's pretty simple.
I didn't do too much prompt engineering here.
Note that we do have this variable examples that we'll use later in part two.
Right now we're just going to fill this in through partialing with an empty string.
That'll be the placeholder that we then populate with additional
information that we retrieve, but more on that later.
Here's an example run.
We're just going to ask it, how do I get a perfectly random
sample from an infinite stream?
All right, we've gotten the response.
So you can see it generates this thinking tag, and then eventually
outputs things with the reasoning.
You might see the pseudocode, and then eventually the code, and it is doing
reservoir sampling as we'd expect.
It at least has studied some code.
That's a good thing.
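For reference, the reservoir sampling the model arrived at is a standard algorithm (Algorithm R); a textbook implementation, not the model's exact output, looks like this:

```python
import random
from typing import Iterable, List

def reservoir_sample(stream: Iterable, k: int) -> List:
    """Algorithm R: keep a uniform random sample of k items from a stream of
    unknown (possibly unbounded) length, in one pass and O(k) memory."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = random.randint(0, i)     # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10_000), 5)
print(sample)  # five uniformly chosen values from 0..9999
```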
One thing that I like about Claude as opposed to GPT 4 is that it was trained
to output this thinking or prelude text, you can think of it, before
actually doing the tool invocation.
I think there's often this question when using GPT 4 with tool calling of
how to incorporate chain of thought.
And we've found that it sometimes suffers with some of these
more complex tasks as a result.
I think it's quite nice that they've trained the model to output this before
doing the tool invocation so you don't have to be making sacrifices there.
So the second node that we'll define in this loop, this agent
loop, is the evaluate node.
And there's a good amount of error handling and stuff here, but really the
key bit here is we're going to iterate through all the test cases in our state.
Again, recall that each node is going to receive an instance of
the state and return an output, so in this case an updated message.
But it iterates through the test cases, runs them, and then if it succeeds,
it'll update the status in our state.
Otherwise, it's going to format each of these things individually,
and then add the messages.
Once you're there, you can create the graph, and we'll visualize it.
So, again, this corresponds to that graph we had above, it's a little more simple.
We put the initial problem in here, it tries to generate a solution, tests it
out, and then we enter this control edge.
Here's the control edge: you can see it just also receives the state and then
checks that state to see if it succeeded.
It then returns either this end string, a string with double underscores,
which just tells the graph that it doesn't need to continue looping,
or it says go back to the solver node.
That's how we define it here with the conditional edges.
Let's see the first question.
It's a question about Farmer John.
Looking at it, you know, he's got some productivity for the day,
and Bessie, you've got the cow.
Seems pretty complicated.
It's got some number of sample inputs and sample outputs as part of the question.
Let's try running our agent here.
Okay.
So again, this is the simplest version of the agent that we're
going to be building today.
I fully expect it not to work, honestly.
And the way LangGraph typically indicates this is through
a graph recursion error.
Basically by default, the graph has a limited number of steps.
You can configure this.
We show this in the docs elsewhere.
I won't go into detail here, but if it surpasses the maximum number of configured
steps, it'll raise a recursion error.
Let's wait for this to happen.
Continue to populate.
All right.
It looks like, as we said, it hit the recursion limit.
It wasn't able to actually solve it.
You can see each of these steps below, and we'll actually jump over to
LangSmith to check out the trajectory.
One second.
All right, so we've checked out this trajectory here.
You can see it going through the loop as we defined it.
You have the prompt and LLM, and then you have the evaluate node, and
it goes in this loop and continues until it eventually has this error.
I always like to jump into one of the later LLM calls because it does collect
this full history of messages and you can see exactly what it's doing.
So initially it says: solve this problem, I'll do this, the key insight here
is such-and-such, and it tries to write out the pseudocode and thinking and
then generates it all.
And then in the second one it says the current solution times out on a
larger test case, likely because it iterates through all possible lanes.
So it's actually trying to self-correct here, but it's unable to; it
doesn't have a lot of good examples of solving this type of problem
in its memory, I guess.
And so, as you can see, it goes through quite a number of things.
So it wasn't able to get it correct.
That's all right.
The paper presents at least one more automatic improvement that we
can incorporate, and then also presents the idea of more
of a copilot type of scenario,
But in part two, let's jump into the memory and retrieval optimization.
All right, so back in the notebook, in part two, we're going to be
implementing this few shot retrieval optimization that the paper proposes.
And the authors call this an episodic memory because it's retrieving
these outputs from the other question answer pairs in the corpus.
So if you pretend that the algorithm, or that the agent, has already
solved all these other problems, it could then recall this and use
this for solving later problems.
It's kind of an interesting framing, and kind of in contrast to the rule of
thumb, where people tend to talk about RAG and retrieval as a way to improve
knowledge and update knowledge, but not as a mechanism for actually improving
the reasoning capabilities of the agent.
Though since these are extremely well selected, well crafted in domain examples,
this does then align more with few shot instruction and optimizations like that.
That's more of what it is.
The paper also explores a semantic memory where it's retrieving over
textbooks and things, and that does show a brief boost, but doesn't seem to hold
whenever they incorporate that later with the reflection and other things.
So, it seems to be a technique that doesn't really scale quite to the
same extent as this high quality instruction type of data set.
So, we'll skip that for here.
Following the paper, we'll use as the retriever this BM25 retriever.
It's essentially a more old fashioned, non vector based, TFIDF based retrieval
mechanism here that's high quality.
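The tutorial uses an off-the-shelf BM25 retriever; to give a feel for the scoring it does, here's a minimal self-contained sketch of BM25 over a made-up toy corpus (standard k1/b parameters, no stemming or stop-word handling):

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list, k1: float = 1.5, b: float = 0.75) -> list:
    """Score each document against the query with BM25: term-frequency based
    lexical matching with document-length normalization, no vectors."""
    tokenized = [d.lower().split() for d in docs]
    N = len(docs)
    avgdl = sum(len(t) for t in tokenized) / N
    df = Counter()                      # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(score)
    return scores

docs = [
    "dynamic programming over prefix sums",
    "binary search on the answer with a greedy check",
    "graph cycle detection with DFS",
]
scores = bm25_scores("graph cycle detection", docs)
print(docs[max(range(len(docs)), key=scores.__getitem__)])  # → graph cycle detection with DFS
```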
And then to accommodate these steps, we're going to add two more keys
to our state compared to part one.
We're going to add this candidate message that we generate first,
and it's going to be used in the retrieval step.
And then we'll have the examples formatted in strings.
If you remember the prompt at first, it had that examples template.
We'll finally be populating that here.
And then again, recall that this episodic memory happens before our agent loop.
This part is going to be untouched here.
But here we're going to be having the retrieval step.
And then we'll still ignore this and save this for part three.
So once we define our new state, we can define the solver.
This is mostly a repeat from before, except we're going to have a little if
statement to populate the candidate key if it's still at that first stage.
So we're going to make this draft solver and solver; they're pretty much
identical, except the draft solver of course doesn't have the retrieved
examples yet.
Then, to make sure that we avoid cheating by putting the
actual answer to our question in the retriever, we're going to separate this
out into train and test corpora, and then we'll create the retriever here.
Finally, it's time to define the retrieve node.
So as before, it receives the state here as all the nodes do, and
then returns this updated state.
So it's going to update the examples key in particular.
And then within this, it calls the retriever here.
So retriever.invoke picks out the top k there, and then formats
this into a string that we're going to be inserting in the prompt.
You'll notice that we add this runnable config here in the graph
that we're defining; we're going to let you configure, whenever you
invoke the actual agent, the number of retrieved examples you'll have.
One way to do that is through these configurable params in the config, which
is always the second positional argument.
One more thing to note about this retriever setup that I think is
quite interesting, is that we're retrieving, and we're treating the
candidate program as the query, rather than the initial question.
This is kind of similar to techniques you might have come across, such as HyDE, or
some of these other, like raft, or other types of indexing strategies for RAG.
The observation here is that the distribution of queries is different
than the distribution of documents that you're trying to retrieve from.
And so you either want to be creating hypothetical queries from the documents
that will better align with the type of queries that we're going to be
putting into the system, or we want to map from the queries to what you'd
expect the document to look like and then maybe retrieve from that.
And then there's some other variants there, but basically you're saying
the type of text and the type of words that we're going to put in these two
things is not going to be the same.
And so you get better results if you can try to translate them.
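As a toy illustration of that mismatch, with made-up strings rather than the tutorial's data: token overlap between a word-problem question and a stored solution program is tiny, while a generated candidate program overlaps much more, which is why the candidate makes the better query:

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

# A word-problem question vs. a stored solution document from the corpus.
question = "farmer john wants the maximum total productivity over consecutive days"
stored_solution = "import sys prefix = [0] for x in values: prefix.append(prefix[-1] + x)"

# A candidate program generated from the question, used as the query instead.
candidate = "import sys values = list(map(int, sys.stdin.read().split())) prefix = [0]"

print(jaccard(question, stored_solution))   # near zero: different vocabularies
print(jaccard(candidate, stored_solution))  # higher: both look like code
```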
Finally, it's time to build the graph.
So again, most of this is the same here.
You see solve, evaluate, and all this stuff is untouched.
The things that we've added here really are this draft node at the
beginning, where we're putting in that solver, and the retrieve node here.
So we're setting the draft node as the entry point, retrieving, and
then we're going to always progress in a directed edge from draft to
retrieve, and then retrieve to solve.
And then, again, we'll create the loop of solve to evaluate, and then from evaluate,
either go to the end or back to solve.
So we're creating that.
Here's the visualization, in case that's easier to see.
So again, the rest of this is all the same, we've just added
these two steps at the beginning.
Let's try it out.
Since we added a checkpointer here (just ignore that for now), we're
going to say: retrieve three of these examples
from the corpus and pass those in.
This is also going to take some time, so bear with us.
All right, looks like our graph finished already.
You can see this is a little bit truncated.
Let's jump over to LangSmith to see what exactly was done.
But you can see that from the state that we've got, you can see
that it was a success this time.
All right, now that we've jumped over to the LangSmith trace, you can see
it's only a few steps this time, so that's great.
And following our graph structure, we had the draft node here, which again,
you put in the initial question, the system prompted the question, and it
outputs the initial answer to the problem.
We retrieve some examples from it.
So again, see the queries, this candidate program that we talked about, and
it retrieves other examples from the corpus, and then we pass that in here.
So you can see, now those examples are formatted here in the system prompt
for the solver.
We don't have the initial candidate program in here, because we've
saved it in a different key in the state, but then it tries to generate an answer.
This time it's correct, and the test cases are successful.
So, it's great.
I'll jump back to part three because we see it solves this bronze level
question, but how well does it solve some of the more difficult
challenging ones in this benchmark?
Alright, so jumping back to the notebook, let's test it on a
harder, silver-level question.
So we've got this from the dataset.
You can see the question here.
It's a river cruise one.
Basically you're trying to detect cycles and then simulate
different steps of it as well.
It gives a couple of sample inputs here, but it's a much more challenging
question than the first one here.
We'll format that and then run it.
And we'll see how it does.
I fully expect this not to work.
It may.
But I expect this to fail.
Because it's just a more challenging problem.
And these LLMs, while they have been trained on a lot of code and a
lot of reasoning-type problems, whenever there are new ones, they
just kind of struggle to compose them in creative ways.
So some of these techniques can help, but I think we're running into some of
the limits of the reasoning capabilities of agents as they stand today.
Alright, so our optimized graph has finished, and we got
another graph recursion error.
It wasn't able to correctly answer the problem in the allocated number of steps.
That's okay.
In fact, we expected it.
These are extremely challenging problems, and they push LLMs, at least as they
are trained and designed today, to the limits of their reasoning capabilities.
It requires some novel combination of algorithms and data
structures in challenging ways.
The paper explores then one final inference time optimization, which really
gets us from the realm of autonomous agents to the realm of co pilots.
So that they can benchmark it, they restricted human involvement to simple
guidance and prodding without revealing any parts of the answer or anything.
But when you're building an actual application, you often want to
optimize for the end-user experience and maximize the
chance of accomplishing the goal.
If you're going to be creating an application where a user is in the
loop, you want to give them a nice ability to provide guidance
whenever and wherever they want.
LangGraph makes this pretty easy.
So for the sake of this tutorial, and in part 3, we're just going to add a generic
human in the loop interface to our agent.
And we're going to insert it right here.
So the agent graph is going to be structured exactly
the same way as part two.
We'll insert the problem, the agent will generate a candidate program,
the program will retrieve similar high quality examples from this corpora of
semantic memory, or episodic memory, and then the agent tries to solve
it, it generates the program here, then runs the test cases on them.
What we're changing in part three is that we're going to interrupt here.
A human is then allowed to look at the state of the graph
at this point, and perhaps optionally add a message saying to consider some
alternate route, or to look into a specific part of the generated program, etc.
We can then resume execution at any time, since LangGraph lets you just persist
this in a checkpoint, and continue trying.
You can continue this loop and continue to intercede and provide feedback as
it goes through, and hopefully prevent the LLM from falling into these
local optima, these pitfalls, where it's just looping through and unable
to actually accomplish the real task.
Once you collaboratively come to the final answer, the
graph can finally finish executing.
In theory, this type of design is only restricted by the quality or
capability of the user involved, since we can really provide the correct
answer or any other type of feedback, and the LLM will be able to
synthesize it and incorporate it.
So let's jump in here and create this human-in-the-loop agent for
solving computing Olympiad problems.
This code block here is exactly the same as in part two.
We get our checkpoint right here, and we're just going to do that in memory.
We've got our state graph, we're using the exact same state here.
We've got the prompt and LLM.
We create this draft solver, add it to the node.
We set that as the entry point.
We create the retrieve node.
We create the solver node.
And the evaluation node to run the test cases.
Then we start connecting them.
So we add an edge for draft to retrieve, retrieve to solve.
Solve to evaluate and then we add these conditional edges to
define the conditional looping.
So we say, once you run the test cases we'll either go back to the
solver or we'll finish if we succeed.
And we'll create the check pointer as well.
The one different thing in part 3 compared to part 2 is that we're
going to add this interrupt after the evaluation node.
So basically, before it goes to a human step, we're going
to tell the graph: hey, stop,
and allow the human or any other process to modify the state.
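The interrupt/resume flow can be sketched in plain Python. This mimics the shape of interrupting after a node, updating state with a human message, and resuming from a checkpoint, but it is an illustrative stand-in, not LangGraph's actual API, and the evaluate stub simply passes once feedback exists:

```python
checkpoints = {}  # thread_id -> saved state (stands in for the SQLite checkpointer)

def evaluate(state: dict) -> dict:
    # Stub: pass once any human feedback has been added to the scratchpad.
    ok = any(m.startswith("human:") for m in state["messages"])
    return {"status": "success" if ok else "in_progress"}

def run_until_interrupt(thread_id: str, state: dict) -> dict:
    """Run solve -> evaluate once, then stop ('interrupt after evaluate')."""
    state = dict(state)
    state["messages"] = state["messages"] + ["ai: candidate solution"]
    state.update(evaluate(state))
    checkpoints[thread_id] = state          # persist so we can resume later
    return state

def update_state(thread_id: str, values: dict) -> None:
    """Mimics graph.update_state: merge human-provided values into the snapshot."""
    saved = checkpoints[thread_id]
    saved["messages"] = saved["messages"] + values.get("messages", [])

def resume(thread_id: str) -> dict:
    """Mimics invoking the graph again with input None: reload and continue."""
    return run_until_interrupt(thread_id, checkpoints[thread_id])

state = run_until_interrupt("t1", {"messages": [], "status": "in_progress"})
update_state("t1", {"messages": ["human: handle the cycle-detection edge case"]})
state = resume("t1")
print(state["status"])  # → success
```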
Let's visualize this here.
As you can see, the graph looks exactly the same as in part
two, and we'll start running it.
So again, this is just going to continue executing until it reaches that interrupt.
All right, so our graph has stopped executing.
You can see the current state by looking at this snapshot using the config, and
you can see again the problem there.
Note that it doesn't raise a graph recursion error, but it still got
an incorrect submission, as we expected.
Since we added the interrupt, it actually stops this loop, and then
we can resume it. So we can look again.
Here's the silver problem that we were looking at before; the agent
has been unable to solve it so far.
We're going to look at its current candidate solution.
So this is again printed out from the agent right now.
Looks okay.
Maybe a little simple.
Definitely doesn't handle all of the edge cases.
And then we can look at this test here, because that's the last tool message here.
Incorrect submission.
It actually got 8 out of 10, so pretty close.
Let's give it some recommendations here as a human message.
And then we'll check to make sure that it's actually reflected in the state.
So you can see we now have this human message, which we added by
calling graph.update_state and providing that config there,
so it knows which snapshot to update.
And then you have the human message here. The way we resume
is by passing in None as the input values.
And then, since we're using the same config, it knows to load the
current state from the checkpointer that we've compiled into the graph.
Again, we use the in-memory SQLite checkpointer here for the sake of the
tutorial, but there are a lot of implementations you can use to connect
to your own storage architecture. This is going to take a
little bit of time, so again, we'll resume whenever it's done executing.
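The update-then-resume pattern just described can be modeled like this. The real calls are `graph.update_state(config, {"messages": [human_msg]})` followed by `graph.stream(None, config)`; the dict-based checkpointer below is an invented stand-in, keyed by a thread_id the way LangGraph configs identify which checkpoint to load:

```python
# Toy checkpointer: thread_id -> saved state snapshot. All names and
# message contents here are invented for illustration.
checkpointer = {}

def save(config, state):
    checkpointer[config["thread_id"]] = dict(state)

def update_state(config, values):
    """Merge new values (e.g. a human message) into the saved snapshot."""
    snapshot = checkpointer[config["thread_id"]]
    snapshot.setdefault("messages", []).extend(values.get("messages", []))

def resume(config, inputs=None):
    """Passing inputs=None loads the checkpoint instead of starting fresh."""
    assert inputs is None, "resuming: no new inputs, load from checkpoint"
    return checkpointer[config["thread_id"]]

# Usage: pause, inject human feedback, resume from the same thread.
config = {"thread_id": "usaco-silver-1"}
save(config, {"messages": ["ai: incorrect submission (8/10 tests)"]})
update_state(config, {"messages": ["human: consider the empty-field case"]})
state = resume(config)
```

The key design point mirrors the transcript: the same config (thread_id) is what ties the update and the resume to one checkpointed execution.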
Alright, so in our case, it actually was enough to get it to succeed.
I had this other code here to try to prompt it into the right position, because
occasionally it doesn't succeed even after that first bit of human feedback.
But in our case, it does.
As you can see, you can list through all of the states here, so I was just
getting the most recent checkpoint from this graph's execution,
and we see that it's a success.
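That history scan can be sketched as follows. In LangGraph itself, `graph.get_state_history(config)` yields snapshots newest-first, so the most recent checkpoint is the first item; the snapshot shape below is invented for illustration:

```python
def most_recent(history):
    """Return the newest snapshot from an oldest-first list.

    LangGraph's get_state_history yields newest-first, so there the
    equivalent would be next(iter(history)).
    """
    return max(history, key=lambda snap: snap["step"])

# Invented example history for one graph execution.
history = [
    {"step": 1, "status": "incorrect submission"},
    {"step": 2, "status": "incorrect submission"},
    {"step": 3, "status": "success"},
]
final = most_recent(history)
```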
You can actually check out the LangSmith trace again here;
I'll jump back to see it run.
And we see that, again, we passed in null.
If you remember from the loop in the notebook, it loads the checkpoint
and goes back to the solver, because that's the next node slated
to be executed in that graph.
It already ran the draft and retrieve steps and all the rest.
You can see the full list of messages to see that yes, indeed, it has been fetching
these things from memory and includes this recommendation that we've added.
We've put this all in a notebook, but again, you can put any sort of UI
on top of the LangGraph implementation and allow the user to interact
with your copilot in arbitrary ways.
So the AI is able to then incorporate this breakdown into an updated response
that then passes all of the test cases.
So I'd say it's a success.
Thanks.
Alright, so that brings us to the end of this tutorial.
As we saw, LLMs as they're trained today aren't, on their own, super
great at solving the challenging type of reasoning problems posed
by Olympiad programming questions.
However, through some prompting techniques and better systems
design, you can greatly improve the average performance,
from a low point of under 9 percent to above 20 percent.
And when you're building real applications to solve challenging problems,
you can create these sorts of human-in-the-loop interfaces easily with
LangGraph, so that you can reach a better overall outcome compared to
the agent alone or the human alone.
I think these general techniques are fairly expansive and
can be applied in a lot of domains.
So you'll recall, first we started with a zero-shot agent with reflection.
It basically prompted the agent to look at the test case results and
the current solution, and then try to incorporate those results and
feedback into an updated candidate, to eventually arrive at the right solution.
We saw how even on a bronze level problem, this doesn't always
work, and so we then added in an additional optimization for retrieval.
This retrieval, which the authors framed as a sort of episodic memory,
allowed the model to fetch these really high-quality examples from the
corpus and use them to elicit a somewhat better output that follows
a similar design and approach.
This can induce better reasoning because of these few-shot examples.
We saw that that was able to solve the bronze level question, but it
failed on the silver level question.
So then we added in this human-in-the-loop interface with the
interrupt-after, which allowed us to go in and modify the checkpointed
state of the graph to guide the agent to a proper solution to this
harder problem.
As you can see from all of these steps, autonomous agents are really cool.
They're not quite there yet on all of these challenging problems, but
through better engineering, through using state machines and the like,
you can arrive at better-designed systems that are actually capable of
doing some pretty impressive work.
I'm excited by this new data set because it is a lot more challenging
and shows the cracks in the abilities of our current types of language models
as they are today, while also showing how these hybrid systems, these
neuro-symbolic approaches, can be really powerful in improving performance.
I'm excited to see other people come out with better systems that
surpass those presented by the authors, and hopefully get to the point
where they can solve all of these types of questions without resorting
to much, much larger models.
That's all that we have for our tutorial today.
If you have any questions or comments, feel free to leave
them in the comments below.
Check out the links in the description as well to see the
code and run it for yourself.
And let us know what other types of tutorials you'd like to see that would
be helpful in implementing your own agents and chatbots and assistants.
Until next time, this is Will, and hope you have a great day.
Bye!