Agent Response | LangSmith Evaluation - Part 24
Summary
TLDR: Lance continues the series on evaluating language-model agents, explaining the concept of tool calling and how to evaluate it. Tool calling means the model returns a payload for invoking a specific tool. He builds an agent with LangGraph and shows how the tool selection and payload are used in a loop. As evaluation approaches, he proposes evaluating the final response, a single tool call, or multiple tool calls. Using an example agent backed by a SQL database, he makes these evaluation techniques concrete and compares agents with different architectures.
Takeaways
- 🧠 An agent has core components such as tool calling, memory, and planning.
- 🔧 Tool calling is the ability of an LLM (Large Language Model) to return a tool's payload, which is then used to perform some operation.
- 📚 Lance will explain agent evaluation methods in detail as part of the LangSmith evaluation series.
- 🛠️ Agents use tool calling, typically inside a loop, to accomplish tasks.
- 📈 LangGraph can be used to build agents, expressing their structure with nodes and edges.
- 🔎 Evaluation can take different approaches: the quality of the final response, tool selection at a single step, or tool calls across multiple steps.
- 📝 Evaluation ranges from the simple approach of checking only the final response to more detailed approaches that inspect tool calls step by step.
- 🗂️ Lance built a SQL agent on Chinook DB, a sample SQL database, and walked through how it works.
- 🔑 The evaluation process consists of creating a dataset, defining evaluators, running the evaluation, and analyzing the results.
- 📊 Evaluation results are visualized with correctness scores, allowing agents with different architectures to be compared.
- 🛑 Areas for improvement, such as query correctness and result validity, are identified through the agent's interaction with the database.
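The tool-calling idea in the takeaways above can be sketched in plain Python with no LLM at all: the model's only job is to emit a payload (tool name plus arguments), and the application code dispatches it. Everything here is illustrative; the hard-coded `llm_output` stands in for what a model would return.

```python
import json

# A tool-call "payload" as an LLM would return it: just structured text
# naming a tool and its arguments -- the model never runs anything itself.
llm_output = '{"name": "magic_function", "args": {"input": 3}}'

# The application, not the model, parses the payload and dispatches it.
tools = {"magic_function": lambda input: input + 2}

call = json.loads(llm_output)
result = tools[call["name"]](**call["args"])
print(result)  # -> 5
```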
Q & A
What is the agent Lance introduces?
- The agent Lance introduces is a language model extended with the ability to call tools. It returns a tool selection and its arguments, and can invoke tools based on user input.
What is the concept of a tool call?
- A tool call is the process by which a language model (LLM) returns the payload for a specific tool so that the tool can be executed. The LLM is a string-to-string mapping: it infers the tool selection and its arguments from the input.
What does the magic function tool example do?
- The magic function tool performs a simple operation: it adds 2 to the input number. It is used as an example of a tool call, showing how an LLM invokes this tool.
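As a minimal sketch of the magic function example, here is a tiny stand-in for a tool decorator and registry. LangChain's real `@tool` decorator builds a structured tool; this toy version only registers the function by name.

```python
# A minimal stand-in for LangChain's @tool decorator: register plain
# functions by name so a tool call can be resolved later. Illustrative only.
TOOLS = {}

def tool(fn):
    TOOLS[fn.__name__] = fn
    return fn

@tool
def magic_function(x: int) -> int:
    """Apply the magic function: add two to the input."""
    return x + 2

# Simulated tool call chosen by the model for "what is magic_function(3)?"
print(TOOLS["magic_function"](3))  # -> 5
```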
What matters when building an agent with LangChain?
- When building an agent with LangChain, the key is to create a graph structure with nodes and edges that connects the LLM and the tools. A tools-condition node and a tool node parse the LLM's response and invoke the appropriate tool.
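A rough sketch of that loop, with LangGraph's machinery replaced by plain functions. The scripted `assistant` stands in for the LLM, and the node and tool names are illustrative.

```python
# Sketch of the agent loop: the assistant either returns a tool call
# (a dict) or a final string; the condition routes accordingly.
def magic_function(x):
    return x + 2

def assistant(messages):
    # Scripted stand-in for the LLM: call the tool once, then answer.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "magic_function", "args": {"x": 3}}
    return f"The result is {messages[-1]['content']}"

def tools_condition(response):
    # Route tool calls to the tool node; plain strings end the loop.
    return "tools" if isinstance(response, dict) else "end"

messages = [{"role": "user", "content": "what is magic_function(3)?"}]
while True:
    response = assistant(messages)
    if tools_condition(response) == "end":
        break
    result = {"magic_function": magic_function}[response["tool"]](**response["args"])
    messages.append({"role": "tool", "content": result})
print(response)  # -> The result is 5
```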
What kinds of agent evaluation methods are there?
- Agent evaluation methods include evaluating the final response, a single tool call, or multiple tool calls. Each method assesses a different aspect of the agent.
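Those three evaluation styles can be sketched as toy evaluator functions. The names and data here are illustrative, not LangSmith APIs.

```python
# The three evaluation styles, as toy evaluators.

def eval_final_response(run_output, reference):
    # End-to-end: compare only the final answer.
    return run_output == reference

def eval_single_tool_call(tool_call, expected_tool):
    # Single step: did the agent pick the right tool for this input?
    return tool_call["name"] == expected_tool

def eval_trajectory(tool_calls, expected_sequence):
    # Multi step: did the agent call the tools in the expected order?
    return [c["name"] for c in tool_calls] == expected_sequence

calls = [{"name": "magic_function"}, {"name": "web_search"}]
print(eval_final_response("5", "5"))                             # -> True
print(eval_single_tool_call(calls[0], "magic_function"))         # -> True
print(eval_trajectory(calls, ["magic_function", "web_search"]))  # -> True
```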
How was the dataset created in the SQL agent example?
- In the SQL agent example, a dataset of input/output pairs was created, and evaluation is performed against answers derived from the SQL database. The dataset consists of questions and their corresponding answers.
What is the reference answer used in the evaluation process?
- The reference answer is the correct answer against which the agent's response is compared. The evaluation checks whether the agent's response matches the reference answer and measures accuracy.
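A minimal sketch of reference-answer comparison over a small dataset, using exact string match for simplicity (the video uses an LLM grader). Both the questions and `fake_agent` are illustrative stubs.

```python
# Compare agent answers against reference answers over a tiny dataset.
# A real setup would use an LLM-as-judge grader, not exact string match.
dataset = [
    {"input": "How many albums does Led Zeppelin have?", "reference": "14"},
    {"input": "Which country's customers spent the most?", "reference": "USA"},
]

def fake_agent(question):
    # Stand-in for the SQL agent: answers one question, misses the other.
    return {"How many albums does Led Zeppelin have?": "14"}.get(question, "unknown")

scores = [fake_agent(ex["input"]) == ex["reference"] for ex in dataset]
accuracy = sum(scores) / len(scores)
print(accuracy)  # -> 0.5
```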
What is the multi-step response SQL agent?
- The multi-step response SQL agent encodes the tool-calling path more explicitly, invoking tools in a specific order. This yields an agent with a more specific architecture, which achieved higher accuracy on the eval set.
How did the SQL agent improve on the eval set?
- By introducing a more explicit architecture, accuracy improved from the initial model's roughly 53% to 67%. This was achieved by controlling the tool-calling steps in finer detail.
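The architecture comparison can be sketched with per-example boolean scores aggregated to accuracy. The score lists here are made up, chosen only to roughly match the reported 53% and 67% figures.

```python
# Comparing two agent architectures on the same eval set: per-example
# correctness scores, aggregated to accuracy. Scores are illustrative.
baseline_scores  = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1]  # ~53%
multistep_scores = [1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0]  # ~67%

def accuracy(scores):
    return sum(scores) / len(scores)

print(f"baseline:   {accuracy(baseline_scores):.0%}")   # -> baseline:   53%
print(f"multi-step: {accuracy(multistep_scores):.0%}")  # -> multi-step: 67%

# Per-example comparison view: improvements and regressions vs baseline.
improvements = sum(m > b for b, m in zip(baseline_scores, multistep_scores))
regressions  = sum(m < b for b, m in zip(baseline_scores, multistep_scores))
print(improvements, regressions)  # -> 3 1
```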
What can you do with LangSmith's evaluation tooling?
- With LangSmith's evaluation tooling you can evaluate the correctness of an agent's final response or its tool calls, and compare the performance of different agent architectures.
Outlines
🤖 Agent fundamentals and evaluation methods
Lance explains agent evaluation as part of the LangSmith series. He covers what an agent is and how to evaluate one, introducing the concept of tool calling: the LLM returns the payload for a specific tool, inferring the tool selection and its arguments. Using LangGraph, he shows the agent's structure and an example of using tool calls in a loop. He proposes three evaluation approaches: the final response, a single tool call, or multiple tool calls.
🔍 Response evaluation and building a SQL agent
Lance walks through building an agent backed by a SQL database and the process of response evaluation. Using Chinook DB, he creates SQL tools and defines the agent state and runtime in LangGraph. The evaluation compares the agent's response against a reference answer for each input. Through dataset creation, running the evaluation, and analyzing the results, the agent's performance is assessed.
📈 Multi-step response evaluation
Lance explains multi-step response evaluation for agents. Using the SQL agent example, he defines a more explicit tool-use path and evaluates its effect. On the eval set, he compares the initial agent with the multi-step response agent and shows that the more explicit architecture yields an improvement.
Keywords
💡Agent
💡Tool Calling
💡LangGraph
💡Evaluation
💡LLM (Large Language Model)
💡Decorator
💡SQL Database
💡Dataset
💡Multi-step Response
💡Custom Tool
Highlights
Lance from LangChain continues the LangSmith evaluation series, focusing on agent evaluation.
Explains the basic definition of an agent and how to evaluate one.
Mentions Lilian Weng's blog post, which breaks down an agent's core components: tool calling, memory, and planning.
Explains the concept of tool calling through a simple example.
Shows how to build an agent with LangGraph, demonstrating the use of nodes and edges.
Shows how an agent uses tool calls in a loop and how this plays out in LangGraph.
Discusses three different ways to evaluate an agent: final-response evaluation, single tool call evaluation, and multiple tool call evaluation.
Shows how to create a SQL agent with LangGraph and explains its workflow.
Describes how the SQL agent interacts with a SQL database, using custom tools to check query correctness and result validity.
Shows how to define the agent state and the agent assistant with LangGraph.
Shows how a LangGraph graph represents the agent's logical flow.
Discusses response evaluation of the agent, walking through dataset creation and evaluation with LangSmith.
Compares the performance of different agent architectures experimentally, showing the explicit architecture's improvement over the baseline.
Provides an improved SQL agent example that explicitly encodes the expected tool-calling path.
Shows multi-step response evaluation and compares the effect of different agent architectures.
Emphasizes the importance of using evaluation to guide agent development and optimization.
Provides links to LangGraph and LangChain so viewers can dig deeper into agent building and evaluation tooling.
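The custom query and result checks mentioned in the highlights might look roughly like this. Both are illustrative stubs, not LangChain's built-in SQL tools.

```python
# Sketch of the two custom checks: one validates the query before
# execution, one sanity-checks the result set afterwards.
def check_query(query: str) -> bool:
    # Toy correctness check: accept only SELECT statements.
    return query.strip().upper().startswith("SELECT")

def check_result(rows: list) -> bool:
    # Toy validity check: an empty result set is suspicious.
    return len(rows) > 0

print(check_query("SELECT COUNT(*) FROM Album"))  # -> True
print(check_query("DROP TABLE Album"))            # -> False
print(check_result([]))                           # -> False
```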
Transcripts
hey this is Lance from LangChain we're
continuing our LangSmith evaluation
series now talking through agent
evaluation this is one of the most
requested topics that we've heard so I
want to walk through this carefully
explaining first what an agent actually
is and then how to think about
evaluating it and we'll probably walk
through the different approaches for
evaluation in three different videos so
here's the starting point what is an
agent there's a great blog post from
Lilian Weng that kind of breaks down the
core components tool calling memory
planning all right so that's step one
now how to think about what is this
thing called tool calling here's a
really simple explanation of it all
you're doing is you're having an
llm basically return to you the payload
of a tool that it can use okay so that's
all it's going on let's give an example
here I have an llm I'm going to define a
tool called Magic function and all it
does is takes an input adds two okay so
in LangChain there's this nice little decorator
called tool that allows you to basically
convert this into what we call a structured
tool and you can bind it to LLMs that
support tool calling and here's the key
point when you bind it to the LLM the LLM
then given an input so what is Magic
function 3 it can recognize hey I need
to invoke this tool and here's the key
point this is often the most confusing
part it just returns to you two things
the tool to call and the payload
or the arguments to run the tool now it
doesn't have the ability to
magically run that for you right again
it's an llm it's string to string but
what it gives you is a tool selection
and a payload that's all you need to
really know that's all that tool calling
is so this could be really anything this
could be a really complex tool it's a
simple tool that's a key Point you're
getting the tool name and the tool
arguments out and it's inferring those
from the user input so that's like step
one right and now agents use this
particular tool calling step typically
in a loop so LangGraph is a really nice
way to build agents not the only way but
let me show you an example like here's
how this whole thing would work here's
my agent I structured this in
LangGraph in LangGraph you have nodes
and edges so I have two different nodes
my first node is my assistant that's my
llm it sees my input just like before it
has tools bound and it returns remember
all it can return is just like a string
in terms of like a raw response or a
tool message which is essentially
another string with like here's the tool
I want to use here's the tool invocation
we have this um we have basically what
we call a tools condition node that will
automatically look at the LLM response
for you and determine is it just a
string response out or is it a tool call
and if it's a tool call then basically
all that happens is it takes that tool
call payload and passes it to this other
node called The Tool node which
basically has our two tools it looks up
the right tool based on the specified
name it passes the specified input to
the tool and you get the tool response
out and we send that back to the LLM now
this keeps running until the LLM has decided
okay I'm not going to call a tool
anymore I'm just going to respond
directly and then the tools condition node
just returns so in this particular case
let's walk through it one Loop what's
Magic function 3 llm says okay that
looks like a tool call I need to use my
tool magic function Returns the tool
call with the magic function name and
the argument tool node gets that tool
node says okay run magic function runs
it with this input three you get the
result five out it passes that back to
LLM as a tool message hey the tool output
is five the LLM looks at that and says okay
returns the string the result is five
that's all that happens super super
simple basic agent explanation
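The loop just described, written out as a message trace. Roles and fields here are illustrative, loosely modeled on chat-message conventions.

```python
# Each hop in the loop appends one message, ending when the model
# answers with plain text instead of a tool call.
trace = [
    {"role": "human", "content": "what is magic_function(3)?"},
    {"role": "ai", "tool_calls": [{"name": "magic_function", "args": {"x": 3}}]},
]

# Tool node: execute the requested call and report back as a tool message.
call = trace[-1]["tool_calls"][0]
output = call["args"]["x"] + 2
trace.append({"role": "tool", "content": str(output)})

# Assistant sees the tool output and responds directly, ending the loop.
trace.append({"role": "ai", "content": f"The result is {output}"})
print(trace[-1]["content"])  # -> The result is 5
```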
right so here's the interesting thing
how do you think about evaluating this
thing well here's the way we've broken
it down this is our conceptual guide and
um there's like kind of three different
really intuitive ways to think about
this first is final response this is
just like end to end is it doing what we
want to do so in this case the final
response would be like looking at does
it return five I don't care how many
Loops it goes through does it return
five and you know if you think about
that's just a string to string
comparison typically we can use existing
evaluators like we use for rag answer
response easy stuff right so you're just
looking at end to-end response you don't
care anything about the agent process
okay so that's kind of like one way to
do it now another way to do evaluation
is like digging in a little bit like
looking at a single step of the agent
like here's a good example if I pass
this input does it actually you
know invoke the
right tool does it make the right
decision so for this my output would be
like my evaluation output we'll talk
about this in detail later will be like
the tool name okay so you could just
evaluate that you could say okay if I
have this prompt I expect this tool to
be called and just do an evaluation
there so that's like evaluating one step
of an agent right and then you can also
think about doing kind of the same idea
but for many steps and so let's say this
is a more complex agent it has to invoke
like both of these tools right you could
have your reference be like you know
magic function then web search okay um
and um you know and then in case your
evaluation would basically look for
there's a couple ways you could do
custom evaluator there you could look at
the exact sequence of tool calls you
could look at any sequence of tool calls
you could look at like you know if it's
close count it you know better versus
far in terms of like the trajectory
taken anyway we'll talk about all that
later but the intuition is simply
evaluating a whole selection of
different tool calls so again evaluate
final response evaluate a single tool
call or evaluate many tool calls that's
like the simple minded way to think
about at least three ways to look at
agent evaluation now let's go ahead
after that Preamble let's go to the
first one so here's a notebook I've
defined an agent um so I'm using this
Chinook DB this is a popular SQL database
um and so I'm going to build a SQL agent
okay it's going to be using Chinook DB
um here's like the flow again just looks
exactly like we just saw except in this
case my tool is SQL database um so
here's where I'm just basically going to
pull in uh this is our existing SQL toolkit
okay there's a whole bunch of SQL tools
for working with SQL databases I'm going
to define one or two custom tools like
this check query tool this is going to
basically check if the query is correct
um I'm going to add one other like check
result tool this will check if like the
result from the DB is like not empty does it
make sense so anyway there we go I've
defined some
tools now with LangGraph you define
agent state and um if you want more
details on LangGraph I'm going to link a
few videos that talk all about LangGraph
agents that's kind of outside the scope
of here but this is assuming you kind of
know how to build an agent okay because
you're doing evaluation so this is my
agent State now here is just where I'm
like basically defining my agent what
you might call the agent runnable or
agent assistant this is basically my SQL
prompt okay so this is basically
telling the agent what you're going to
do uh you're you know you're going to be
interacting with a SQL database you're
going to be you know querying it looking
at the response uh and then answering
you know answering the user question and
again there's a whole bunch of detail
you can read The Notebook independently
um a couple graph utilities don't worry
too much about this um and boom let's
look at our graph cool so here's our
agent graph just like we saw before and
again this is just like a you know a
LangGraph representation of what we
actually showed over here uh this is
exactly the same thing I have an
assistant node I have a tool node I
bounce between the tool and the loop
until basically the assistant returns a
string saying here's my answer that's
all we're doing
Simple uh okay so here's a couple
different questions um let's just make
sure I can invoke this thing and that it
actually works um and my config is not
defined um let's see yep so we're going
to go ahead need to pull that up here
boom let's try that and that'll work
cool
uh so we can see that our agent is
running and it is running there we go so
we get an answer and that's fine so the
agent works okay cool and we can also
stream outputs um but let's not worry
about that for now let's move on to the
eval piece so now we have an agent we
built in LangGraph we know it works
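A sketch of that smoke test, with a stub class standing in for the compiled LangGraph agent. The `thread_id` config shape follows LangGraph conventions, but everything else here is hypothetical.

```python
# Invoke the agent once with a question and a config, and check that we
# get a final answer back. FakeGraph stands in for the compiled graph.
class FakeGraph:
    def invoke(self, state, config=None):
        question = state["messages"][-1]
        return {"messages": state["messages"] + [f"Answer to: {question}"]}

graph = FakeGraph()
config = {"configurable": {"thread_id": "1"}}  # shape LangGraph configs use
result = graph.invoke(
    {"messages": ["How many albums does Led Zeppelin have?"]}, config
)
print(result["messages"][-1])
```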
cool now let's talk through that response
eval again this is looking at the output
or response of our agent no problem so
first things first let's build a data set
just like we've done a million times um
I'm going to go open up LangSmith boom and
let's log in here
nice all right very good let's go ahead
and open up let's make sure I'm the
right tenant so I'm in my LangChain
tenant let's look at the data set so
actually I've actually already created
this data set I don't want to create it
again but I'll just show you what it
looks like SQL agent response there's my
data set here are my examples so
basically input output pairs just like
we've seen in the past right uh question
answer just like for RAG eval same thing
but in this case an agent's
going to be doing all the work under the
hood but again our evaluation approach
can follow exactly what we've done with
rag in the past right here's our
examples so we created a data set with
input output pairs question answer now
these answers of course come from our
SQL database so we need an agent to
again interact with the database and uh
do all the work for us so we're just
going to define a chain this is basically
just going to invoke our graph with an
example from our data set so again
if you go look at our data set the data
set is keyed with our inputs under this
input key so again we grab our example's
input key that'll return basically the
question plumb that into our agent no
problem easy now okay now here's what's
interesting our evaluator this is just
like we did in the past for RAG same
thing we're going to be doing a string-
to-string comparison between a
reference answer and our agent answer so
this is literally the same thing we've
done in the past now we kick off our eval so
again let's look at what we're passing
we're passing in our little function
here that's basically just going to run
our agent on each input okay so that's
that uh our data set we defined our
evaluator we defined right here this is
the same as basically a rag evaluator
you know it's going to look at our
reference answer relative to our agent
answer cool so we kick off evaluation
here um that runs we can go over and
look so I can go ahead and actually just
I'll show you where I am so I'm in my um
our data set that's defined here I look
at my most recent experiment so here we
are so again we've kind of seen this
before uh this is basically a you know
one zero is the answer correct or not
scoring um we can actually just look at
some of these runs and kind of break
them out um so you know reference output
Led Zeppelin has 14 albums the agent says Led Zeppelin has 14
albums this is obviously correct that's
great we can look at some that are
incorrect uh the most purchased uh track
of 2013 was hot girl um and this has
some issue this is probably a problem
with the query so again this is what we
could dig into um and you can look
through accordingly so in any case this
is like the simplest type of
evaluation you might think of it's
actually the same as other types like
rag evaluation we've already talked
about where you're basically comparing
the output or what's returned by the
agent to a reference and you don't care
anything about what's happening under
the hood okay so this is just like an
example of end-to-end eval on a simple test
case on a SQL
agent I'm just going to extend our
response evaluation slightly here so
initially remember our agent looks like
this we start it goes to the assistant
node the assistant picks one of several
tools to use the tool is invoked we go
back we continue that in a loop until we
finish right now when we talk about
laying out agents you can also start to
think about laying out those steps you
want the agent to take or those tool
nodes those tool calls very explicitly
as independent nodes so here's actually
a separate SQL agent that we've kind of
just devised just kind of hacking on it
a little bit and what we do here is
instead of running this uh kind of just
as a simple Loop where the assistant
kind of makes the decision at each step
as to which tool to use we encode the
path of tools that we want the agent to
use very
explicitly and so basically it follows
what we did previously your first tool
call which is basically list tables you
get the schema
um you generate a query you check
the query for correctness you execute it
and you go back and if there's an error
in execution then you try again so it's
same kind of idea except we're just
making it a little bit more specific so
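That explicit path can be sketched as a fixed pipeline rather than a free-form loop: list tables, get the schema, write a query, check it, execute, and retry on failure. All step functions here are illustrative stubs; the real agent calls SQL tools against the database.

```python
# The explicit multi-step path as a fixed pipeline. Stub implementations
# stand in for the real SQL tools.
def list_tables():        return ["Album", "Artist"]
def get_schema(tables):   return {name: "..." for name in tables}
def write_query(q, schema):   return "SELECT COUNT(*) FROM Album"
def check_query(sql):         return sql.strip().upper().startswith("SELECT")
def execute(sql):             return [(347,)]

def multistep_agent(question, max_retries=2):
    schema = get_schema(list_tables())
    for _ in range(max_retries + 1):
        sql = write_query(question, schema)
        if check_query(sql):
            rows = execute(sql)
            if rows:  # non-empty result: accept and answer
                return f"Result: {rows[0][0]}"
    return "Could not answer."

print(multistep_agent("How many albums are there?"))  # -> Result: 347
```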
what's cool is with our eval set now
defined I can actually go and I've just run
this notebook and this is checked in
I'll of course share the link here bunch
of code this is again showing the um
the kind of flow of the
agent um we went ahead and ran response
evaluation on that same eval set um
again this is kind of the same eval we
just went through with our updated agent
so we can kick off evaluation here I'm
going to name this evaluation SQL agent
multi-step response um so that runs
and we can go over to LangSmith we can
look at our data set we can see here's
our two experiments this is our initial
agent it got around 53% correct I ran
three
repetitions um and we can see that our
newer agent SQL agent multi-step
response gets to around 67% again across
the three repetitions so this is pretty
cool we can open this up we can go to
comparison view um and this is great so
we can actually see you know we do uh
two improvements and one regression
relative to the Baseline the Baseline
being our initial model and our uh what
we're comparing is of course the
multi-step agent so in any case good
example of how you can um you know use
evaluation to compare different agent
architectures and in this particular
case you can see that a more uh explicit
architecture does a little bit better on
the eval set thanks
More Related Videos
Single Step | LangSmith Evaluation - Part 25
Agent Trajectory | LangSmith Evaluation - Part 26
Llama 3 tool calling agents with Firefunction-v2
RAG Evaluation (Answer Correctness) | LangSmith Evaluations - Part 12
Tool use with the Claude 3 model family
Regression Testing | LangSmith Evaluations - Part 15