Corrections + Few Shot Examples (Part 1) | LangSmith Evaluations
Summary
TL;DR: The video script emphasizes the importance of precision and recall when using language models (LLMs) as evaluators. LLM evaluators are especially popular in RAG (Retrieval-Augmented Generation) pipelines, and the video introduces ways to improve accuracy and fine-tune the evaluator with human feedback. It walks through setting up an online evaluation system and the process by which users correct evaluation results and improve the evaluator based on those corrections. This approach enables LLM evaluations that follow more accurate, human-aligned scoring criteria.
Takeaways
- 📈 The script discusses the problems with using large language models (LLMs) as evaluators and how to correct them.
- 🔍 LLMs work as very effective evaluators, but they sometimes fail to capture human-like nuance.
- 🛠️ Incorporating human feedback can improve an LLM-based evaluation flow.
- 📝 LLM-as-judge evaluators are especially popular for RAG (Retrieval-Augmented Generation) pipelines.
- 📑 The script shows how to set up document grading for RAG and the accompanying online evaluators.
- 🔄 Setting up online evaluators provides real-time feedback even while the application is running in production.
- 📝 It explains how to add evaluation rules, create online evaluators, and improve evaluation using feedback.
- 🔧 Human-provided feedback can be used to fine-tune the evaluator toward more accurate scoring.
- 📈 The script demonstrates setting up both recall and precision evaluators and attaching them to a project.
- 🔗 Improving evaluation with feedback is a powerful way to tune LLM evaluators to be more accurate.
- 📚 Throughout, the script emphasizes how to use LLMs effectively and how to build evaluation systems that incorporate human feedback.
Q & A
What is LangSmith evaluation?
-LangSmith evaluation is an evaluation process built around four main components: a dataset, the application being evaluated, an evaluator, and a score.
Why do users want the ability to correct scores in LangSmith evaluation?
-When a language model (LLM) is used as the evaluator, it can fail to capture nuanced human preferences, so users need the ability to correct its scores.
What is RAG (Retrieval-Augmented Generation)?
-RAG is a technique that supports a language model with retrieved documents; LLM-as-judge evaluation is used for RAG and for string-to-string comparisons more generally.
Why are LLMs effective as evaluators in LangSmith?
-LLMs work effectively for RAG and similar string-comparison tasks, and many strong papers demonstrate their effectiveness.
What is an online evaluator?
-An online evaluator runs automatically every time the project receives a new run, and is used, for example, to flag egregious errors while the app is in production.
What does "recall" mean in the LangSmith evaluation process?
-Recall tests whether the retrieved documents contain facts relevant to the question; if any relevant fact is present, the score is 1.
What does "precision" mean in the LangSmith evaluation process?
-Precision evaluates how strictly the retrieved documents relate to the question; the score is 1 only when they contain only relevant information (see the worked example after this Q&A).
What is the benefit of using few-shot examples as one way to incorporate human feedback into the evaluation flow?
-Few-shot examples present human feedback as concrete examples, calibrating the evaluator toward more accurate scoring.
What is GPT-4o, which is used as an evaluator in LangSmith?
-GPT-4o is one of the language models used in the LangSmith evaluation process and is well suited to grading tasks.
What role does the feedback users provide when correcting a score play in the LangSmith evaluation process?
-The feedback users provide recalibrates the evaluator, enabling scoring that follows human preferences and requirements and thereby improving evaluation accuracy.
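A concrete illustration from the walkthrough below: a question about the ReAct agent retrieves four documents, three of which discuss ReAct and one of which does not. Under this binary rubric the recall grader scores 1, because at least one relevant fact is present, while the precision grader scores 0, because not every retrieved document is relevant.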
Outlines
😀 Introduction to LangChain and the evaluation system
Discusses LangChain and LangSmith evaluation, revisiting the four main parts: dataset, application, evaluator, and score. When an LLM (large language model) is used as the evaluator in particular, it can fail to capture nuanced human preferences, so the section covers how to bring human feedback into the evaluation flow. Using a RAG (Retrieval-Augmented Generation) example, it explains document grading and how to improve the evaluation.
🔍 Online evaluators and leveraging feedback
Introduces the different ways to use LLMs as judges in a RAG pipeline, sets up document grading as an online evaluator, and explains how human corrections can be used to improve it. By creating online evaluators and adding both precision and recall document evaluators to the project, feedback is obtained in real time.
📝 Concrete examples of corrections and feedback to the evaluator
Reviews the evaluator's feedback and judges whether specific documents are relevant to the question. Through a correction, the final document is judged not to explain how ReAct works in any detail, and that correction is provided as feedback to the evaluator. This process calibrates the evaluator to produce more accurate scores.
🛠 The power of feedback for LLM evaluators
Because LLM evaluators are effective and widely used, incorporating human feedback is extremely powerful. Using concrete correction examples, the section shows how an evaluator is calibrated to follow the scoring rubric humans want, and concludes that leveraging this capability is important when building LLM evaluators.
Keywords
💡LangChain
💡Evaluation
💡Recall
💡Online evaluator
💡RAG
💡Document grading
💡Human feedback
💡Precision
💡Few-shot examples
💡Calibration
Highlights
Discussed LangSmith evaluations and recapped the four main parts: dataset, application, evaluator, and score.
Users want to modify or correct scores given by the evaluator, especially when a large language model (LLM) is used as the evaluator.
Introduced the use of LLM evaluators in RAG (Retrieval-Augmented Generation).
Showed how to set up an online evaluator for document grading and use corrections to improve the evaluation.
Created a simple RAG bot that indexes blog posts and answers questions.
Showed how to add an evaluation rule to the project to perform a recall check.
Used GPT-4o as the evaluator because it is better suited to grading tasks.
Showed how to map the chain's outputs to the inputs of the evaluation prompt.
Explained the basic idea of the recall test: whether the documents contain information relevant to the question.
Discussed using corrections as few-shot examples to improve the evaluator.
Showed how to explicitly correct the evaluator's score through feedback.
Introduced the precision evaluator, used to assess how precisely relevant the documents are.
Demonstrated connecting the evaluator's output to the project for real-time feedback.
Showed how to review the evaluator's feedback and adjust it as needed.
Emphasized the importance of incorporating human feedback into the evaluator so scores match expectations.
Discussed calibrating the evaluator with concrete examples to produce scores more in line with expectations.
Emphasized the usefulness of online evaluators in production for real-time monitoring and improvement.
Summarized the role of online evaluators and human feedback in building evaluators that follow the intended scoring rubric.
Encouraged users to try online evaluators and human feedback to improve their evaluation process.
Transcripts
Hey, this is Lance from LangChain. We've been talking a lot about LangSmith evaluations, and recall the four major pieces: you have a dataset, you have some application you're trying to evaluate, you have some evaluator, and then you have a score. Now, one of the things we've seen very consistently is that users want the ability to modify or correct scores from the evaluator.

This is most true in cases, which we've talked about quite a bit previously, where you have an LLM-as-judge as your evaluator. We've talked about different types of evaluators: you can use human feedback, you can use heuristic evaluators, and LLM-as-judge is one that's particularly popular for things like RAG and really any kind of string-to-string comparison. LLM-as-judge is really effective there, and there are a lot of great papers on this, but we know that LLMs can make mistakes, and in the case of judging in particular, they often may not capture exactly the nuanced preferences that we as humans want to encode in them. So today we're going to talk about a few ways you can actually correct this and incorporate human feedback into your evaluator flow.

One of the most popular applications for LLM-as-judge evaluators is RAG. If we go down here, remember there are a bunch of different ways to use LLMs as judges within a RAG pipeline: you can evaluate the final answers, you can evaluate the documents themselves, and you can evaluate answer hallucinations relative to the documents. Today we're going to show how we can set up an online evaluator for our RAG bot that will do document grading, and then how we can actually use corrections to improve it.
So I'm going to go over to a notebook and create a RAG bot here. I'm just going to index a few blog posts. That all ran, and here's my RAG bot. It's super simple; it doesn't even use LangChain, it's just raw GPT-4o. I'm doing a document retrieval step, and I set my system prompt up here. It's a standard RAG prompt: you're a helpful assistant, use the following docs to answer the question. That's all we're doing here. Let's go ahead and run that once on a simple input question about the ReAct agent. This is a good question to ask because one of our documents in particular, this one right here, is a blog post about agents.

That went ahead and ran. Now we go to LangSmith, and we can see we have a new project with one trace. We can look at the trace and see that it contains the retrieve-docs step and the LLM invocation.
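To make the setup concrete, here is a minimal sketch of a RAG bot along these lines: index a blog post, retrieve documents, and call GPT-4o directly with a standard RAG system prompt. The URL, chunking parameters, LangChain indexing utilities, and the `@traceable` decorator are assumptions for illustration; the video's notebook calls raw GPT-4o, but its exact indexing and tracing code is not shown here.

```python
# Minimal sketch of a RAG bot like the one described above (illustrative, not the video's exact code).
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langsmith import traceable
from openai import OpenAI

# Index a blog post (example URL; the video indexes a few agent-related posts).
urls = ["https://lilianweng.github.io/posts/2023-06-23-agent/"]
docs = [d for url in urls for d in WebBaseLoader(url).load()]
splits = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)
retriever = Chroma.from_documents(splits, OpenAIEmbeddings()).as_retriever(search_kwargs={"k": 4})

client = OpenAI()
SYSTEM = "You are a helpful assistant. Use the following docs to answer the question."

@traceable(name="rag_bot")  # logs each call as a trace in a LangSmith project
def rag_bot(question: str) -> dict:
    # Document retrieval step.
    context = "\n\n".join(d.page_content for d in retriever.invoke(question))
    # Raw GPT-4o call with a standard RAG prompt.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"{SYSTEM}\n\nDocs:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    # Return both the answer and the retrieved context so the online
    # evaluators can map these outputs into their grading prompts.
    return {"answer": response.choices[0].message.content, "context": context}

print(rag_bot("How does the ReAct agent work?")["answer"])
```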
Great. Now let's say I want to build an evaluator for this project. I can go to Add Rules, and I'm going to call this one "recall" because I want to perform a recall check on the retrieved documents. I go to Online Evaluator, create the evaluator, look at the suggested prompts, and I see this one for document relevance recall, which is pretty nice. I'm going to go ahead and use GPT-4o; it's a better LLM for grading.

You're going to see a few things here that are kind of nice. This is basically setting up an evaluator that will run every time my project gets a run, which is really useful if you have an app in production and you want to, for example, grade responses and flag things that are egregiously wrong. This mapping allows me to take the outputs of my chain and map them into the inputs for this prompt. I can see my chain has two outputs: "answer" and "context", where context is just the retrieved docs that I pass through, and "question" is just the input question. Now my inputs and outputs are defined, and they map into my prompt right here as "facts" and "question".

What's going to happen in the prompt is that it grades relevance recall: I give it a question and a set of facts, and I'm basically asking for a score of 1 if any of the facts is relevant to the question. So again, this is a recall test. Recall basically means: do the documents contain facts that are relevant to my question? The documents can include lots of things that are not relevant, but as long as there's a kernel of relevance, I'm going to score it as 1. That's the main idea here.
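Paraphrasing the rubric (this is not the exact text of LangSmith's suggested "document relevance recall" prompt), the grading prompt has roughly this shape, with {facts} and {question} filled in by the output mapping described above:

```text
You are grading whether retrieved FACTS contain information relevant to a QUESTION.

FACTS: {facts}
QUESTION: {question}

Score 1 if ANY of the facts is relevant to the question, even if much of the
content is irrelevant. Score 0 only if none of the facts is relevant.
Return the score and a brief reasoning.
```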
Now I'm going to do something that's kind of nice: I'm going to use the "corrections as few-shot examples" option. This creates a placeholder that I'm going to use later, after I correct my evaluator. What it does is set up a placeholder right here that contains, basically, a set of facts, a question, reasoning, and a score. Facts and question are just two of the inputs to my prompt, so those come from the run; reasoning is an explanation for the corrected score that I'm going to give, so that comes from the human feedback. Together these form a few-shot example that I'm going to tell the grader to use when it considers the score. Up here I can add an instruction like: "Here are several examples and explanations to calibrate your scoring."

If we step back, what's really happening is that I'm creating a placeholder where I can incorporate human corrections into my prompt; that's all that's going on here. I'll also name the output score "recall", just so it's a little easier to read in our logging later on. Now I can use the preview button to see how this is going to look and make sure everything works. It just injects an example with facts, question, reasoning, and score, which is a placeholder for what the user will input later, followed by the actual input for this particular chain run. So this is just confirming that everything is hooked together correctly.
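Conceptually, once the corrections-as-few-shot-examples placeholder is added, the rendered prompt looks something like this sketch (field names and layout are illustrative, not the exact preview output):

```text
Here are several examples and explanations to calibrate your scoring:

FACTS: <facts from a previously corrected run>
QUESTION: <question from that run>
REASONING: <explanation the human entered with the corrected score>
SCORE: <corrected score>

(one block per correction)

FACTS: {facts}
QUESTION: {question}
```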
Cool, and I'm going to go ahead and continue. That's our recall grader; I'm going to save it. Let's add one more, which we'll call "precision". I select Online Evaluator, use GPT-4o again, and try the suggested prompt for document relevance precision. Again I'll change the key name for the output score, and I'll again use these few-shot examples, just formatting them slightly based on how we'd like them to look: precision, question, facts, that's fine. And again we'll instruct: "Here are some examples to calibrate your grading." Then the final piece is hooking things up: here are our chain outputs, context is the retrieved documents, and here's the input question. We're done, so this is our precision grader and we're all set. Now we have two graders attached to our project.
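Paraphrased the same way as the recall prompt, the document relevance precision rubric flips the requirement (again, this is not the exact suggested prompt text):

```text
You are grading how precisely the retrieved FACTS relate to a QUESTION.

FACTS: {facts}
QUESTION: {question}

Score 1 only if the facts contain exclusively information relevant to the
question. Score 0 if any of the retrieved documents is not relevant.
Return the score and a brief reasoning.
```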
I go back to my project, and now I'm going to mock being a user playing with this app. Let's go ahead and ask a couple of questions: how does the ReAct agent work? And I'll ask a few other relevant questions: what's the difference between the ReAct and reflection approaches? What are the types of agent memory, or LLM memory? And: what is the memory and retrieval model in the generative agent simulation? I'll run those. Cool, so we've just run a few different questions against our RAG bot.

What you're going to see is that these are all logged to our project, and over time we'll see feedback rolling in from our evaluators. Again, what's happened is that I set up a project and added two document evaluators, precision and recall, to it; these will grade the retrieved documents for precision and recall relative to the question. I'm just going to let those evaluators run on my four new example inputs.

Okay, great. We can see that online evaluator feedback has now rolled in against my four input questions. Now I can go ahead and review. My outputs contain the answer and the retrieved documents in the context, so I can quickly review this and decide whether I agree with my evaluator or not. The question was: how does a ReAct agent work? Let's look at my documents briefly; I can even search for "react" to make it quick. I can see this first document mentions ReAct twice, so that's pretty reasonable. The second one, yep, is definitely relevant. The third one looks right as well, talking about the ReAct action, observation, and thought loop. And this final one does not actually mention ReAct. In this particular case the scoring is: precision zero, recall one. I think that's about right, because the fourth document doesn't mention ReAct at all, so I'm happy with this. We close this down and look at the second one.
In this particular case it's talking about ReAct and reflection, two different approaches, and what the difference is. The first document clearly talks about ReAct, that's good. The second one also clearly talks about ReAct. The third one does not, but it mentions reflection. Then we can look at this last document: you can see it mentions ReAct as part of a kind of complementary approach, but it doesn't say much about how ReAct really works; it just says it combines CoT prompting with queries, in this case to Wikipedia. So I can be a little critical here, and what I'm going to say is that the precision is not one.

Here's how I do that: I can make a correction to my grader. I'm going to say the grade is zero, and, this is really nice, I can give it my feedback explicitly. What I'm going to say is: "The final document does mention ReAct, but it doesn't actually discuss how ReAct works in any level of detail, as opposed to the other docs, which discuss the ReAct reasoning loop more specifically. For this reason I do not give it a precision score of one." That's it. So I go ahead and update that.
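The video makes this correction through the LangSmith UI. For reference, here is a rough sketch of logging a corrected score and an explanation from code with the LangSmith Python client; the run ID is a placeholder, and whether programmatic feedback feeds the few-shot corrections exactly like a UI correction is not covered in the video.

```python
# Rough sketch: attaching a corrected score plus an explanation to a run from code
# (the video performs this step through the LangSmith UI instead).
from langsmith import Client

client = Client()

client.create_feedback(
    run_id="<run-id-of-the-graded-trace>",  # placeholder, not a real ID
    key="precision",                        # the evaluator's feedback key
    score=0,                                # corrected score
    comment=(
        "The final document does mention ReAct but doesn't discuss how it "
        "works in any detail, unlike the other docs."
    ),
)
```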
Cool, so now we've basically said: look, I consider this last document to be a bit of a false positive. I get that it says the word ReAct, which is probably why it was retrieved, but it doesn't really talk about the functioning of the ReAct agent in any level of detail. Again, this is just an example of the kind of feedback you can give. We can go back to our evaluators, look at the precision evaluator, scroll down to the few-shot dataset, and see that it now includes our correction and our explanation. Let's look at the preview to see how this will be formatted: "Here are some examples to calibrate your grading," then a whole bunch of facts, then the question, then "the final document does mention ReAct but doesn't specifically discuss how it works, so I do not give it a precision score of one," precision zero. So it pulled in our feedback nicely; it's now part of the few-shot examples included in our evaluation prompt.

Now let's check. Let's rerun on this question and see whether our example was correctly captured by our evaluator. Great, we can see our evaluator just ran. Again, the question we asked was: what's the difference between the ReAct and reflection approaches? The scoring now is precision zero, recall one. The last time we asked this question, we corrected the precision to zero and provided that as feedback; now the evaluator is correctly calibrated and its scores are zero and one. Of course, this is a case of overfitting, I completely understand that: we've literally added this particular example to the few-shot prompt. But it highlights that gathering feedback and corrections and rolling them into your evaluator's few-shot examples can actually calibrate it to produce scores that are more aligned with what you want.

So it's a really useful tool for building LLM-as-judge evaluators that adhere to the kind of scoring rubric you actually want. This matters because it's often tricky to get the correct scoring just by prompting; giving the evaluator specific examples from human corrections is really powerful. That's the big insight here. It's a really useful feature, and particularly because LLM-as-judge evaluators are so effective and increasingly widely used, the ability to incorporate feedback easily is a really powerful and nice tool. I encourage you to play with it. Thanks.