Corrections + Few Shot Examples (Part 1) | LangSmith Evaluations

LangChain
26 Jun 2024 · 15:16

Summary

TLDR: The video explains how to use language models (LLMs) as evaluators, and why precision and recall matter when grading retrieved documents. LLM-as-judge evaluators are especially popular in RAG (Retrieval-Augmented Generation) pipelines, and the video shows how to improve their accuracy by fine-tuning them with human feedback. It walks through setting up an online evaluation system, letting users correct evaluation results, and improving the evaluator based on those corrections. This approach yields LLM evaluations that follow a more accurate, human-aligned scoring rubric.

Takeaways

  • 📈 The video discusses common problems when using large language models (LLMs) as evaluators, and how to correct them.
  • 🔍 LLMs function as very effective evaluators, but they sometimes fail to capture the nuanced preferences a human grader would apply.
  • 🛠️ Incorporating human feedback can improve an LLM-based evaluation flow.
  • 📝 LLM-as-judge evaluators are especially popular in RAG (Retrieval-Augmented Generation) pipelines.
  • 📑 The video demonstrates document grading for RAG and how to set up the corresponding online evaluators.
  • 🔄 Online evaluators run automatically, so you get feedback in real time even while the application is live.
  • 📝 It explains how to add evaluation rules, create online evaluators, and improve evaluations using feedback.
  • 🔧 Human-provided feedback can be used to fine-tune an evaluator toward more accurate scoring.
  • 📈 The video shows how to set up both recall and precision evaluators and attach them to a project.
  • 🔗 Rolling feedback back into the evaluator is a powerful way to tune LLM judges toward more accurate scores.
  • 📚 Throughout, the video emphasizes building evaluation systems that use LLMs effectively while incorporating human feedback.

Q & A

  • What are LangSmith evaluations?

    -LangSmith evaluation is the process of evaluating datasets and applications based on four major components: a dataset, the application being evaluated, an evaluator, and a score.

  • Why do users want the ability to correct scores in LangSmith evaluations?

    -Especially when a language model (LLM) is used as the evaluator, the model may fail to capture nuanced human preferences, so users need the ability to correct its scores.

  • What is RAG (Retrieval-Augmented Generation)?

    -RAG is a technique in which retrieved documents are used to support a language model's answers; it is one of the most popular settings for LLM-based evaluation.

  • Why are LLMs effective as evaluators in LangSmith?

    -LLM-as-judge works well for RAG and similar string-to-string comparison tasks, and its effectiveness has been demonstrated in many strong papers.

  • What is an online evaluator?

    -An online evaluator is an evaluation that runs automatically every time the project runs; it is useful, for example, for flagging egregiously wrong responses while the app is in production.

  • What does "recall" mean in the LangSmith evaluation process?

    -Recall tests whether the retrieved documents contain facts relevant to the question; if any relevant facts are present, the score is 1.

  • What does "precision" mean in the LangSmith evaluation process?

    -Precision evaluates how relevant the documents are to the question; the score is 1 only when the documents contain only relevant information.

  • What is the benefit of using few-shot examples as a way to incorporate human feedback into the evaluation flow?

    -Few-shot examples present human feedback as concrete examples, calibrating the evaluator toward more accurate scoring.

  • What is the GPT-4o model used as the evaluator in LangSmith?

    -GPT-4o is one of the language models used in the LangSmith evaluation process; it is well suited to grading tasks.

  • What role does the feedback users provide when correcting scores play in the LangSmith evaluation process?

    -User-provided feedback recalibrates the evaluator, enabling it to score according to human preferences and requirements and thereby improving evaluation accuracy.

Outlines

00:00

😀 Introduction to LangChain and the evaluation setup

Recaps LangSmith evaluations and their four major parts: dataset, application, evaluator, and score. When an LLM is used as the evaluator in particular, it may fail to capture nuanced human preferences, so the video discusses how to bring human feedback into the evaluation flow. Using a RAG (Retrieval-Augmented Generation) example, it explains document grading and how to improve the evaluator.

05:01

🔍 Online evaluators and putting feedback to work

Surveys the ways an LLM can serve as a judge in a RAG pipeline, sets up online evaluators for document grading, and explains how to improve them using human corrections. By creating online evaluators and adding both precision and recall document graders to the project, feedback rolls in in real time.

10:03

📝 A concrete correction to the evaluator

Reviews the evaluator's feedback and judges whether particular documents are relevant to the question. Through a correction, the final document is judged not to explain how ReAct works in any detail, and that correction is provided as feedback to the evaluator. This process calibrates the evaluator to produce more accurate scores.

15:04

🛠 Why correction feedback makes LLM evaluators strong

Because LLM judges are effective and widely used, incorporating human feedback is very powerful. Through a concrete correction example, the video shows how an evaluator is adjusted to follow the scoring rubric a human wants, and concludes that building LLM evaluators with this feature is important.

Keywords

💡LangChain

LangChain is a framework (and company) for building applications with language models. In this video it provides the context for LangSmith, the platform used to evaluate a RAG application's retrieved documents and improve the evaluators that grade them.

💡Evaluation

"Evaluation" means measuring the performance of a dataset or application. The video discusses using human feedback to fine-tune an automated evaluation process and obtain more accurate results.

💡Recall

"Recall" measures whether the retrieved content includes information relevant to a given question. The video explains how a recall check evaluates whether the documents provide information relevant to the question.

💡Online evaluator

An "online evaluator" is a system that evaluates data in real time as runs come in; the video shows how to use one to grade document relevance inside a RAG (Retrieval-Augmented Generation) pipeline.

💡RAG

"RAG" stands for Retrieval-Augmented Generation; in this video a RAG pipeline is the example application whose retrieved documents are graded for relevance by online evaluators.

💡Document grading

"Document grading" is the process of scoring how relevant retrieved documents are to the question. The video shows online evaluators grading documents for recall and precision, then improving based on feedback.

💡Human feedback

"Human feedback" is the process of feeding human judgments and opinions into a system; the video discusses using this feedback to improve the accuracy of online evaluators.

💡Precision

"Precision" indicates how much of what was retrieved is actually relevant. The video explains how the precision check evaluates whether the documents provide only information relevant to the question.

💡Few-shot examples

"Few-shot examples" are a small number of examples placed in a model's prompt to steer its behavior. The video shows how human corrections are used as few-shot examples to fine-tune the evaluation process.

💡Calibration

"Calibration" is the process of adjusting initial evaluation results based on human judgment. The video shows how precision and recall scores produced by online evaluators are corrected by a human to obtain more appropriate evaluations.

Highlights

Reviews the four major parts of LangSmith evaluations: dataset, application, evaluator, and score.

Users want to modify or correct scores given by the evaluator, especially when a large language model (LLM) is the judge.

Introduces the use of LLM judges in RAG (Retrieval-Augmented Generation).

Shows how to set up online evaluators for document grading and how to use corrections to improve them.

Creates a simple RAG bot that indexes blog posts and answers questions.

Shows how to add an evaluation rule to a project to perform a recall check.

Uses GPT-4o as the evaluator because it is better suited to grading tasks.

Shows how to map the chain's outputs to the inputs of the evaluation prompt.

Explains the basic idea of the recall test: whether the documents contain information relevant to the question.

Discusses using corrections as few-shot examples to improve the evaluator.

Shows how to correct the evaluator's score explicitly through feedback.

Introduces the precision evaluator, which grades the relevance precision of the documents.

Demonstrates connecting the evaluators' outputs to the project for real-time feedback.

Shows how to review the evaluators' feedback and adjust as needed.

Emphasizes the importance of incorporating human feedback into evaluators so that scores match expectations.

Discusses calibrating the evaluator with concrete examples to produce scores more in line with expectations.

Highlights the practicality of online evaluators in production for real-time monitoring and improvement.

Summarizes the roles of online evaluators and human feedback in building evaluators that follow the intended scoring rubric.

Encourages users to try online evaluators and human feedback to improve their evaluation process.

Transcripts

00:01

Hey, this is Lance from LangChain. We've been talking a lot about LangSmith evaluations. Recall the four major pieces: you have a dataset, you have some application you're trying to evaluate, you have some evaluator, and then you have a score.

00:14

Now, one thing we've seen very consistently is that users want the ability to modify or correct scores from the evaluator. This is most true in cases, which we've talked about quite a bit previously, where you have an LLM-as-judge as your evaluator. We've talked about different types of evaluators: you can use human feedback, you can use heuristic evaluators, and LLM-as-judge is one that's particularly popular for things like RAG and really any kind of string-to-string comparison. LLM-as-judge is really effective, and there are a lot of great papers on this, but we know LLMs can make mistakes; in particular, when judging, they often may not capture exactly the nuanced preferences that we as humans want to encode in them.

01:00

So today we're going to talk about a few ways you can actually correct this and incorporate human feedback into your evaluator flow. One of the most popular applications for LLM-as-judge evaluators is RAG. Remember, there are a bunch of different ways to use LLMs as judges within a RAG pipeline: you can evaluate the final answers, you can evaluate the retrieved documents themselves, and you can evaluate answer hallucination relative to the documents. Today we're going to show how to set up an online evaluator for a RAG bot that does document grading, and then how to use corrections to improve it.

01:41

I'm going to go over to a notebook and create a RAG bot. I'm just going to index a few blog posts. That all ran, and here's my RAG bot. It's super simple; it doesn't even use LangChain. This is just raw GPT-4o: I do a document retrieval step, and I set my system prompt up here. It's a standard RAG prompt: "You're a helpful assistant; use the following docs to answer the question." That's all we're doing here.

02:07

Let's run that once on a simple input question about the ReAct agent. This is a good question to ask because one of our documents in particular, this one right here, is a blog post about agents. That ran, and now we go to LangSmith. We can see we have a new project with one trace. We can look at the trace and see that it contains the retrieve-docs step and the LLM invocation. So that's great.
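The notebook itself isn't shown line by line, so here is a minimal sketch of what a RAG bot like this might look like. The toy corpus, retrieval logic, and prompt wording are illustrative assumptions; only the overall shape (retrieve docs, stuff them into a standard RAG system prompt, call GPT-4o, trace runs to LangSmith) comes from the video. It assumes OPENAI_API_KEY and the LangSmith tracing environment variables are set.

```python
# Minimal RAG bot sketch (illustrative assumptions, not the video's code).
from langsmith import traceable
from openai import OpenAI

openai_client = OpenAI()

# Toy corpus standing in for the indexed blog posts.
BLOG_CHUNKS = [
    "ReAct interleaves reasoning traces with actions: the agent emits a "
    "thought, takes an action, observes the result, and repeats.",
    "Reflexion adds a self-reflection step so the agent can critique and "
    "revise its own past trajectories.",
]

@traceable(run_type="retriever")
def retrieve_docs(question: str) -> list[str]:
    # Stand-in for a real vector-store lookup: rank chunks by keyword overlap.
    q_words = set(question.lower().split())
    return sorted(BLOG_CHUNKS, key=lambda d: -len(q_words & set(d.lower().split())))

@traceable
def rag_bot(question: str) -> dict:
    docs = retrieve_docs(question)
    # A standard RAG system prompt, per the video.
    system = ("You are a helpful assistant. Use the following docs to answer "
              "the question:\n\n" + "\n\n".join(docs))
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": question}],
    )
    # Return the answer plus the retrieved context, since the online
    # evaluators will later map these outputs into their grading prompts.
    return {"answer": response.choices[0].message.content, "context": docs}

print(rag_bot("How does the ReAct agent work?")["answer"])
```

Returning both the answer and the context is the design choice that matters here: the online evaluators configured next need the retrieved docs exposed as a chain output so they can be mapped into the grading prompt.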

02:40

Now let's say I want to build an evaluator for this project. I can go to "Add rules". I'm going to call this one "recall", because I want to perform a recall check on the retrieved documents. I go to online evaluator, create the evaluator, look at the suggested prompts, and I see this one for document relevance recall, which is pretty nice. I'm going to use GPT-4o; it's a better LLM for grading.

03:05

You're going to see a few nice things here. This basically sets up an evaluator that will run every time my project runs, which is really useful if you have an app in production and you want to, for example, grade its responses and flag things that are egregiously wrong. This mapping lets me take the outputs of my chain and map them into the inputs of the prompt. I can see my chain has two outputs, the answer and the context; the context is just the retrieved docs, which I pass through. The question is just the input question. So now my inputs and outputs are defined, and they map into the prompt right here as "facts" and "question".

03:50

Now, what's going to happen in this prompt? It grades relevance recall: I give it a question and a set of facts, and I'm basically asking for a score of one if any of the facts are relevant to the question. Again, this is a recall test. Recall basically means: do the documents contain facts that are relevant to my question? The documents can include lots of things that are not relevant, but as long as there's a kernel of relevance, I'll score that as one. That's the main idea here.
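The suggested prompt's exact text isn't read out in full, so the template below is a rough paraphrase of a document relevance recall rubric consistent with what's described (score 1 if any fact is relevant); treat the wording as an assumption.

```python
# Paraphrased "document relevance recall" rubric; {facts} and {question}
# are filled by the output/input mapping configured in the online evaluator.
RECALL_GRADER_PROMPT = """You are grading the recall of retrieved facts.

FACTS:
{facts}

QUESTION:
{question}

Give a binary score. Score 1 if ANY fact is relevant to the question, even if
other facts are unrelated. Score 0 only if no fact is relevant."""
```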

04:22

Now I'm going to do something that's kind of nice: I'm going to use corrections as few-shot examples. This creates a placeholder that I'll use later, after I correct my evaluator. What it does is set up a placeholder right here that contains a set of facts, a question, reasoning, and a score. What's going on here? Facts and question are just two of the inputs to my prompt, so those are provided by the user. Reasoning is the explanation for the corrected score that I'm going to give, so that comes from the human feedback. Together these form a few-shot example that I'm going to tell the grader to use when it considers its score. So I can go up here and write: "Here are several examples and explanations to calibrate your scoring."

05:34

If we step back, what's really happening is that I'm creating a placeholder where I can incorporate human corrections into my prompt. That's really all that's going on here. I'll name the output score "recall", just so it's a little easier in our logging later on. Now I can use the preview button to see how this is going to look and make sure everything works. This is pretty cool: it injects an example (facts, question, reasoning, score), which is just a placeholder for what the user will input (these will of course be input by me later), and then here is the actual input for this particular chain example. So this confirms that everything's hooked together correctly. Cool, I'll continue. That's our recall grader, and I'll save it.
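As a sketch of what the "corrections as few-shot examples" option adds, the saved prompt gains a placeholder block that LangSmith later fills with your corrections; the rendering below is an illustrative assumption, not the literal template.

```python
# Hypothetical rendering of the recall prompt once the corrections
# placeholder is added. Each injected example carries the facts and question
# from a corrected run, plus the human's reasoning and corrected score.
CALIBRATED_RECALL_PROMPT = """Here are several examples and explanations to
calibrate your scoring:

{few_shot_examples}

FACTS:
{facts}

QUESTION:
{question}

Score 1 if any fact is relevant to the question, else 0."""
```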

06:39

Let's add one more; we'll call this one "precision". I choose online evaluator again, I'm going to use GPT-4o again, and I'll try the suggested prompt for document relevance precision. Again I'll change the key name for the output score, and I'll again use the few-shot examples; let's format them slightly, based on how we'd like them to look: precision, question, facts, that's fine. And again we'll instruct: "Here are some examples to calibrate your grading."

07:31

Let's do the final piece: we have to hook everything up. Here are our chain outputs; the context is the retrieved documents, and here's the input question. And we're done. This is our precision grader, and we're all set. We now have two graders attached to our project.
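The precision rubric is the stricter counterpart of recall: where recall rewards any kernel of relevance, precision should only pass a fully relevant set of documents. Again a paraphrase, not LangSmith's literal prompt:

```python
# Paraphrased "document relevance precision" rubric.
PRECISION_GRADER_PROMPT = """You are grading the precision of retrieved facts.

FACTS:
{facts}

QUESTION:
{question}

Give a binary score. Score 1 only if ALL the facts are relevant to the
question. Score 0 if any fact is unrelated or only tangentially related."""
```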

07:54

I go back to my project. Now I'm going to mock being a user playing with this app and ask a couple of questions: "How does the ReAct agent work?" I'll go ahead and ask a few other relevant questions too: "What's the difference between the ReAct and reflection approaches?", "What are the types of agent memory, or LLM memory?", and "What is the memory and retrieval model in the generative agent simulation?" I'll run those. Cool, so we've just run a few different questions against our app.

08:34

These are all now logged to our project, and over time we're going to see feedback rolling in from our evaluators. Again, what's happened is: I set up a project, I added precision and recall document evaluators to it, and they grade the retrieved documents for precision and recall relative to the question. I let those evaluators run on my four new example inputs. Okay, great: we can see online evaluator feedback has rolled in against my four input questions.
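The review that follows happens in the LangSmith UI; if you'd rather pull the same runs and evaluator feedback programmatically, a sketch with the langsmith SDK might look like this (the project name is a placeholder, and method availability may vary by SDK version):

```python
from langsmith import Client

client = Client()

# List root runs in the project, then fetch evaluator feedback attached to them.
runs = list(client.list_runs(project_name="rag-bot-demo", is_root=True))
for fb in client.list_feedback(run_ids=[run.id for run in runs]):
    print(fb.key, fb.score, fb.comment)
```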

09:15

Now I can go ahead and review. Let's move this over: my outputs contain the answer and the retrieved documents in the context field, so I can quickly check whether I agree with my evaluator or not. The question was "How does a ReAct agent work?" Let's look at my documents briefly; I can even search for "react" to make this quick. The first document mentions ReAct twice, so that's pretty reasonable. The second one: yep, definitely correct. The third one looks right too; it talks about the ReAct action, observation, and thought loop. And this final one does not actually mention ReAct. In this case the scoring is precision zero, recall one, and I think that's about right, because the fourth document doesn't mention ReAct at all. I'm happy with this, so let's close it down and look at the second one.

10:27

This one is about ReAct and reflection, two different approaches: what's the difference? Okay, the first document clearly talks about ReAct; that's good. The second one also clearly talks about ReAct, cool. The third one does not, but it mentions reflection.

10:51

Now look at this last document: it mentions IRCoT, kind of a complementary approach to ReAct, and it does mention ReAct, but it doesn't say much about how ReAct really works; it just says it combines CoT prompting with queries, in this case to Wikipedia. So I can be a little critical here, and I'm going to say the precision is not one. Here's how: I can make a correction to my grader. I set the grade to zero, and, this is really nice, I can give it my feedback explicitly. I write: "The final document does mention ReAct, but it doesn't actually discuss how ReAct works in any level of detail, as opposed to the other docs, which discuss the ReAct reasoning loop more specifically. For this reason I do not give it a precision score of one." That's it; I go ahead and update that.
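In the video this correction is entered directly in the LangSmith UI. For illustration, here is a hedged sketch of recording a similar corrected score with an explanation via the SDK's create_feedback; the run id and feedback key are placeholders.

```python
from langsmith import Client

client = Client()

# Attach a corrected precision score with an explicit explanation to a run.
# "RUN_ID" is a placeholder for the id of the run being corrected.
client.create_feedback(
    run_id="RUN_ID",
    key="precision",
    score=0,
    comment=(
        "The final document does mention ReAct, but it doesn't discuss how "
        "ReAct works in any level of detail, unlike the other docs, which "
        "cover the ReAct reasoning loop more specifically."
    ),
)
```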

12:26

Cool. So now we've basically said: I consider this last document to be a bit of a false positive. I get it, it says the word "ReAct", which is probably why it was retrieved, but it doesn't really talk about the functioning of the ReAct agent in any level of detail. Again, this is just an example of the kind of feedback you can give. Now we can go back to our evaluators, look at the precision evaluator, and scroll down to this few-shot dataset, which now actually includes our correction and our explanation.

12:59

Let's look at the preview and see how this is going to be formatted. "Here are some examples to calibrate your grading": here's a whole bunch of facts, here's the question, and then "the final document does mention ReAct but doesn't specifically discuss how it works, so I do not give it a precision score of one"; precision zero. Okay, cool. It pulled in our feedback nicely, and it's now part of the few-shot examples included in our evaluation prompt.
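Putting the pieces together, the injected few-shot example would render into the precision prompt along these lines (an illustrative reconstruction, not LangSmith's literal formatting; the facts placeholder stands in for the retrieved chunks):

```python
# Illustrative reconstruction of the correction as it appears in the
# few-shot block of the precision grading prompt.
FEW_SHOT_EXAMPLE = """FACTS: <the four retrieved document chunks>
QUESTION: What's the difference between the ReAct and reflection approaches?
REASONING: The final document does mention ReAct, but it doesn't actually
discuss how ReAct works in any level of detail, as opposed to the other docs,
which discuss the ReAct reasoning loop more specifically.
SCORE: 0"""
```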

13:36

Now let's check: let's rerun on this question and see whether our example was correctly captured by our evaluator. Great, our evaluator just ran. Again, the question we asked was "What's the difference between the ReAct and reflection approaches?", and our scoring now is precision zero, recall one. Compare that to the last time we asked this question, with our correction: we corrected the precision to zero and provided that as feedback, and now the evaluator is correctly calibrated; its scores are zero and one.

14:21

Of course, this is a case of overfitting, I completely understand that: we've literally added this particular example to the few-shot prompt. But it's a case where we can highlight that providing feedback and corrections, and rolling them into your evaluator's few-shot examples, can correctly calibrate it to produce scores that are more aligned with what you want. So it's a really useful tool for building LLM-judge evaluators that adhere to the kind of scoring rubric you actually want. This matters because it's often tricky to get the correct kind of scoring through prompting alone; giving the evaluator specific examples from human corrections is really powerful. That's the big insight here. It's a really useful feature, particularly because LLM-judge evaluators are so effective and increasingly widely used; the ability to incorporate feedback easily is just a really powerful and nice tool. I encourage you to play with it. Thanks!
