Pre-Built Evaluators | LangSmith Evaluations - Part 5
Summary
TLDR: This video script explains how to create a dataset and evaluate against it. The dataset contains question-answer pairs, which are used with a built-in LangChain evaluator to assess the model's correctness. Of particular note is the CoT QA evaluator, which uses chain-of-thought reasoning to judge whether the model's generation matches the ground-truth answer. This process is very useful for accurately understanding dataset quality and model performance, and it will be explored further in the next video.
Takeaways
- 📚 The importance and value of datasets are explained.
- 🛠️ How to build a dataset is introduced, starting from manually curated question-answer pairs.
- 🔍 How to evaluate against a dataset: inputs and outputs are compared, and an evaluator computes a score.
- 📈 The world of evaluators is broad: there are custom evaluators and built-in evaluators.
- 🏆 The advantages of the CoT (chain-of-thought) evaluator are highlighted: it uses step-by-step reasoning to grade answers.
- 🔧 The dataset is named "dbrx" and was created using the SDK.
- 🔎 Building datasets from user logs is also suggested, i.e., creating datasets from real user data.
- 📊 Evaluation results are displayed alongside metrics such as latency (P50, P99) and error rate.
- 👀 Each test result can be inspected in detail: the input question, the reference answer, the model's output, and the evaluator's score.
- 🛠️ Evaluator setup and usage are explained, with an example configuration using OpenAI.
- 📝 The script covers defining the dataset, choosing an evaluator, running the test, and analyzing the results.
Q & A
Is it explained why evals are important and interesting?
-The first video lays out why evals are important and interesting.
What are the core LangSmith primitives used to build datasets?
-The second video explains the core LangSmith primitives used to build datasets.
Is dataset creation explained?
-Yes, it explains how to create a dataset from manually curated question-answer pairs.
How was the dbrx dataset created?
-The dbrx dataset was created by manually writing a number of question-answer pairs based on a specific blog post and adding them to a dataset.
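A minimal sketch of how such a dataset might be created with the LangSmith SDK is shown below. The description string and the question-answer pair are illustrative placeholders, not the actual curated pairs from the video.

```python
from langsmith import Client

client = Client()  # reads LANGCHAIN_API_KEY from the environment

# Create the dataset; the description is an assumption for illustration.
dataset = client.create_dataset(
    dataset_name="dbrx",
    description="QA pairs curated from the DBRX announcement blog post",
)

# Add manually curated question-answer pairs as examples.
client.create_examples(
    inputs=[{"question": "What is DBRX?"}],
    outputs=[{"answer": "DBRX is an open, general-purpose LLM released by Databricks."}],
    dataset_id=dataset.id,
)
```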
Is it explained how to create a dataset from user logs?
-Yes, it explains how to take real user questions as actual user data and convert them into a dataset with ground-truth responses.
Once the dataset exists, is it explained how to evaluate the LLM?
-Yes: the dataset and the LLM's output are passed to an evaluator, which returns an evaluation result and score.
How broad is the world of evaluators?
-Custom evaluators, built-in evaluators, and evaluation with or without labels are all touched on.
What characterizes the CoT QA evaluator?
-The CoT QA evaluator uses chain-of-thought reasoning to compare the LLM-generated answer with the ground-truth answer and judge whether they match.
Is it explained how to actually evaluate an LLM using the CoT QA evaluator?
-Yes, it walks through evaluating the LLM against the dataset's question-answer pairs with the CoT QA evaluator, explaining the process and results in detail.
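A hedged sketch of what this looks like with the LangSmith SDK, assuming the `evaluate` API from `langsmith.evaluation`. The target function `answer_dbrx_question` here is a stand-in stub for the app shown in the video; it takes a dataset input dict and returns an answer dict.

```python
from langsmith.evaluation import evaluate, LangChainStringEvaluator

def answer_dbrx_question(inputs: dict) -> dict:
    # Stand-in for the real app; see the OpenAI version sketched in the transcript.
    return {"answer": "stub answer about DBRX"}

# Built-in LangChain evaluator: grades the answer against the reference
# using chain-of-thought reasoning.
cot_qa_evaluator = LangChainStringEvaluator("cot_qa")

results = evaluate(
    answer_dbrx_question,        # app under test: dict in -> dict out
    data="dbrx",                 # dataset name in LangSmith
    evaluators=[cot_qa_evaluator],
    experiment_prefix="test-qa-oai",
)
```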
Is it explained how to review evaluation results?
-Yes, it explains how to review the dataset's test results and check metrics such as latency and error rate.
Is it explained how to dig deeper into the evaluation results?
-Yes, it explains how to click into each test result to see details and understand how the evaluator behaved.
Outlines
📌 Introduction to dataset construction and evaluation
This section explains how to build a dataset and evaluate a model against it. The dataset contains question-answer pairs, which are used to assess model performance. It also covers building a dataset manually from a blog post and building one from user logs. For evaluation, the dataset's input-output pairs are used: the app produces an output, which is compared against the reference output, and a score is returned.
📈 Analyzing evaluation results and understanding errors
This section explains how to analyze evaluation results and understand errors. The results provide metrics such as latency and error rate, which give a picture of model performance. Using concrete question-answer pairs, you can inspect how the model behaved and how accurate its answers were. It also covers evaluation with chain-of-thought reasoning, in which a grader compares the model's generation with the reference answer to reach a final verdict.
Keywords
💡LangChain
💡Evaluation
💡Data Set
💡Question Answer Pairs
💡Built-in Evaluators
💡Chain of Thought Reasoning
💡Ground Truth
💡SDK
💡User Logs
💡Latency
💡Error Rate
Highlights
Why evals are important and interesting is discussed in the first video.
The core LangSmith primitives are introduced in the second video.
A dataset is built from manually curated question-answer pairs based on a blog post about the new Databricks model.
The dataset created is named dbrx and was built using the SDK.
Another method for dataset building is from user logs, which is useful for creating datasets with ground truth responses.
Evaluation of a language model (LM) against a dataset is explained, involving an app that produces output and an evaluator that performs assessment.
Evaluators can be custom or built-in, and there are various types available for different use cases.
The built-in evaluator 'CoT QA' is highlighted for its chain-of-thought reasoning capabilities.
CoT QA is particularly effective for question-answer pairs and can compare the LLM-generated answer to the ground truth.
Powerful models like Claude Opus or OpenAI GPT-4 can be used as the grader LLM for CoT QA (see the sketch after this list).
The process of evaluating the LLM involves plumbing questions into the LLM, obtaining answers, and comparing them to ground truth.
Metrics such as latency, error rate, and evaluation scores are surfaced with the test results.
The evaluator output includes reasoning and a final score, allowing for easy auditing of the grader's decisions.
The transcript outlines a clear and efficient method for LM evaluation using built-in evaluators and datasets.
The use of built-in LangChain evaluators simplifies the process and avoids reimplementation from scratch.
The practical application of the method is demonstrated through the use of the dbrx dataset and the CoT QA evaluator.
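A hedged sketch of how a stronger grader model might be plugged in, assuming the `LangChainStringEvaluator` wrapper accepts a `config` dict with an `llm` entry; the model name is illustrative.

```python
from langchain_openai import ChatOpenAI
from langsmith.evaluation import LangChainStringEvaluator

# Use a strong model, deterministically, as the LLM-as-judge grader for cot_qa.
grader_llm = ChatOpenAI(model="gpt-4", temperature=0)
cot_qa_evaluator = LangChainStringEvaluator("cot_qa", config={"llm": grader_llm})
```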
Transcripts
Hi, this is Lance from LangChain. This is our fifth video on LangSmith evaluations. Our first video laid out why evals are important and interesting, and our second video laid out the core LangSmith primitives we'll be working with. We just talked through two important concepts: building a dataset from a set of manually curated, in our case, question-answer pairs. We built a dataset based on this blog post about the new Databricks model; I basically manually built a number of question-answer pairs from that blog post and added them to my own dataset. That dataset was called dbrx, and I used the SDK to create it. I also showed how to build a dataset from user logs, which is really useful if you want to take a bunch of actual user data, like user questions, and convert them into a dataset with ground-truth responses for future testing. That's another really useful and common technique for dataset building.

So now let's get into evaluation. Here's the question: I've built my dataset; how do I actually evaluate my LLM against it? In the second video we talked about this information flow, but I'll reiterate it briefly. We have a dataset, and the dataset has examples; in my case, input-output pairs (question and answer). I have an app, and we'll see shortly that I have a little example app. That app sees an input from my dataset and produces an output. I also have the ground-truth output in the dataset. I pass the ground-truth output and the app output to an evaluator, and it performs some evaluation and returns a score. That's it.

Now here's where it gets a bit interesting: the world of evaluators is actually pretty broad, and we've touched on this in a few other videos. There are custom evaluators and built-in evaluators, and within built-in evaluators there are evaluators for cases with labels and without labels. In my particular case I have labels, and I want to use a built-in LangSmith evaluator. We have a bunch of them listed here, and I'll go over and show you. Off-the-shelf LangChain evaluators are a really nice way to go, so you don't have to reimplement something from scratch. For question answering — again, my dataset is question-answer pairs — I want an evaluator that operates on question-answer pairs. Here are a few different popular ones: QA, Context QA, CoT QA.

The high-level point is this: CoT QA is often a very nice evaluator because it uses chain-of-thought reasoning to look at the LLM-generated answer versus the ground truth and evaluate whether or not they match. Typically, for the grader LLM you use a very powerful model, like maybe Claude Opus, or you might use OpenAI GPT-4, for example. But that's the high-level idea: you're using chain-of-thought reasoning to determine the final verdict.
So let's actually walk through what's going on here. I'm going to pick that CoT QA as my evaluator. Remember, I built my dataset dbrx; let's go over and have a quick look at that. If I go over to my LangSmith, I go to Datasets & Testing, search for dbrx, and here it is. I have my dataset, I've done no evaluations against it, and it has four examples. So this is where I am currently, and that's my dataset name. Remember, I built this function to answer Databricks questions, which I define up here using OpenAI — very simple.
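A hypothetical sketch of that answer function, assuming the blog-post text is loaded into a `context` string and the dataset inputs use a `question` key; the names are illustrative, not the exact code from the video.

```python
from openai import OpenAI

oai_client = OpenAI()

# Placeholder for the DBRX blog-post text the video stuffs into the prompt.
context = "DBRX is an open, general-purpose LLM created by Databricks. ..."

def answer_dbrx_question_oai(inputs: dict) -> dict:
    """Answer a question about DBRX using stuffed website context."""
    system_msg = "Answer user questions about DBRX using this context:\n\n" + context
    response = oai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": inputs["question"]},
        ],
    )
    return {"answer": response.choices[0].message.content}
```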
I'm plumbing in my dataset name and my evaluator, adding a little prefix (this is "test QA OpenAI"), and also adding some metadata — noting, for example, that I'm plumbing website context into GPT-3.5-turbo. That's all that's going on here, and this evaluate call is all I need. I kick this off, and it kicks off an evaluation.
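Put together, the kickoff looks roughly like this. This is a sketch: the prefix and metadata values approximate what is shown on screen, and `answer_dbrx_question_oai` is the hypothetical chain sketched above.

```python
from langsmith.evaluation import evaluate, LangChainStringEvaluator

experiment_results = evaluate(
    answer_dbrx_question_oai,                  # the simple OpenAI chain above
    data="dbrx",                               # my dataset name
    evaluators=[LangChainStringEvaluator("cot_qa")],
    experiment_prefix="test-qa-oai",           # prefix shown in the Tests tab
    metadata={"context": "website", "model": "gpt-3.5-turbo"},
)
```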
Again, think about what's happening; look at the flow here. All that's happening is I have those four questions, and each of those questions goes into my basic LLM, which is this answer chain. Each question gets plumbed into it. Here's a good example right here: we plumb in a question and get an answer out. It's just doing that behind the scenes. So we take those four questions, plumb them in, and get answers out. For every one of those answers, I go to the dataset and fetch the ground-truth answer. Again, we can see them here: here's one of our inputs, this input gets plumbed into our LLM, and we have this ground-truth output. That's it. So let's go back; hopefully that ran.
It ran. Now here's what I do: I go to my dataset and look at Tests (I'm going to move my little head here). Now I have a test, and you can see the prefix we added is right here in the name. We can see some metrics here — latency (P50, P99) — and we can see things like error rate, our evaluation metric, and so forth. So let's actually dive in. This is where you can really have a lot of fun and do a lot of inspection of your results. Here's what's going on.
what's going
on the input here is the question that I
plummed in right go back to our flow the
input is this thing it's just my
question all right the reference output
go back to the flow is this thing it's
just basically the correct answer okay
so I have the input question I have the
reference output now here's what my
model returned so this is what we're
assessing we're assessing this reference
versus what I what I returned using that
coot QA evaluator so behind the scenes
uh let's actually dig into that so
there's two things I can click on here
this open runs thing opens up just my
chain okay so this is my chain um again
which we defined up here so it's this
answered question with open AI thing so
that's just this running on our input
there's all the
context and here's the question that got
plumed in here's the answer so if I kind
of go back um that's what that run is
that's all it's happening there now I
might want to know what did this gr how
did the grader work what actually
happened there so if I click on this
little arrow it'll take me to the
evaluator run and that's going to open
up right here so this is the evaluator
that we used off the shelf we didn't
have to write this or anything we can
actually go we can see we're using open
AI as the eval which is fine and here's
actually The Prompt this is very useful
your teacher beinging a quiz blah blah
blah it gives you a bunch of like
criteria um okay so
basically um what's happening is this is
the greater
prompt and you're seeing the question
and the context and the student answer
so the context gives you the ground
truth answer the student answer is what
the model turned and then here's the
output here's like the reasoning and
here's the score so this is really nice
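If you want to poke at that grader outside of LangSmith, the same chain can be loaded from LangChain directly. A minimal sketch, assuming the documented `load_evaluator` / `evaluate_strings` interface; the strings are illustrative.

```python
from langchain.evaluation import load_evaluator

# Loads the chain-of-thought QA grader; defaults to an OpenAI chat model.
grader = load_evaluator("cot_qa")

result = grader.evaluate_strings(
    input="What is DBRX?",                              # the question
    reference="An open LLM released by Databricks.",    # ground truth ("context")
    prediction="DBRX is Databricks' open-source LLM.",  # the model's answer
)
print(result)  # includes the chain-of-thought "reasoning" and a 0/1 "score"
```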
This is really nice: you can audit the grader really easily. So if I zoom all the way back out, what's going on here? I defined the dataset; my inputs are here, my reference outputs are here, my LLM generations are here, and my scores are all listed — one or zero in this case. I can dig into each one to understand what the evaluator did, and I can also dig into my generations using that "open run" to see how they worked.

If I zoom all the way back out of the stack, what are we doing here? We're doing evaluation against our dataset with a built-in LangSmith evaluator. This was the information flow, and going all the way down, what did we just do? We had our dataset of Databricks example questions — four of them — and we used LLM-as-a-judge with a built-in LangChain evaluator against the ground-truth answers we provided in our dataset. We basically just did an LLM evaluation. That's it. We're going to be building on this in the next video. Thanks.
More Related Videos
RAG Evaluation (Answer Correctness) | LangSmith Evaluations - Part 12
Eval Comparisons | LangSmith Evaluations - Part 7
Manually Curated Datasets | LangSmith Evaluations - Part 3
Custom Evaluators | LangSmith Evaluations - Part 6
Why Evals Matter | LangSmith Evaluations - Part 1
Datasets From Traces | LangSmith Evaluations - Part 4