Eval Comparisons | LangSmith Evaluations - Part 7
Summary
TLDR: This video explains why evals are important and interesting, and recaps the series so far: LangSmith Primitives, creating manually curated datasets, building datasets from user logs, ways to evaluate datasets, using built-in LangChain evaluators, and building custom evaluators. It then focuses on a practical use case, comparing quality and performance between Mistral 7B running in a local environment and GPT-3.5 Turbo. It also touches on the usefulness of A/B testing and the flexibility of the LangSmith platform.
Takeaways
- 📚 This is the seventh video in the series, which opened by explaining why evals are important and interesting.
- 🔍 Earlier videos covered LangSmith Primitives and manually curated datasets.
- 💻 This video focuses on a real use case and need: comparing Mistral 7B running in a local environment against GPT-3.5 Turbo.
- 📈 It recaps evaluating a dataset using the built-in LangChain evaluators.
- 🔧 It also recaps building custom evaluators, and runs an A/B test on a real dataset.
- 🖥️ The video compares the performance of Mistral 7B running locally with OpenAI.
- 📊 The A/B test shows OpenAI ahead on score and latency, but Mistral 7B's answer quality is also analyzed in detail.
- 🔎 The video walks through comparing individual answers in detail to check which ones are correct.
- 🛠️ Dataset creation is revisited, covering approaches such as building datasets from user logs and manual curation.
- 📝 The use of evaluators and A/B tests is explained in detail, showing the flexibility to compare different parameters.
- 🚀 As the final video of the introductory part of the series, it summarizes the concepts covered so far and hints at deeper topics in upcoming videos.
Q & A
Why are evals important?
-Evals are essential for assessing the performance of language models. With an appropriate evaluation method, you can accurately measure how a model actually behaves in real-world use.
What are LangSmith Primitives?
-LangSmith Primitives are the basic tools and building blocks used in developing and evaluating language models, including creating datasets, defining questions, and choosing evaluation methods.
How do you create a manually curated dataset?
-To create a manually curated dataset, you use domain knowledge and experience to select data relevant to a particular task and organize it into an appropriate format. The process includes collecting, categorizing, and annotating the data.
How do you create a dataset from user logs?
-To create a dataset from user logs, you record how the application is actually used and extract users' questions and requests. From that information you build question-answer pairs and assemble a dataset that can be used to train or evaluate the model.
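As a rough illustration of that idea (not the exact workflow shown in the earlier video of the series), building a dataset from traced runs with the LangSmith Python client might look like the sketch below; the project name, dataset name, and filtering are hypothetical.

```python
# Minimal sketch, assuming an app already traces its runs to a LangSmith
# project; the project and dataset names here are hypothetical.
from langsmith import Client

client = Client()

# Pull recent top-level runs (user question -> app answer) from production.
runs = client.list_runs(project_name="my-production-app", is_root=True, limit=20)

# Copy each run's inputs/outputs into a new dataset as examples.
dataset = client.create_dataset("questions-from-user-logs")
for run in runs:
    client.create_example(
        inputs=run.inputs,
        outputs=run.outputs,
        dataset_id=dataset.id,
    )
```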
What are the evaluators mentioned as a way to evaluate against a dataset?
-Evaluators are tools used to grade performance against a dataset. Using different kinds of evaluators, you can quantitatively measure things such as answer correctness and the quality of the model's responses.
What does the built-in LangChain evaluator do?
-The LangChain evaluator automatically grades a language model's responses. Given a question and its reference answer, it judges how accurate the model's response is.
What is A/B testing?
-A/B testing is a method for comparing different versions of a product or service. Here it is used to compare different language models or prompts and determine which produces better results.
What does running Mistral 7B locally involve?
-Running Mistral 7B locally requires suitable hardware and software. In this case, the model is served through Ollama on an M2 Mac with 32 GB of RAM, while GPT-3.5 Turbo is accessed through the OpenAI API.
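For illustration, invoking a locally served Mistral model from LangChain might look like the sketch below, assuming Ollama is installed and the model has been pulled with `ollama pull mistral`; the example question is arbitrary.

```python
# Minimal sketch, assuming Ollama is running locally and the Mistral model
# has been downloaded with `ollama pull mistral`.
from langchain_community.chat_models import ChatOllama

llm = ChatOllama(model="mistral", temperature=0)
response = llm.invoke("What is the context window of the DBRX model?")
print(response.content)
```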
What were the results of comparing Mistral 7B and GPT-3.5 Turbo?
-In the comparison, GPT-3.5 Turbo via the OpenAI API came out ahead on both score and latency, but Mistral 7B still produced responses of reasonable quality while running locally.
What are the advantages of A/B testing?
-A/B testing lets you directly compare the effect of different models or prompts. This makes it easier to identify the best configuration and the direction for improvement, which in turn raises the quality of the product or service.
What is the purpose of this video script?
-The purpose of this video script is to explain evaluation methods for language models and the importance of A/B testing, and to make these concepts easier to understand through a real-world example.
Outlines
🤖 Part 7 of the LangSmith evaluation series from LangChain
This section introduces the seventh video in the LangSmith evaluation series. The first video explained why evals matter, and the second covered LangSmith Primitives. The third video showed how to create manually curated datasets, and the fourth showed how to build datasets from user logs. The fifth explained how to use the built-in LangChain evaluators on a dataset, and the sixth discussed custom evaluators. This video focuses on a real use case: comparing the performance of Mistral 7B and GPT-3.5 Turbo.
🔍 Comparative analysis with A/B testing
This section explains how to run a comparative A/B test. The concrete steps are creating a dataset, using the built-in CoT QA evaluator, running the A/B test, and comparing the results. In the Mistral 7B vs. GPT-3.5 Turbo comparison, the key points are how Mistral 7B behaves when running locally and how the two models' performance stacks up. When analyzing the results, statistics such as the correctness score and latency are the important indicators.
Keywords
💡Evals
💡LangSmith Primitives
💡Dataset creation
💡User logs
💡Dataset evaluation
💡Custom evaluators
💡Comparative testing
💡Mistral 7B
💡GPT-3.5 Turbo
💡A/B testing
💡Quality
Highlights
Introduction to why evals are important and interesting in the context of the LangSmith evaluation series.
Discussion of LangSmith Primitives and their role in the evaluation process.
Demonstration of creating manually curated datasets based on a Databricks blog post.
Explanation of building datasets from user logs for apps in production.
Overview of various judges for datasets and the use of a built-in LangChain evaluator for question answering.
Presentation of a flow diagram and guidance on where to find more detailed information on the topics.
Real-world use cases and the need for comparisons, specifically asking how Mistral 7B running locally on a laptop compares to GPT-3.5 Turbo.
Note on using Ollama for the comparison, with instructions on obtaining the model.
Description of the setup for comparing Mistral running locally on a laptop with OpenAI.
Emphasis on the importance of quality comparison over speed in the local vs cloud model challenge.
Explanation of the developer-curated dataset of four examples on the Databricks blog post used as the basis for the evaluation.
Use of LLM-as-judge and the built-in CoT QA evaluator for the A/B test between GPT-3.5 Turbo and Mistral.
Description of the process and ease of conducting an A/B test with the new function on the LangSmith platform.
Showcase of the comparison mode and its ability to provide immediate statistics on the two runs.
Detail on the inspection of individual answers and the grading process, highlighting the differences between the outputs of Mistral and OpenAI.
Discussion of Mistral's incorrect answer about the context window and confirmation that the grader correctly marked it wrong.
Illustration of the power of comparative A/B testing for different prompts and LLMs, and the simplicity of conducting such tests on the LangSmith platform.
Summary of the introductory concepts covered in the video series and a teaser for deeper themes to be explored in future videos.
Transcripts
Hi, this is Lance from LangChain. This is the seventh video in our LangSmith evaluation series. Our first video gave some context as to why evals are interesting and important. The second video talked about LangSmith Primitives. Our third video showed how to create manually curated datasets; we built one based on this Databricks blog post. The fourth one showed how to build a dataset from user logs, so if you have an app in production and want to capture user questions and create a dataset from them, you can very easily do that; we talked through that. We then talked about various judges for datasets, so different types of evaluators: we showed how to use a built-in LangChain evaluator for question answering and applied it to our dataset, and we just talked through custom evaluators. So again, we've shown this flow diagram; go to those videos if you want to deep dive into those topics.

Now we're going to have a little bit of fun. This is where you get into very real-world use cases and needs: you often want to do comparisons. So let's ask a really practical question: how does Mistral 7B running locally on my laptop compare to GPT-3.5 Turbo on this little challenge we've set up? Again, remember we have a four-question eval set on this Databricks blog post, and we're comparing an open model, Mistral, versus GPT-3.5 Turbo. Just a little note here: I'm using Ollama for this. You can download it by going to ollama.com, you can do `ollama pull mistral` to get the model, and you can follow the instructions there.
So here's my setup. I'm going to create a new function that does the same thing we were already doing with OpenAI, but here I'm using Mistral, running locally on my laptop. It just means I can ask and answer questions about the particular blog post. So I just asked a question, and here we go: the answer is streaming out. Very good. It's obviously slower than OpenAI, exactly what we would expect, but what we really care about here is quality: how does it actually compare on this little challenge I built for myself?

So what are we actually doing here? We have a developer-curated dataset of four examples on this Databricks blog post. I'm using LLM-as-judge, again remember I'm using the built-in CoT QA evaluator, and I have ground-truth answers for every question. I'm doing an A/B test between GPT-3.5 Turbo and Mistral running locally on my laptop. So that's the setup, and it's pretty easy. Remember, we've already built or defined this dataset, dbrx, and we've already used this evaluator, cot_qa, so that doesn't change at all. All that changes is that I'm just passing in this new function.

Now let's go back and look at that function. It looks exactly like our old one with a few little twists: I'm using Ollama instead of OpenAI, but it's really the same output object, a dict with an answer. That's it, it's simple, and we just saw it work here.
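As a rough sketch of what such a target function might look like (the video doesn't show the exact code, so the function name, prompt wording, and the "question"/"answer" keys are assumptions):

```python
# Minimal sketch of a target function in the spirit of the one described above.
# Assumes Ollama is serving Mistral locally; names and prompt are hypothetical.
from langchain_community.chat_models import ChatOllama

llm = ChatOllama(model="mistral", temperature=0)

# Paste (or load) the Databricks DBRX blog post text here.
blog_post_text = "..."

def answer_dbrx_question_ollama(inputs: dict) -> dict:
    # Keep the same output shape as the OpenAI version: a dict with an "answer".
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{blog_post_text}\n\n"
        f"Question: {inputs['question']}"
    )
    response = llm.invoke(prompt)
    return {"answer": response.content}
```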
So what I can do now is kick off the eval. This will take a little bit because it's running locally. I have an M2 Mac with 32 gigs of RAM, by the way, so that gives you some sense; I've heard of a lot of people having good results running Mistral 7B on far smaller machines, though, so it's a really nice open model to work with if you want. You can see it's still churning, streaming its answers out here, and now it's actually done, so it didn't take that long. It ran against my four questions.
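For reference, kicking off an experiment like this with the LangSmith SDK might look roughly like the sketch below; the dataset name and evaluator come from the walkthrough above, but treat the exact call as an illustration rather than the video's literal code.

```python
# Minimal sketch, assuming the langsmith SDK's evaluate() helper and the
# built-in "cot_qa" evaluator (LLM-as-judge against the ground-truth answers).
from langsmith.evaluation import evaluate, LangChainStringEvaluator

results = evaluate(
    answer_dbrx_question_ollama,            # the new local Mistral target function
    data="DBRX",                            # the existing four-question dataset
    evaluators=[LangChainStringEvaluator("cot_qa")],
)
```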
Now here's where it can get really interesting. Let's go over to my dataset. I can see that there are three runs here. This is our initial run, or experiment, you can think of it that way, with OpenAI. This second one had to do with a custom evaluator; we're not interested in that, as it was just a quick unit test rather than a proper LLM-based evaluation. And now here's our latest one.

Here's where it gets really interesting: I can click on this one and this one, so my Mistral run and my OpenAI run, and I can click this compare button, which opens up this compare mode, and you already see some nice statistics here. What I can see is the average score: the first run, which was OpenAI, indeed does quite a bit better in terms of score. Latency, exactly as we would expect: Mistral is slower by quite a bit, and here's a latency distribution and so forth. So you get some immediate statistics about the two runs.
Now, here's where I've done a lot of work in the past, and this is really the crux of A/B testing that's extremely useful, and why it's very helpful to do this inside LangSmith, where it's all captured for you; managing this all yourself can be quite painful. Here's my first question, here's the reference output, here's the output returned by Mistral, and here's the output returned by OpenAI. So I can actually look at those in detail. I can zoom in, look at the answers, and see that, hey, they look very similar here. That's really cool. And you can see my grader also assesses them both as correct, and again, as we've talked about, you can actually click on the evaluator run for each case to see why. But they look good.

Now here's where it gets a little interesting: it looks like my Mistral running locally did not receive a correct grade on this one. So let's actually look at it. What was the question? What is the context window of the DBRX model? Okay, so it's 32k tokens, right? What did Mistral think? It says the context window is 2048. So that is definitely wrong, and we would have to investigate as to why Mistral believed that to be the case; there could be many different reasons why it failed on that one. But indeed, our grader is correctly grading that as a wrong response. For fun, we can actually dig in and look at that particular grading trace, and we can see why: the student's answer is incorrect, the student states that the context window is 2048, and the context clearly says 32k. There you go, so the grader is doing the same thing we just did, and we can go through each one like this.

So this is a toy example, but it shows you a very important, useful concept: comparative A/B testing. You might want to compare different prompts or different LLMs, and this is a very nice way to do that. You can see it's extremely simple: we've just supplied our dataset name, so we're of course running against the same dataset. Typically I like to apply a different experiment prefix to enumerate the different experiments I'm running, so that's easy, and you can also capture that in metadata, by the way, so that's another way to differentiate your experiments.
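Continuing the sketch from earlier, tagging a run this way might look something like the following; the prefix and metadata values are illustrative, not the ones used in the video.

```python
# Same evaluate() call as in the earlier sketch, now tagged so experiments are
# easy to tell apart in LangSmith; prefix and metadata values are illustrative.
evaluate(
    answer_dbrx_question_ollama,
    data="DBRX",
    evaluators=[LangChainStringEvaluator("cot_qa")],
    experiment_prefix="mistral-7b-ollama",                  # appears in the experiment name
    metadata={"model": "mistral-7b", "runtime": "ollama"},  # another way to filter runs
)
# The GPT-3.5 Turbo run would pass its own prefix, e.g. "gpt-3.5-turbo".
```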
And I'm using the same grader, of course; I'm just modifying my function, which in this case just means swapping in Mistral for OpenAI. So again, this just shows you how to use this compare mode in LangSmith to do A/B testing really nicely. In this particular case we're comparing Mistral versus OpenAI. We can look at overall run statistics as well as granular answer-wise differences, we can inspect the grader as shown here, and we can look at the runs as shown here. So this gives you a very flexible, general place to do A/B testing across different parameters; in this case I used different LLMs. I've used this all the time for lots of different kinds of projects, and it is indeed quite useful; it's very nice to have this all managed in one place for you.

So we're going to be diving into some deeper themes after this; this is the final video of our introductory concepts. If we zoom all the way back out, what did we talk through? We talked through building your manually curated dataset, in this case running LLM-as-judge against the ground truth for A/B testing, so we went through that. We also talked through the same setup but with simple unit tests using custom evaluators. We talked through different approaches to dataset creation. We talked through LLM-as-judge with ground truth, so LLM-as-grader evaluation but with no A/B test; in that case you're just looking at a single model and evaluating it using LLM-as-judge. We talked about the information flow for evaluation, and we've talked about different ways to build datasets, from user logs and from manual curation.

And that's really it. This gives you the groundwork you need to do a lot of different things; building custom evaluators and A/B testing frankly covers a huge surface area of use cases. We're going to do some deep dives following this, so stay tuned for additional videos.
Thanks.
More Related Videos
Custom Evaluators | LangSmith Evaluations - Part 6
Pre-Built Evaluators | LangSmith Evaluations - Part 5
Manually Curated Datasets | LangSmith Evaluations - Part 3
Evaluation Primitives | LangSmith Evaluations - Part 2
Backtesting | LangSmith Evaluations - Part 19
Datasets From Traces | LangSmith Evaluations - Part 4