Why Evals Matter | LangSmith Evaluations - Part 1
Summary
TLDR: This video explains how to think about evaluations and how to implement them. When a new model is released, many public evaluations are reported for it. Evaluation can be broken down into four components: the dataset, the evaluator, the task, and how the results are interpreted. The video walks through examples such as HumanEval and Chatbot Arena, explaining how each is evaluated and how the results are interpreted. It also covers how to implement your own tests and evaluations, and introduces the LangSmith platform, which makes it easy to build datasets, define evaluators, and analyze results.
Takeaways
- 📝 This kicks off a series on how to think about evaluation and how to implement it
- 🔍 Evaluation can be broken down into four components: dataset, evaluator, task, and interpretation of results
- 👨‍💻 HumanEval is a dataset of programming problems published by OpenAI
- 📊 Results are typically reported as bar charts showing accuracy
- 🤖 Chatbot Arena is a dynamically generated dataset in which users compare LLM responses and pick the better one
- 🏆 Chatbot Arena is a comparative evaluation and reports statistics such as ELO scores
- 💡 There is growing interest in personalized tests and evaluations, with more people wanting to build their own benchmarks
- 🛠️ Datasets can be built by manual curation, from user interactions with an app in production, or by synthetic generation
- 👩‍🔬 Using an LLM as the evaluator enables reference-free as well as comparative (relative) evaluation
- 🔧 Evaluation methods range from simple assertion checks to more involved comparative evaluations (see the sketch after this list)
- 🔗 LangSmith is a platform that makes it easy to run evaluations, including creating datasets and defining evaluators
- 🌐 LangSmith can be used together with LangChain but does not require it, so it stays flexible
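As a point of reference for the "simple assertion checks" takeaway above, here is a minimal sketch (not from the video) of what an assertion-style check on an LLM application might look like; the function names and the example task are made up for illustration:

```python
# Minimal sketch of an assertion-style check (illustrative, not from the video).

def generate_answer(question: str) -> str:
    """Stand-in for your LLM application; replace with a real call."""
    return "Paris is the capital of France."

def test_answer_mentions_ground_truth():
    # Ground-truth reference answer for this question.
    question = "What is the capital of France?"
    ground_truth = "Paris"
    answer = generate_answer(question)
    # Simple assertion: the reference string must appear in the output.
    assert ground_truth.lower() in answer.lower()

if __name__ == "__main__":
    test_answer_mentions_ground_truth()
    print("assertion check passed")
```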
Q & A
What four components are useful for thinking about evaluation?
-The four components are the dataset, the evaluator, the task, and how the results are interpreted.
What are the public evaluations that get reported when a new model like Claude 3 is released?
-Public evaluations compare various LLMs and show their performance on different tasks and datasets.
What is HumanEval, and what evaluation method does it use?
-HumanEval is an evaluation produced by OpenAI in 2021 based on programming problems. Each coding problem has a ground-truth correct answer, and correctness is checked programmatically with unit tests.
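To make that HumanEval-style grading concrete, the sketch below shows the general idea of running model-generated code against hand-written unit tests and reporting a pass rate. It is a simplified illustration, not the official HumanEval harness, and the sample problem and completion are invented:

```python
# Sketch of HumanEval-style grading: run model-generated code against
# hand-written unit tests and count the pass rate. Simplified illustration only.

problems = [
    {
        "prompt": "def add(a, b):\n    \"\"\"Return the sum of a and b.\"\"\"\n",
        # Pretend this completion came from the model under test.
        "completion": "    return a + b\n",
        "tests": "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n",
    },
]

def grade(problem: dict) -> bool:
    program = problem["prompt"] + problem["completion"] + "\n" + problem["tests"]
    scope: dict = {}
    try:
        exec(program, scope)  # caution: only execute trusted or sandboxed code
        return True
    except Exception:
        return False

passed = sum(grade(p) for p in problems)
print(f"pass rate: {passed}/{len(problems)}")
```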
Outlines
🔍 Fundamentals of model evaluation and how to implement it
This paragraph introduces a short series, prompted by recent interest in evaluation, on the basics of how to think about evaluation and how to implement it yourself using LangSmith. Evaluation is framed in terms of four components: the dataset, the evaluator, the task, and how the results are interpreted. These components are then used to look at concrete evaluations of public models, such as HumanEval, a set of 165 programming problems produced by OpenAI, and dynamically generated benchmarks such as Chatbot Arena. Different approaches to evaluation and the growing importance of personalized tests and evaluations are also discussed.
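For reference, the ELO-style scores mentioned for Chatbot Arena are borrowed from chess rating systems; the standard Elo formula (not shown in the video) gives the expected probability that model A beats model B as

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}$$

where $R_A$ and $R_B$ are the two models' current ratings.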
🛠️ Implementing evaluations with LangSmith
The second paragraph introduces the LangSmith platform as the concrete way to implement evaluations. LangSmith is a platform built by the LangChain team to make running evaluations easy, providing a UI and SDK for building datasets, defining evaluators, and comparing and analyzing results. With LangSmith, users can easily implement their own evaluations and manage the various aspects of the evaluation process, such as versioning and editing datasets and using custom evaluators. The tool can be used together with LangChain but does not require it, so it remains flexible. Later videos in the series explain these concepts in more detail and show how to build your own evaluations from scratch.
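As a rough preview of what that workflow might look like in code, here is a minimal sketch using the LangSmith Python SDK: build a small dataset, define the task and a custom evaluator, and run an evaluation. Exact function signatures can differ across SDK versions, and the dataset name, stub application, and evaluator here are illustrative, not from the video:

```python
# Minimal sketch of the LangSmith evaluation workflow (assumed SDK usage;
# check the current LangSmith docs for exact signatures).
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()  # assumes a LangSmith API key is set in the environment

# 1. Build a small, manually curated dataset of question/answer pairs.
dataset = client.create_dataset("capital-cities-demo")  # hypothetical name
client.create_examples(
    inputs=[{"question": "What is the capital of France?"}],
    outputs=[{"answer": "Paris"}],
    dataset_id=dataset.id,
)

# 2. The task: the application under test (stubbed here).
def my_app(inputs: dict) -> dict:
    return {"answer": "Paris"}

# 3. A custom evaluator: compare the output to the reference answer.
def exact_match(run, example) -> dict:
    predicted = run.outputs["answer"]
    expected = example.outputs["answer"]
    return {"key": "exact_match", "score": int(predicted == expected)}

# 4. Run the evaluation; results can then be inspected in the LangSmith UI.
evaluate(my_app, data="capital-cities-demo", evaluators=[exact_match])
```

The later videos in the series walk through each of these steps properly; this is only meant to show the shape of the workflow.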
Keywords
💡 Evaluation
💡 Dataset
💡 Evaluator
💡 Task
💡 Interpretation of results
💡 Chatbot Arena
💡 ELO score
💡 LangSmith
💡 Unit test
💡 A/B testing
💡 Customized tests
💡 Improving evaluations
Highlights
Series introduction on evaluation from scratch and implementation using LangSmith.
Public evaluations reported on new models like Claude 3.
Evaluation as four pieces: dataset, evaluator, task, and interpretation of results.
HumanEval, a benchmark by OpenAI with 165 programming problems.
Evaluation method with ground truth and unit tests for correctness.
Chatbot Arena's dynamic evaluation through user interactions.
Comparative assessment in Chatbot Arena with human judges.
Interest in personalized testing and evaluation.
Building benchmarks and testing models with personalized datasets.
Categories of datasets: manually curated, user interaction, synthetically generated.
Evaluators can be humans, LLMs, or unit tests against ground truth.
Reference-free evaluation with LLMs as judges (see the sketch after this list).
Application of evaluations: unit tests, proper evaluations, and AB testing.
LangSmith platform for easy evaluation with UI and SDK.
LangSmith supports building datasets, versioning, editing, and custom evaluators.
LangSmith facilitates trace inspections, comparative analysis, and more.
LangSmith does not require LangChain but can be used together for flexibility.
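As referenced above, a reference-free LLM-as-judge evaluator grades outputs against general criteria (such as brevity) rather than against a ground-truth answer. The sketch below illustrates the idea using the OpenAI client; the model name, prompt, and grading scale are assumptions, not from the video:

```python
# Sketch of a reference-free LLM-as-judge evaluator (illustrative assumptions).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def judge_brevity(question: str, answer: str) -> int:
    """Ask an LLM to grade brevity on a 1-5 scale; no reference answer needed."""
    prompt = (
        "Grade the following answer for brevity on a scale of 1 (rambling) "
        "to 5 (concise). Reply with the number only.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; swap for whatever you use
        messages=[{"role": "user", "content": prompt}],
    )
    return int(response.choices[0].message.content.strip())

print(judge_brevity("What is the capital of France?", "Paris."))
```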
Transcripts
hi this is Lance from LangChain we've heard a
lot of interest from users on evaluation in
recent weeks and months and so we want
to kick off a short series laying out
kind of how to think about evaluation
from scratch and how to implement it
yourself using
LangSmith so to kind of set the stage when
new models are released like Claude 3 you
often see a bunch of public evaluations
reported so on the left here is the Claude 3
blog post showing a bunch of
different evals in the various rows and
compared to other popular LLMs in the
various columns you've also seen maybe
things like Chatbot Arena um which now
has Claude 3 Opus at the top but the
question here is like what are these
evaluations how to think about them and
how could I implement them
myself so maybe a nice way to think
about evaluation is just four pieces
there's a data set there's some kind of
evaluator there's a task and there's
some means of interpreting the results
so let's actually make this concrete and
look at the various evaluations that
have been run on some of these public
models um so HumanEval is a really good
one it was produced by OpenAI back in
I think 2021 it has 165 programming
problems so it's basically related to
the task of coding you can see that at
the bottom and what's interesting is the
evaluation method here you can think of
it in two ways what's like the judge of
who's actually judging the result and
like what's the mode of evaluation in
this case the mode is like ground truth
there's a ground truth correct answer
for every coding problem and using unit
tests some kind of programmatic way of
specifying correctness um interpretation
typically just reported as bar charts in
this case I'm showing some results from
the recent Databricks model um which
they report HumanEval on um but let's
look at another one so here's an
interesting kind of comparative example
on Chatbot Arena so in this case there's
actually no static data set this is more
dynamically generated from user
interactions and the way it works is a
user is presented with two different
LLMs they prompt them both and they
choose which response they like better
so it's more of like an arena or like
a battle format in that sense so again
in this case the judge is a human
the mode in this case is not really
ground truth so much as a comparative
assessment in terms of metrics they
often report these pairwise plots which
basically show one model versus all
other models and then the statistics
tell you the likelihood of one model
beating the other they roll these up
into things like ELO scores which kind
of tell you the likelihood of a model
beating another one um kind of taken
from chess so anyway you can see that
you can look at different evaluation
like benchmarks using these four
different kind of buckets and just group
them that way and think through them in
that way but we've kind of seen an interest
in personalized testing and evaluation
um so for example like of course models
are you know published with you
know hundreds of different public
benchmarks but people often want to
build their own benchmarks um and
kind of hack on and test models
themselves um we've actually seen some
interest in the community around this so
karpathy tweeted about one nice
benchmark from a scientist at Google
DeepMind um Will Dew from OpenAI
mentioned there's kind of a lot of
opportunity in better
evaluations so you know if you kind of
talk about those four buckets and
break them down a little bit there's
like a few different things to kind
of cover here there's a lot of surface
area for building your own evaluations
so when you think about data sets
there's a few categories like one is
manually curated like we saw with HumanEval
build a data set of question answer
pairs or like code solution pairs right
so they're like highly curated you define
it yourself another is like if you have
an app out in production you have a
bunch of user interactions with your app
you can roll those into a data set for
example of user logs and you can use LLMs
to synthetically generate data sets for
you so these are all really interesting
modes of data set kind of creation Now
in terms of evaluation we saw examples
of a human as a judge like in the case of
Chatbot Arena in that case with comparison
as the mode we talked about using like
unit tests or heuristics as the judge
against like a ground truth correct code
solution in the case of human eval you
can also use LLMs as judges and there's
a lot of cool work on this um LLMs as
judges can you know judge for general
criteria which you might think of as
reference free like there's no ground
truth but you give the LLM like I want
to assess a few different things like
you know brevity or something so it's
kind of like a reference free mode of
evaluation and of course an LLM can also
judge an answer relative to ground
truth
so the final piece here is thinking
about like how are these
typically applied you can think about a
few different categories unit tests
evaluations and AB testing so unit
testing is kind of like simple
assertions that check functionality these are
very routine in software engineering um
they can be run online to give an
application feedback it can be run offline
as part of for example CI or other kinds
of evaluation you can also have like
again like we talked about before a
proper evaluation with a judge in this
case it's not just like um you know
maybe a simple assertion in this case
maybe it's a more involved like human
feedback or LLM judge and again we talked
a little bit about human evaluation and
also LLM-based evaluation and then AB
testing this is just comparative
evaluations a popular one here is
regression testing looking over time um
or experimental testing in you know
different parameters so a question you might
ask is like well how do I actually get
started how do I implement some of these
ideas myself and this is kind of where
LangSmith comes in so the team at LangChain
built LangSmith which is a platform
that makes it very easy to run
evaluations um and so if you go to this
link here smith.langchain.com you'll be
prompted to sign up I've already signed
up so you can see this is just my like
LangSmith page we're going to talk about
this in a lot more detail in upcoming
videos but the point is LangSmith makes it
very easy to instrument various
evaluations we have a UI and SDK
for building data sets versioning them
editing them an SDK for defining uh your
own evaluators or implementing or using
custom evaluators um and we also have
the UI for trace
inspections comparative analysis and
we're going to kind of walk through in
in in a few different videos these ideas
kind of very carefully uh so you
understand all these pieces and how to
do each one from scratch and and of
course very importantly LangSmith does not
require LangChain to use um but of
course you can use it with LangChain so
that's an important point of flexibility
I want to highlight and in the upcoming
videos where we'll kind of be digging into each
one of these bins uh like really
carefully and like building up an
understanding from scratch as to how to build
your own evaluations thanks