How to Use LangSmith to Achieve a 30% Accuracy Improvement with No Prompt Engineering
Summary
TLDR: Harrison from LangChain explains that the team has released a blog post about how dosu, a code engineering teammate, improved its application's performance by 30% without any prompt engineering. The tool they used is LangSmith, a platform separate from the open-source LangChain that works with or without it. LangSmith combines a set of tools aimed at improving an application's data flywheel, including logging and tracing of the data that flows through the application, testing, evaluation, a prompt hub, and human annotation queues. The tutorial tackles a classification task, using the OpenAI client directly to classify the type of an issue. It shows how to leave feedback associated with runs in LangSmith, and then walks through using that feedback to improve the application's performance. dosu also used semantic search to supply the examples most similar to the current input. Classification is a simple example, but the same approach can be applied to more complex tasks.
Takeaways
- 🚀 Harrison from LangChain introduces a blog post about how dosu improved its application's performance by 30%.
- 🛠️ dosu used LangSmith, a platform that integrates logging, tracing, testing, and evaluation in one place.
- 🔍 LangSmith combines features aimed at improving an application's data flywheel.
- 📈 To achieve the 30% improvement on a classification task, dosu used LangSmith's features to collect feedback and improve the application.
- ⚙️ The tutorial starts with setting environment variables and shows how to log data to a LangSmith project.
- 📝 The application runs a classification task using the OpenAI client directly and traces it with LangSmith.
- 🔁 The feedback feature lets you attach corrections when a run produces the wrong result.
- 🔢 Automation rules in LangSmith move runs that have feedback into a dataset.
- 🔧 Pulling good examples out of the dataset and into the application improves its performance.
- 🤖 The model learns from previous patterns and can classify new inputs correctly.
- 🔍 dosu used semantic search over examples to select the ones most similar to the current input.
- 📚 The video shows how to build a feedback loop that continuously improves an application's performance.
Q & A
How did dosu improve its application's performance?
-dosu improved its application's performance by 30% without any prompt engineering, making heavy use of the tools LangChain has built over the past few months.
What is LangSmith?
-LangSmith is a platform separate from LangChain that can be used with or without it. It provides a combination of features aimed at improving an application's data flywheel.
What makes LangSmith powerful?
-Its power comes from the fact that logging, tracing, testing, evaluation, the prompt hub, and human annotation queues are not separate tools; they are all integrated into a single platform.
What task does the tutorial use to reproduce dosu's 30% improvement?
-The task is classification, a relatively simple task by LLM standards.
What prompt template does the application use?
-The prompt template asks the model to classify the type of the issue as one of the following topics: bug, improvement, new feature, documentation, or integration.
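As a rough illustration (the wording below is approximate, not the exact template from the video), the template can be a simple Python format string:

```python
# Illustrative template; the exact wording in the video may differ.
PROMPT = (
    "Classify the type of the issue as one of the following topics: "
    "bug, improvement, new feature, documentation, integration.\n\n"
    "Issue: {issue_text}"
)

print(PROMPT.format(issue_text="fix bug in LCEL"))
```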
How do you leave feedback in LangSmith?
-LangSmith lets you leave feedback associated with a run. The run ID (a UUID) is created up front and passed in, so that feedback can be attached to that run later.
What rules are set up to move data into the dataset?
-Two rules move runs with feedback into the dataset: one matching runs whose feedback has a user score of one (meaning the classification was correct), and one matching runs with a "correction", which adds the corrected value to the dataset instead of the original output.
Do you have to wait for the rules to trigger?
-Yes. By default the rules run every five minutes, and they only apply to data points logged after they were created, so after setting them up you rerun the same data points and wait for the rules to pick them up.
What data points are used to improve the application's performance?
-The examples pulled from the LangSmith dataset of correct values, which includes the corrected values derived from feedback.
How was the application's prompt template changed?
-Two lines were added: "here are some examples", followed by a placeholder for the examples. Feeding in these few-shot examples lets the model learn from previous patterns and tailor its responses to new inputs.
What is the semantic search that dosu performed?
-dosu had logged hundreds of data points of good and corrected feedback to LangSmith. Rather than selecting five or ten examples at random, they embedded the examples and the incoming input and selected the examples most similar to the current input. The rationale is that similar inputs should have similar outputs, and the logic that applies to those inputs should be similar to the logic that applies to the new input.
How can the process described in this tutorial be applied to other, more complex tasks?
-The tutorial applies the process to classification, a relatively simple example, but the LangChain team believes the same concepts are relevant to more complex tasks and is excited to try them out.
Outlines
🚀 A case study in improving application performance
Harrison from LangChain describes how dosu improved its application's performance by 30%. They used LangSmith, a platform separate from the open-source LangChain. LangSmith provides a set of tools for improving an application's data flywheel, including logging, tracing, testing, evaluation, a prompt hub, and human annotation queues. dosu improved a classification task by using LangSmith's feedback features.
🔍 Feedback and automation in LangSmith
This section explains how to attach feedback to runs in LangSmith and use it to build a dataset that improves the application's performance. Rules are set up for both positive and corrective feedback to add data to the dataset, and the section shows how to wait for the rules to trigger and then inspect the resulting dataset.
📈 Pulling feedback back into the application
This section explains how to use the collected correct outputs to improve performance. With concrete code, it shows how a string of alternating inputs and outputs is inserted into the prompt so the model can learn from previous patterns and generalize. It also shows how, as with GitHub issues, you keep leaving feedback on new kinds of questions to keep improving the application.
🔧 Building and applying the feedback loop
This section explains how to build a feedback loop by capturing runs and feedback in LangSmith and saving them as datasets. Automations create a dataset of good examples, which is pulled back into the application to improve performance. The same concepts can be applied beyond classification to more complex examples.
Keywords
💡LangChain
💡dosu
💡LangSmith
💡Data flywheel
💡Tutorial
💡Classification
💡Feedback
💡Automation
💡Dataset
💡Prompt template
💡Embedding-based search
Highlights
Dosu, a code engineering teammate, improved application performance by 30% without any prompt engineering.
They utilized tools developed at LangChain over the past few months.
LangSmith, a platform separate from LangChain, was used to improve the application's data flywheel.
LangSmith combines logging, tracing, testing, evaluation, a prompt hub, and human annotation queues in one platform.
A tutorial is provided to teach others how to replicate the 30% performance increase.
The task Dosu focused on was classification, a relatively simple task by LLM standards.
They used OpenAI directly without using LangChain for the classification task.
The application uses a prompt template to classify issues into topics like bug, improvement, new feature, etc.
Incorporating feedback from users is crucial for improving the application's performance.
LangSmith allows leaving feedback associated with runs to collect data over time.
The feedback can be used to create a data set of good examples to improve the application.
Automations can be set up in LangSmith to move data with feedback into a data set.
The data set of good examples can be pulled into the application to enhance its performance.
Using few-shot examples from the data set allows the model to learn from previous patterns.
Semantic search over examples can find the most similar examples to the current input.
Passing in a few relevant examples instead of many random ones can improve the application's performance.
Building a feedback loop by capturing feedback, creating data sets, and using examples can significantly enhance application performance.
The same concepts demonstrated for classification can be applied to more complex examples.
LangChain is eager to explore these concepts further and help others implement them.
Transcripts
Hi, this is Harrison from LangChain. Today we released a blog about how dosu, a code engineering teammate, improved some of their application performance by 30% without any prompt engineering, using a lot of the tools that we've built at LangChain over the past few months. So in this video I want to walk through roughly how they did that, and walk through a tutorial that will teach you how you can do it on your application as well.

Specifically, what they used was LangSmith. LangSmith is our separate platform: it's separate from LangChain the open-source library, it works with and without LangChain, and in fact dosu doesn't use LangChain at all, but they do use LangSmith. What LangSmith is, is a combination of things aimed at improving the data flywheel of your application. It generally consists of a few things: logging and tracing of all the data that goes through your application, testing and evaluation (Lance is doing a whole great series on that right now), a prompt hub, and human annotation queues. But the real power of LangSmith comes from the fact that these aren't all separate things; they're all together in one platform, so you can set up a really nice flywheel of data to start improving the performance of your application. So let's see what exactly that means.

There's a tutorial that we put together that walks through, in similar steps, some of the same things that dosu did to achieve a 30% increase. The task they did it for was classification, which is a relatively simple task by LLM standards, but let's take a look at what exactly it involves.
So we're just going to walk through the tutorial. The first thing we're going to do is set up some environment variables; this is how we're going to log data to our LangSmith project, which I'm going to call "classifier demo". Set that up, let me restart my kernel to clear all previous state, and now set that up.
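A minimal sketch of that setup, assuming the standard LangSmith tracing environment variables (the API key is a placeholder):

```python
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"          # turn on tracing to LangSmith
os.environ["LANGCHAIN_API_KEY"] = "<your-api-key>"   # placeholder
os.environ["LANGCHAIN_PROJECT"] = "classifier demo"  # the project name used in the video
```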
Awesome. So this is the simple application that mimics some of what dosu did. If we take a look at it, we can see that we're using OpenAI (we're not even using LangChain, just the OpenAI client directly) and we're basically doing a classification task. We've got this f-string prompt template that says to classify the type of the issue as one of the following topics, with the topics listed up here: bug, improvement, new feature, documentation, or integration. We then put in the issue text, and then we really just wrap this in the LangSmith traceable decorator, which will trace things nicely to LangSmith. And this is our application.
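A minimal sketch of what such an application can look like, using the OpenAI client directly and the `traceable` decorator from the LangSmith SDK; the template wording and model name are assumptions, not the exact ones from the video:

```python
from langsmith import traceable
from openai import OpenAI

openai_client = OpenAI()

# Wording of the template is approximate.
PROMPT = (
    "Classify the type of the issue as one of the following topics: "
    "bug, improvement, new feature, documentation, integration.\n\n"
    "Issue: {issue_text}"
)

@traceable  # traces each call (inputs, outputs, latency) to the LangSmith project
def classify_issue(issue_text: str) -> str:
    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",  # model name is an assumption, not stated in the video
        messages=[{"role": "user", "content": PROMPT.format(issue_text=issue_text)}],
    )
    return response.choices[0].message.content
```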
that it does some classification steps
so if I paste in this issue fix bug in
lell I would expect this to be
classified as a bug and we can see
indeed that it is um and if I if I do
something else like let's do H like fix
bug in documentation so this is slightly
trickier because it touches on two
concepts it touches on bug and it
touches on documentation now in the
Linkin repo we would want this to be
classified as a documentation related
issue but we can see that off the bat
our prompt template classifies it as a
bug adding even more complexity in here
the fact that we want it classified as
documentation is something that's maybe
a little bit unique to us if if pantic
or some other project was doing this
maybe they would want this to be
classified as a bug and so Devon at dosu
has a has a really hard job of of trying
to build something that'll work for both
us and pantic and part of the the way
that he's able to do that is by starting
to incorporate some feedback from us as
and users into his applic
So one of the things that you can do in LangSmith is leave feedback associated with runs. For this first run, that gets a positive score. If we run this again, notice one of the things we're doing is passing in this run ID. This run ID is basically a UUID that we're passing in, and the reason we're creating it up front is so that we can associate feedback with it over time. So if we run this, then create our LangSmith client and create the feedback associated with this run: this is a pretty good one, so we can assume it's been marked as good. We've collected this in some way; if you're using something like the GitHub interface, that might mean the users don't change the label, so they think it's good. So we'll mark this with a user score of one, using the run ID that we created above and passed in; we're using this to collect feedback.

Now we've got this follow-up, "fix bug in documentation", which creates the wrong label, and we can leave feedback on that as well. So we can call this create feedback function, and notably we're leaving a correction. This key can be anything (I'm just calling it "correction" to line up), but instead of passing in a score as we did above, I'm passing in this correction value. The correction value is a first-class citizen in LangSmith that denotes what the corrected output of a run should be, and in this case it should be "documentation". Let's assume I've gotten this feedback somehow; maybe as an end user I corrected the label in GitHub to say documentation instead of bug. So let's log that to LangSmith.
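A sketch of this flow, assuming the `classify_issue` function above and the `langsmith_extra` mechanism the LangSmith SDK provides for passing a run ID into a `traceable` function; the feedback key names and the correction payload shape are assumptions:

```python
import uuid

from langsmith import Client

ls_client = Client()

# Create the run ID up front so feedback can be attached to this run later.
run_id = uuid.uuid4()
classify_issue("fix bug in LCEL", langsmith_extra={"run_id": run_id})

# Positive feedback: the end user accepted the label as-is.
ls_client.create_feedback(run_id, key="user_score", score=1)

# Corrective feedback on a second run: record what the output should have been.
correction_run_id = uuid.uuid4()
classify_issue("fix bug in documentation", langsmith_extra={"run_id": correction_run_id})
ls_client.create_feedback(
    correction_run_id,
    key="correction",
    correction={"output": "documentation"},  # payload shape is an assumption
)
```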
Okay, awesome. So this is generally what I set up in my code. I now need to do a few things in LangSmith in order to take advantage of this data flywheel, so let's switch over to LangSmith. I can see I've got this classifier demo project; if I click in, I can see the runs that I just ran. If I click into a given run, I can see the inputs, I can see the output, and I can click into feedback and see any feedback: here I can see the correction of "documentation". If I go to the run below, I can see that I've got a score of one, because this is the run whose input was "fix bug in LCEL". Okay, awesome. So I have this data and this feedback in here; let's start to set up some automation. What I'm going to want to do is move data that has feedback associated with it into a dataset.
I'm going to do that by clicking "add a rule". I'm going to call this "positive feedback", set a sampling rate of one, and add a filter where the feedback user score is one. One thing that's nice to do is preview what the filters you add to the rule are actually going to do. I can do that here: I can filter on a feedback user score of one and see that, when applied, this matches one run. So I can basically preview my filters here. I can now click "add rule", and it remembers that filter. Let's call this "positive feedback", and when I get this positive feedback I just want to add it to a dataset. Let me create a new one and name it "classifier demo"; it's going to be a key-value dataset, which basically just means dictionaries in, dictionaries out. Let me create this, and I've now got this rule. I am not going to click "use corrections" here, because remember, this is the positive feedback that I'm collecting. Okay, great, let's save that.

Now let's add another rule. Let's go back, remove this filter, and add another filter that instead matches runs that have corrections. So now I'm saying any time there are corrections (I can see the filter applied again), go here, add a rule, and let's call it "negative feedback". I'm going to add it to the same "classifier demo" dataset, and now I am going to click "use corrections", because when this gets added to the dataset I want it to use the correction instead of the original value. So let's save this, and now I've got two rules.
Awesome. Okay, so now I've got these rules set up. These rules only apply to data points and feedback that are logged after they are set up, so we basically need to rerun the same data points so that the rules can pick them up. Let's run this one (the one with positive feedback) and leave that feedback, and let's rerun this one (the one with the correction) and leave that correction. Now we basically need to wait for the rules to trigger; by default they run every 5 minutes. We can see that it's 11:58, almost 11:59, so this will trigger in about a minute; I'm going to pause the video and wait for that.

All right, we're back. It's just after noon, which means the rules should have run. The way I can check is by clicking on rules and going to the logs: I can see that one run was triggered by this rule, and if I go to the other one I can see again that one run was triggered by that rule. That's basically how I can tell whether these rules ran and which data points they ran over. Now that they've run, I can go to datasets and testing, search for "classifier demo", look in, and see that I have two examples. I have "fix bug in LCEL" with the output of bug, which is great; that's just the original output. And then I have the other one, "fix bug in documentation", with the new output of documentation, which is the corrected value. So what I'm doing is building up this dataset of correct values, and then what I'm going to do is use those data points in my application to improve its performance. So let's see how to do that.
We can go back to this nice little guide (it walks through the automations here), and now we've got some new code for our application, so let's pull it down and take a look at what's going on. We've got the LangSmith client, which we're going to need because we're now pulling down the examples in the dataset. I've got this little function that takes in examples and creates a string that I'm going to put into the prompt; it's just alternating inputs and then outputs, super easy, and that's honestly most of the new code. The rest is all the same code as before, except we change the prompt template: we add these two lines, "here are some examples", and then a placeholder for the examples, and we'll see how we use that later on. And now, inside this function, we're pulling down all the examples that are part of this "classifier demo" dataset. So I'm listing the examples that belong to this dataset; by default that returns an iterator, so I'm calling list on it to get a concrete list. I'm passing that list into the function I defined above, create example string, and then I'm formatting the prompt by passing in the examples variable as this example string.
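A sketch of the updated application, assuming the dataset's examples store the issue under an `issue_text` input key and the label under an `output` key (both key names are assumptions):

```python
from langsmith import Client, traceable
from openai import OpenAI

ls_client = Client()
openai_client = OpenAI()

def create_example_string(examples) -> str:
    # Build a string of alternating inputs and outputs, one pair per example.
    return "\n\n".join(
        f"Input: {e.inputs['issue_text']}\nOutput: {e.outputs['output']}"
        for e in examples
    )

PROMPT = (
    "Classify the type of the issue as one of the following topics: "
    "bug, improvement, new feature, documentation, integration.\n\n"
    "Here are some examples:\n\n{examples}\n\n"
    "Issue: {issue_text}"
)

@traceable
def classify_issue(issue_text: str) -> str:
    # list_examples returns an iterator; call list() on it to get a concrete list.
    examples = list(ls_client.list_examples(dataset_name="classifier demo"))
    prompt = PROMPT.format(
        examples=create_example_string(examples), issue_text=issue_text
    )
    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",  # model name is an assumption
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```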
All right, let's now try this out with the same input as before. If we scroll up, take the same input, "fix bug in documentation", and run it through this new method, we can see that we get back documentation. Notice that the input here is the same as before, so it's just learning that if it sees the exact same input, it should produce the same output. The thing we gain by using this as a few-shot example is that it can also generalize to other inputs. If we change this to something like "address bug in documentation", we can see that it's still classified as documentation. There are still these conflicting bug and documentation ideas, but it's learning from the example that this should be documentation.

Okay, so does this fix all issues? No. Let's try some things, like "make improvement in documentation": is this going to be classified as an improvement or as documentation? It's classified as improvement, and we probably want it classified as documentation, so one thing we can do is leave more feedback for it. This imitates exactly what would happen in real life with GitHub issues: you keep seeing new types of questions come in that aren't exactly the same as previous inputs (because obviously they're not), and you can start to capture that as feedback and use it as examples to improve your application. So we can create more feedback for this run, saying hey, we want this to be about documentation. That's a little bit about how we can start to capture these examples, use them as few-shot examples, and have the model learn from previous patterns about what it's seen.
The last cool thing that dosu did, which I'm not going to replicate in code but will walk through, is that they did a semantic search over the examples. What is this and why did they do it? They did it because they were getting a lot of feedback: they had hundreds of data points of good and corrected feedback that they were logging to LangSmith, and at some point it becomes too much to pass in hundreds or thousands of examples. What they wanted instead was to pass in only five or ten examples, but not five or ten random ones: they wanted to pass in the examples most similar to the current input. The rationale is that if you look for examples that are similar to the input, the outputs should also be similar-ish; the logic that applies to those inputs should be similar to the logic that applies to the new input. So basically what they did was take all the examples, create embeddings for all of them, then take the incoming input, create an embedding for that as well, and find the examples most similar to it. This is a really cool way to have thousands of examples but still only use five or ten in your application at any given point in time.
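The video doesn't show this step in code, but a minimal sketch could look like the following, assuming OpenAI embeddings and a plain cosine-similarity ranking in NumPy (the embedding model choice and the `issue_text` key name are assumptions):

```python
import numpy as np
from openai import OpenAI

openai_client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",  # embedding model choice is an assumption
        input=texts,
    )
    return np.array([d.embedding for d in response.data])

def most_similar_examples(issue_text: str, examples: list, k: int = 5) -> list:
    # Embed every example input and the incoming input, rank by cosine
    # similarity, and keep only the top k examples for the prompt.
    example_vecs = embed([e.inputs["issue_text"] for e in examples])
    query_vec = embed([issue_text])[0]
    sims = example_vecs @ query_vec / (
        np.linalg.norm(example_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    top = np.argsort(sims)[::-1][:k]
    return [examples[i] for i in top]
```

In practice you would embed the example inputs once and cache the vectors (or use a vector store) rather than re-embedding them on every call.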
Hopefully this is a nice overview of how you can start to really build the feedback loop: you can capture feedback associated with runs and store it in LangSmith, you can set up automations to move those runs (and sometimes their feedback as well) into datasets of good examples, and you can then pull those examples down into your application and use them to improve performance going forward. Doing this with classification is a relatively simple example; however, there are lots of more complex examples that we think these same exact concepts can be relevant for, and we're very excited to try those out. If you have any questions or want to explore this more, please get in touch; we'd love to help.