How to Use LangSmith to Achieve a 30% Accuracy Improvement with No Prompt Engineering
Summary
TLDR In this video, Harrison explains how to significantly improve application performance using LangSmith, a platform that is separate from LangChain but works with or without it. LangSmith helps users improve their applications by bringing logging, tracing, testing over data, and human annotation together in one place. Using a classification task as an example, Harrison shows how to set environment variables, run the classification with the OpenAI client directly, and use LangSmith to trace runs and collect feedback. Automation rules then move data points that have feedback into a dataset, and those data points are used as examples to improve the application. He also shows how semantic search can select the examples most similar to the current input to further boost performance. The whole walkthrough demonstrates a feedback loop: collect feedback, process it automatically, and apply it to continuously improve the application.
Takeaways
- 🚀 Harrison is from LangChain; they published a blog about how Dosu improved its application performance by 30%, without any prompt engineering, using tools built at LangChain.
- 🛠️ Dosu uses the LangSmith platform, which is separate from the LangChain open-source library and can be used on its own or together with LangChain.
- 🔍 LangSmith improves application performance through logging, tracing, testing, and evaluation of your data; its real power is that these capabilities are all integrated in one platform.
- 📈 The task Dosu improved through LangSmith was classification, a relatively simple task by large language model (LLM) standards.
- 📝 The tutorial starts by setting environment variables that route logged data into a LangSmith project.
- 🔗 Dosu uses the OpenAI client directly for the classification task rather than LangChain.
- 🔑 LangSmith lets you leave feedback on runs; the feedback is tied to a specific run ID so it can be collected over time.
- 🔄 LangSmith's data flywheel uses automation rules to move data points with feedback into a dataset.
- 📊 Automation rules can add both runs with positive feedback and runs with corrected negative feedback to the dataset.
- 🔧 After the rules are set up in LangSmith, the data points must be rerun to trigger them, since rules only pick up runs logged after they were created.
- ⏱️ Rules run every 5 minutes by default; the logs confirm whether a rule has triggered and which data points it ran over.
- 📚 The feedback and datasets collected in LangSmith can be used to improve the application, for example as few-shot examples that let the model learn patterns and generalize to other inputs.
- 🔍 Dosu also runs a semantic search to find the few examples most similar to the current input out of a large pool of examples, further improving performance.
Q & A
How did Dosu improve application performance using LangSmith?
-Dosu used the LangSmith platform to combine logging, tracing, testing, evaluation, and user feedback into a data flywheel, improving application performance by 30%.
How does LangSmith help Dosu improve its application?
-LangSmith brings logging, tracing, testing, and evaluation together in one platform, which lets users build a data flywheel that drives application improvements.
How do users leave feedback on a run in LangSmith?
-Feedback is associated with a run via a run ID that is created up front. With that run ID, users can leave positive or negative feedback on a specific run, including corrected labels.
How can LangSmith's automation features be used to improve an application?
-Automation rules move data points that have feedback into datasets. Those datasets can then be used in the application to improve its performance.
How does Dosu use the classification task in LangSmith?
-Dosu classifies issues by topic, such as bug, improvement, new feature, documentation, or integration, and uses LangSmith's tracing and feedback mechanisms to improve classification accuracy.
How are positive and negative feedback defined in LangSmith?
-In this setup, positive feedback is a user score of 1, meaning the user was satisfied with the result. Negative feedback is expressed through a correction value, meaning the result needed to be fixed.
How do LangSmith datasets improve the application's classification accuracy?
-Positive feedback and corrections collected in LangSmith are added to a dataset, and those data points are then used in the application as few-shot examples to improve the classifier.
How does Dosu use semantic search in LangSmith to optimize inputs?
-Dosu creates embeddings for all the examples, creates an embedding for the current input, and finds the most similar examples, passing only those to the model to improve performance.
How are LangSmith's automation rules triggered?
-Once set up, automation rules trigger automatically on a schedule based on their filters. For example, a rule can automatically add data points with positive feedback or corrections to a specific dataset.
How does LangSmith help Dosu handle large volumes of user feedback?
-LangSmith lets Dosu use automation and semantic search to pick the most relevant examples out of a large pool of user feedback and feed those examples into the application.
How are the feedback and datasets in LangSmith applied to the actual application?
-Through the LangSmith client, the collected feedback and dataset examples are pulled into the application and used as few-shot examples to improve the model's performance.
Is the classification task Dosu runs through LangSmith limited to simple tasks?
-While the classification task shown is relatively simple, the same LangSmith concepts and tools apply to more complex tasks and can improve performance across a range of applications.
Outlines
🚀 The secret to a 30% performance boost
Harrison explains how Dosu, a code engineering teammate, improved its application performance by 30% with no prompt engineering, using the LangSmith platform. LangSmith is a standalone platform that can be used with or without LangChain. It improves an application's data flywheel through logging, tracing, testing and evaluation, and human annotation, and because these features are integrated in one place, Dosu was able to build a data feedback loop that steadily optimizes application performance. The tutorial shows how to set environment variables, run a classification task with OpenAI, and use LangSmith for tracing and feedback.
🔍 LangSmith's data feedback loop
This section covers how to collect feedback with LangSmith and turn it into a dataset that optimizes the application. First, feedback is logged against runs: positive feedback on good runs and corrections on bad ones. Then automation rules are set up in LangSmith that trigger on a schedule and move runs matching specific feedback filters, along with their feedback, into a dataset. Those data points can then be used to improve the application continuously.
📈 Using the dataset to boost application performance
Once the dataset exists, its data points can be fed into the application as examples to improve performance. Concretely, the LangSmith client pulls the examples down from the dataset and formats them into part of the prompt template. The application then learns the patterns of previous inputs and outputs and classifies new inputs more accurately. Leaving more feedback continues to train and improve the application over time.
🔗 Building an effective feedback loop
The final section outlines how to build an effective feedback loop that keeps improving application performance: capture feedback associated with runs and store it in LangSmith, set up automation rules that move those runs and their feedback into a dataset, then pull examples from the dataset into the application. The process applies not just to classification but to more complex scenarios as well, and the author invites anyone interested to try it and reach out for help. A compressed sketch of this loop follows below.
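As a minimal sketch of that loop, assuming LangSmith's Python client (the key and dataset names follow the walkthrough later in the transcript):

```python
# A compressed sketch of the feedback loop; names like "user-score" and
# "classifier demo" are taken from the walkthrough below.
import uuid
from langsmith import Client

client = Client()

# 1. Create a run ID up front and attach feedback to it once a user reacts
#    (in the real flow, this ID is passed to the traced run first).
run_id = str(uuid.uuid4())
client.create_feedback(run_id, key="user-score", score=1)

# 2. Automation rules configured in the LangSmith UI (not in code) move runs
#    with this feedback, or with corrections, into a dataset.

# 3. Pull the dataset back into the application as few-shot examples.
examples = list(client.list_examples(dataset_name="classifier demo"))
```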
Mindmap
Keywords
💡LangChain
💡LangSmith
💡Data flywheel
💡Classification task
💡Feedback
💡Automation
💡Dataset
💡Few-shot learning
💡Semantic search
💡Embeddings
💡GitHub interface
Highlights
Harrison is from LangChain; they published a blog about how Dosu improved its application performance by 30% using tools built at LangChain, without any prompt engineering.
Dosu used the LangSmith platform, a tool separate from LangChain that can be used on its own, to improve the application's data flywheel.
LangSmith combines logging, tracing, testing, evaluation, and human annotation, all integrated in a single platform.
With LangSmith, you can set up a data flywheel and start improving application performance.
The tutorial shows how Dosu achieved the gains on a classification task, which is relatively simple by large language model (LLM) standards.
Dosu calls the OpenAI client directly for classification rather than using LangChain.
LangSmith's tracing logs runs cleanly and ties user feedback back into the application.
Dosu's challenge is building a system that works both for LangChain and for other projects such as Pydantic.
In LangSmith, feedback can be left on runs and associated with them.
A create-feedback call marks a run's result as good or bad, using the run ID to associate the feedback.
LangSmith automation rules move runs with feedback into datasets.
Rules can add runs with positive feedback, and runs with corrections, to the dataset.
The tutorial shows rerunning the data points so the automation rules can pick them up.
In LangSmith, you can inspect runs, inputs, outputs, and feedback, including corrections.
Using dataset examples as few-shot examples improves application performance.
Dosu uses semantic search to find the examples most similar to the current input, improving the application's accuracy.
The guide shows how to pull LangSmith data points into the application to improve performance.
Continuously collecting feedback and using it as examples lets the model learn and classify new inputs more accurately.
The approach isn't limited to classification; LangChain believes the same concepts apply to more complex tasks.
LangChain is excited about applying these concepts more broadly and is happy to help.
Transcripts
Hi all, this is Harrison from LangChain. Today we released a blog about how Dosu, a code engineering teammate, improved some of their application performance by 30% without any prompt engineering, using a lot of the tools that we've built at LangChain over the past few months. In this video I want to walk through roughly how they did that, and walk through a tutorial that will teach you how you can do it on your application as well.

Specifically, what they used was LangSmith. LangSmith is our separate platform: it's separate from LangChain the open-source library, it works with and without LangChain, and in fact Dosu doesn't use LangChain, but they do use LangSmith. LangSmith is a combination of things aimed at improving the data flywheel of your application. This generally consists of a few pieces: logging and tracing all the data that goes through your applications, testing and evaluation (Lance is doing a whole great series on that right now), a Prompt Hub, and some human annotation queues. But the real power of LangSmith comes from the fact that these aren't all separate things; they're all together in one platform, so you can set up a really nice flywheel of data to start improving the performance of your application. So let's see what exactly that means.

There's a tutorial that we put together that walks through, in similar steps, some of the same things Dosu did to achieve a 30% increase. The task they did it for was classification, which is a relatively simple task by LLM standards, but let's take a look at what exactly it involves. Walking through the tutorial, the first thing we're going to do is set up some environment variables; this is how we're going to log data to our LangSmith project, which I'm going to call "classifier demo". I'll set that up, restart my kernel to clear all previous state, and set that up again.
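As a minimal sketch, the setup might look like this (the variable names are LangSmith's standard tracing configuration; the project name comes from the video, and the keys are entered interactively here):

```python
# Minimal environment setup for tracing runs to LangSmith.
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"          # turn on tracing
os.environ["LANGCHAIN_PROJECT"] = "classifier demo"  # project runs are logged to
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangSmith API key: ")
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")  # for the client below
```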
Awesome. So this is the simple application that mimics some of what Dosu did. If we take a look at it, we can see that we're using OpenAI; we're not even using LangChain, we're just using the OpenAI client directly, and we're basically doing a classification task. We've got this f-string prompt template that says to classify the type of the issue as one of the following topics, with the topics up here: bug, improvement, new feature, documentation, or integration. We then put in the issue text, and then we just wrap this in the LangSmith traceable decorator, which traces things nicely to LangSmith. And this is our application.
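The video doesn't show the code verbatim, but a minimal reconstruction of an application like this might look as follows (the model choice and exact prompt wording are assumptions):

```python
# A minimal reconstruction of the classifier described above -- not the exact
# code from the video. @traceable logs each call as a run in LangSmith.
from openai import OpenAI
from langsmith import traceable

openai_client = OpenAI()

PROMPT_TEMPLATE = """Classify the type of the issue as one of the following topics:
bug, improvement, new feature, documentation, integration.

Issue: {text}

Topic:"""

@traceable(name="Classifier")
def classify_issue(text: str) -> str:
    # Format the f-string-style prompt and call the OpenAI client directly
    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model; the video doesn't specify one
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(text=text)}],
    )
    return response.choices[0].message.content

print(classify_issue("fix bug in LCEL"))  # expected: bug
```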
If we try it out, we can see that it does some classification. If I paste in the issue "fix bug in LCEL", I would expect this to be classified as a bug, and we can see that indeed it is. If I do something else, like "fix bug in documentation", this is slightly trickier because it touches on two concepts: bug and documentation. In the LangChain repo we would want this to be classified as a documentation-related issue, but we can see that off the bat our prompt template classifies it as a bug. Adding even more complexity, the fact that we want it classified as documentation is maybe a little bit unique to us; if Pydantic or some other project was doing this, maybe they would want it classified as a bug. So Devin at Dosu has a really hard job of trying to build something that will work for both us and Pydantic, and part of the way he's able to do that is by starting to incorporate feedback from us and end users into his application.
One of the things you can do in LangSmith is leave feedback associated with runs. For this first run, it gets a positive score. If we run this again, notice that one of the things we're doing is passing in a run ID. The run ID is basically a UUID that we're passing in, and the reason we create it up front is so that we can associate feedback with it over time. So we run this, then we create our LangSmith client and create the feedback associated with this run. This is a pretty good result, so we can assume it's been marked as good; we've collected this in some way. If you're using something like the GitHub interface, that might be: the user doesn't change the label, so they think it's good. So we'll mark this as user score one, and we're using the run ID that we created above and passed in; this is how we collect feedback.

Now we've got this follow-up, "fix bug in documentation", and it produces the wrong label. We can leave feedback on that as well, so we call this create feedback function again, and notably we're leaving a correction. The key can be anything; I'm just calling it "correction" to line up. But instead of passing in a score as we did above, I'm passing in a correction value, and the correction value is a first-class citizen in LangSmith for denoting what the corrected output of a run should be. Here it should be "documentation". Let's assume I've gotten this feedback somehow; maybe, as an end user, I corrected the label in GitHub to say documentation instead of bug. So let's log that to LangSmith.
like what I set up in my code I now need
to do a few things in Lang Smith in
order to take advantage of this data
flywheel so let's switch over to link
Smith I can see I've got this classifier
demo project if I click in I can see the
runs that I just ran if I click into a
given run I can see the inputs I can see
the output I can click into feedback and
I can see any feedback so here I can see
correction and I can see the correction
of documentation if I go to the Run
below I can see that I've got a score of
one because this is the input that was
fixed bug and lell and output of that
okay awesome so I have this data in here
I've got the feedback in here let's
start to set up some Automation and what
I'm going to want to do is I'm going to
want to move data that has feedback
associated with it into a data
set so I'm going to do that by I'm I'm
going to click add a rule I'm going to
call this posit positive feedback I'm
going to say sampling rate of one I'm
going to add a filter um I'm going to
add a filter of where feedback is is
user score is one um and I can see that
actually actually let me switch out my
view so I can see one thing I can one
thing that's nice to do is just preview
what the filters that you add to the
rule are actually going to do so I can
do that here I can go
filter feedback user score one and I can
see that when applied this applies to
one run so I can basically preview my
filters here I can now click add rule it
remembers that filter
let's call this positive feedback and if
I get this positive feedback I just want
to add it to a data set so I just want
to add it to a data set let me create a
new one let me name it uh
classifier demo um it's going to be a
key value data set which basically just
means it's going to be dictionaries in
dictionaries out and let me create
this and I've now got this rule um I am
not going to click use Corrections here
because remember this is the positive
feedback that I'm collecting okay great
let's save that now let's add another
rule let's go back here let's remove
this filter and let's add another filter
which is instead when it has Corrections
so now I'm saying anytime there's
corrections I can see the filter applied
again go here add
rule I can now uh let's call it
negative feedback I'm going to add it to
a data set let's call it classifier demo
and now I'm going to click use
Corrections cuz now when this gets added
to the data set I want it to basically
use the corrections instead of the True
Value so let's save this and now I've
got two rules
Awesome. Okay, so now I've got these rules set up. These rules only apply to data points and feedback that are logged after they're set up, so let's go back and rerun so the same data points are in there and the rules can pick them up. Let's run this one, the one with positive feedback, and leave that feedback. Let's rerun the other one, the one with negative feedback, and leave that correction. Now we basically need to wait for the rules to trigger; by default they run every 5 minutes. We can see that it's 11:58, almost 11:59, so this will trigger in about a minute; I'm going to pause the video and wait for that.

All right, we're back. It's just after noon, which means the rules should have run. The way I can see whether this happened is by clicking on rules and going to the logs. I can see the logs, and I can see that there was one run triggered by this rule; I can go to the other one and see again that there was one run triggered by that rule. That's basically how I can tell whether these rules ran and what data points they ran over.

Now that they've run, I can go to datasets and testing, search for "classifier demo", look in, and see that I have two examples. I have "fix bug in LCEL" with the output of bug, which is great; that's just the original output. And I also have the other one, "fix bug in documentation", with the new output of documentation, which is the corrected value. So what I'm doing is building up this dataset of correct values, and then I'm going to use those data points in my application to improve its performance. So let's see how to do that.
We can go back to this nice little guide; it walks through the automations, and now we've got some new code for our application, so let's pull it down and take a look at what's going on. We've got the LangSmith client, and we're going to need it for our application because now we're pulling down the examples in the dataset. I've got this little function that takes in examples and basically creates a string that I'm going to put into the prompt; it's just alternating inputs and then outputs, super easy, and that's honestly most of the new code. Everything else is the same code as before, except that we changed the prompt template: we added two lines, "here are some examples", and then a placeholder for the examples, and we'll see how we use that later on. Now, inside this function, we're pulling down all the examples that are part of this classifier demo dataset. I'm listing the examples that belong to the dataset; by default that returns an iterator, so I'm calling list on it to get a concrete list. I'm passing that list into the function I defined above, create example string, and then I'm formatting the prompt by passing in the examples variable as this example string.
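A sketch of that few-shot version, building on the earlier sketches (the exact input/output keys on the examples depend on how the runs were logged, so treat them as assumptions):

```python
# Few-shot classifier that pulls its examples from the LangSmith dataset
# populated by the automation rules; reuses openai_client, ls_client, and
# traceable from the sketches above.
def create_example_string(examples) -> str:
    # Alternate inputs and outputs into one block of text for the prompt
    return "\n\n".join(
        f"Input: {e.inputs['text']}\nOutput: {e.outputs['output']}" for e in examples
    )

FEW_SHOT_TEMPLATE = """Classify the type of the issue as one of the following topics:
bug, improvement, new feature, documentation, integration.

Here are some examples:
{examples}

Issue: {text}

Topic:"""

@traceable(name="Classifier")
def classify_issue_v2(text: str) -> str:
    # list_examples returns an iterator, so materialize it with list()
    examples = list(ls_client.list_examples(dataset_name="classifier demo"))
    prompt = FEW_SHOT_TEMPLATE.format(
        examples=create_example_string(examples), text=text
    )
    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model, as above
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(classify_issue_v2("fix bug in documentation"))  # expected: documentation
```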
All right, let's now try this out with the same input as before. If we scroll up and take the same input, "fix bug in documentation", and run it through this new method, we can see that we get back documentation. Notice here that the input is exactly the same as before, so it's just learning that if it gets the exact same input, it should produce the same output. The thing we gain by using this as a few-shot example is that it can also generalize to other inputs: if we change this to something like "address bug in documentation", we can see that it's still classified as documentation. There are still these conflicting bug and documentation ideas, but it's learning from the example that the answer should be documentation.

So does this fix all issues? No. Let's try out some things, like "make improvement in documentation": is this going to be classified as an improvement or as documentation? It's classified as improvement, and we probably want it classified as documentation. One thing we can do is leave more feedback for it, and this imitates exactly what would happen in real life with GitHub issues: you keep seeing new types of questions come in that aren't exactly the same as previous inputs, because obviously they're not, and then you can start to capture that as feedback and use those runs as examples to improve your application. So we can create more feedback for this run: hey, we want this to be about documentation. Great. That's a little bit about how we can start to capture these examples, use them as few-shot examples, and have the model learn from previous patterns of what it's seen.
The last cool thing that Dosu did, which I'm not going to replicate in code but will walk through, is a semantic search over examples. What is this and why did they do it? They did this because they were getting a lot of feedback: they had hundreds of data points of good and corrected feedback that they were logging to LangSmith, and at some point it becomes too much to pass in hundreds or thousands of examples. Instead, they wanted to pass in only five or ten examples, but not five or ten random ones: they wanted to pass in the examples most similar to the current input. The rationale is that if you look for examples that are similar to the input, the outputs should also be similar-ish, or the logic that applies to those inputs should be similar to the logic that applies to the new input. So what they did was take all the examples and create embeddings for them, then take the incoming input and create an embedding for that as well, and then find the examples most similar to it. This is a really cool way to have thousands of examples but still only use five or ten in your application at any given point in time.
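Dosu's implementation isn't shown, but a simple reconstruction of this example selection might look like the following (the embedding model and the `.inputs["text"]` key are assumptions):

```python
# Illustrative sketch of semantic search over few-shot examples -- not Dosu's
# actual implementation.
import numpy as np
from openai import OpenAI

openai_client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    # Any embedding model works; text-embedding-3-small is an assumed choice
    response = openai_client.embeddings.create(
        model="text-embedding-3-small", input=texts
    )
    return np.array([d.embedding for d in response.data])

def most_similar_examples(examples, query: str, k: int = 5):
    # Embed all example inputs, embed the incoming input, and take the
    # top-k examples by cosine similarity
    example_vecs = embed([e.inputs["text"] for e in examples])
    query_vec = embed([query])[0]
    sims = example_vecs @ query_vec / (
        np.linalg.norm(example_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    top_k = np.argsort(sims)[-k:][::-1]
    return [examples[i] for i in top_k]

# Usage: few = most_similar_examples(examples, "address bug in documentation")
```

In practice, the example embeddings would be computed once and cached rather than re-embedded on every request.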
Hopefully this is a nice overview of how you can start to really build the feedback loop: you capture feedback associated with runs and store it in LangSmith, you set up automations to move those runs (and sometimes their feedback as well) into datasets of good examples, and you then pull those examples down into your application and use them to improve performance going forward. Doing this with classification is a relatively simple example, but there are lots of more complex examples that we think these same exact concepts can be relevant for, and we're very excited to try those out. If you have any questions or want to explore this more, please get in touch; we'd love to help.