Llama 3 tool calling agents with Firefunction-v2
Summary
TLDR: Lance introduces Firefunction-v2, a new model from the inference engine Fireworks. The model specializes in tool calling and is reported to be faster and more cost-efficient than GPT-4o, and it can be used alongside other open source models. Tool use expands a model's capabilities by giving it access to external tools such as API calls. Firefunction-v2 aims to retain the base model's capabilities while staying strong at function calling. In a hands-on test, Lance evaluates a SQL database agent to compare Fireworks against GPT-4o, showing that Fireworks' tool calling performs very well.
Takeaways
- 🎉 Fireworks is a popular inference engine among LangChain users, hosting many open source models.
- 🚀 Fireworks has released a new model, Firefunction-v2, reported to excel at tool calling and function calling.
- 🔍 Firefunction-v2 is reported to match GPT-4o on function calling while being faster and cheaper.
- 🛠️ Tool use expands a model's capabilities by giving it access to external tools, and it is almost always used in combination with agents.
- 🔑 An LLM can take natural language input and return the function name and arguments needed for an API or function call.
- 📈 Firefunction-v2 is built on the Llama 3 Instruct model, aiming to strengthen function calling while retaining the base model's capabilities.
- 🧩 Although Firefunction-v2 is tuned for function calling, it is reported to remain strong on general tasks.
- 📝 The video walks through concrete tool-calling examples with Firefunction-v2, such as fetching weather information.
- 🔬 The video also presents a test case that evaluates a SQL database agent, comparing Fireworks against GPT-4o.
- 📊 The test results suggest Fireworks may match or even exceed GPT-4o on function calling.
- 💡 The video concludes that Fireworks' function calling is very compelling and worth considering for applications that need function calling or tool use.
Q & A
What kind of inference engine is Fireworks?
-Fireworks is a popular inference engine used by many LangChain users, and it hosts a large number of open source models.
What are the key features of the Firefunction-v2 model?
-Firefunction-v2 is reported to be very good at tool calling and function calling, and to be faster and cheaper than GPT-4o.
What is the concept of tool use?
-Tool use expands a model's capabilities by connecting it to external tools, and it is used in almost every agent.
What does binding tools to an LLM involve?
-Binding tools to an LLM means defining each tool's schema and attaching it to the LLM, so that the LLM can generate the function name and arguments from natural language input.
What balance does Firefunction-v2 aim to strike?
-Firefunction-v2 aims to stay strong at function calling while retaining the capabilities of the base model.
What is the Llama 3 Instruct model?
-Llama 3 Instruct is the model Firefunction-v2 was built on; it is widely used and very strong.
What did the Firefunction-v2 training process take care to avoid?
-The training avoided overfitting narrowly to function calling, fine-tuning the model so it still performs well on general tasks.
How was Firefunction-v2 evaluated in a real test case?
-A SQL database agent was used to evaluate Firefunction-v2's performance and compare it against GPT-4o.
How did Firefunction-v2 perform in that test?
-In a small-scale test, Firefunction-v2 performed on par with or better than GPT-4o.
How should users get started with Firefunction-v2?
-Users can install the package and set an API key, then build agents that use function calling and tool use.
Where can users find Fireworks documentation and setup instructions?
-In the official documentation and the accompanying notebook.
Outlines
🎉 Introducing Firefunction-v2, Fireworks' new model
Lance from LangChain introduces Fireworks, an inference engine popular with LangChain users, and its newly released model, Firefunction-v2. The model specializes in tool calling and function calling, and is reported to be faster and cheaper than GPT-4o. Lance walks through the model's features and shows how to test it on a small agent evaluation challenge. He also explains the concept of tool use: connecting a model to external tools and having it infer the API or function name and arguments from natural language input.
🔧 Testing Firefunction-v2 and evaluating its practical performance
Lance tests Firefunction-v2 hands-on, evaluating agent performance with a SQL database agent. The agent uses an LLM and LangGraph to call tools in response to questions. Lance runs the experiment on both Firefunction-v2 and GPT-4o and compares the results: in a small-scale challenge, Firefunction-v2 performed on par with or better than GPT-4o. Lance concludes that Firefunction-v2 is well suited to function calling and tool use, and useful across a range of use cases.
Keywords
💡Fireworks
💡Tool Calling
💡LLM (Large Language Model)
💡LangChain
💡Fine-tuning
💡Llama 3 Instruct
💡Natural Language Input
💡API Key
💡Agent Evaluation
💡SQL Database Agent
Highlights
Fireworks is a popular inference engine with many open source models.
Introduction of a new model called Firefunction-v2, designed for tool calling.
Firefunction-v2 is reported to be competitive with GPT-4o in function calling but faster and less expensive.
Tool use expands a model's capabilities by connecting it to external tools.
Tool use is often combined with agents to give an LLM access to various tools and APIs.
The general flow of tool use involves defining a function, binding it to an LLM, and converting natural language requests into function arguments and names.
Firefunction-v2 is built on Llama 3 Instruct, a widely used model.
Previous efforts in fine-tuning for function calling focused on aggressive and narrow fine-tuning, leading to overfitting.
Firefunction-v2 aims to balance strong function calling with retaining the base model's capabilities.
Demonstration of setting up and using Firefunction-v2 in a fresh notebook.
Explanation of defining a tool with expected inputs and binding it to the LLM.
Showcasing the LLM's ability to take natural language input and output the correct function name and arguments.
Introduction of a real-world example using a SQL database agent from the LangSmith cookbooks.
Description of the SQL agent's architecture and its ability to decide which tool to use based on the question.
Evaluation of the SQL agent's performance using a set of reference answers.
Comparison of the agent's performance on Fireworks and GPT-4o, showing similar accuracy in a small-scale challenge.
Encouragement for users to test Fireworks in their applications for function calling and tool use.
Transcripts
Hey, this is Lance from LangChain. Fireworks is a popular inference engine that many users of LangChain have used. It hosts many open source models, and they're releasing a new model today that I'm really excited about, called Firefunction-v2. The main point is that it's a model that's very good at tool calling, or function calling. It's reported to be competitive with GPT-4o in terms of function calling, but faster and less expensive, so that has a lot of appeal. Today we're going to show how it works, how to use it, and we're going to test it on a small agent evaluation challenge that I've set up.
So maybe first, to set the stage: what is tool use, and why is it interesting? I can zoom in a little bit here. Basically, tool use expands a model's capabilities by connecting it to external tools. It's used often, or almost always, with agents. That is the ability to give an LLM access to different tools, like web search or various APIs, and have the LLM return to you both the payload, what's actually necessary to pass to that API or function, as well as the name, which function to use, with the input derived from a natural language input from the user. The general flow looks like this: I have some function I define, I bind it to my LLM, and I input natural language. If the natural language is relevant to the function, the LLM is aware that this function exists, and it converts that natural language request into what's actually needed to run the function, namely the function arguments, as mentioned, and the function name. So that's the big idea here.
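That define-bind-convert flow can be sketched without any model at all. Everything below is a hypothetical, minimal stand-in: `fake_llm` plays the role of the tool-calling model, and the schema shape follows the common OpenAI-style function format, not Fireworks' exact API:

```python
import json

# A tool schema in the OpenAI-style format many providers accept.
weather_schema = {
    "name": "get_current_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["location", "unit"],
    },
}

def fake_llm(user_input: str, tools: list) -> dict:
    """Stand-in for a tool-calling LLM: maps natural language to a
    function name plus JSON arguments matching the bound schema."""
    if "weather" in user_input.lower():
        unit = "celsius" if "celsius" in user_input.lower() else "fahrenheit"
        # Hard-coded extraction; a real model infers this from the text.
        return {
            "name": "get_current_weather",
            "arguments": json.dumps({"location": "San Francisco", "unit": unit}),
        }
    return {"name": None, "arguments": None}

call = fake_llm("What's the weather like in San Francisco, in Celsius?",
                [weather_schema])
print(call["name"])                   # which function to call
print(json.loads(call["arguments"]))  # the payload for that function
```

The two pieces the stub returns, the name and the arguments, are exactly what the real model hands back in the demo below.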
Now let's talk a little bit about what they did, which I think is actually pretty notable and interesting. They built on Llama 3 Instruct, which is already a very strong model and obviously very widely used now. They made an interesting point in their blog post: prior efforts to fine-tune for function calling focused on aggressive and narrow fine-tuning, and this is a classic problem that we see, basically overfitting to certain benchmarks or challenges. You have a function calling benchmark, you overfit to it very strongly, and you get a model that's very, very good at function calling in that narrow context. It can do well on a benchmark, but it's not good at generalizing to other tasks. So what they tried to do with Firefunction-v2, according to the blog post, is balance these two things: fine-tune it such that it is very strong at function calling, but retain the capabilities of the base model, whereas the very aggressive fine-tuning of prior efforts often erases the native capabilities of the model. So again, the aim here is to preserve instruction following as well as function calling, balancing these two worlds, and we'll go ahead and see how good it is. I think what they mention is very exciting, but let's go ahead and test it out. So I have a fresh notebook here.
Basically, all I have to do is import ChatFireworks. Of course, pip install the package first; make sure you've got that done. So you pip install it, run that, and you can import it here. Here's your model name: it's going to be accounts/fireworks/models/firefunction-v2-rc. Of course, set your API key. Those are all the fundamentals, and we'll share the documentation on setting up your Fireworks account, but here's where you set your model. So, boom, we've done this.
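A sketch of that setup, assuming the `langchain-fireworks` integration package; the model ID below is a reconstruction of the release-candidate name spoken in the video, so check the current Fireworks and LangChain docs for the exact package and model names:

```python
# pip install langchain-fireworks
import os
from langchain_fireworks import ChatFireworks

os.environ["FIREWORKS_API_KEY"] = "..."  # your Fireworks API key

# Model ID as spoken in the video (an RC name; it may have changed since).
llm = ChatFireworks(
    model="accounts/fireworks/models/firefunction-v2-rc",
    temperature=0,
)
```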
Now let's define a tool. In my case, my tool is just going to be a weather tool. Here's the input: I define the schema here, and the tool wants a location and a unit, so it wants a city and a unit in Celsius or Fahrenheit. Those are the expected inputs for my function. Then I go ahead and define that as a function itself; this is an example of what that function could be, and of course in the real world it could be an API or some external service. So I define my function here, and now I just bind it to my LLM. There we go. Let's call it with a query that's relevant: basically, all I have to ask is, what's the weather like in San Francisco, in Celsius? I make the call and, okay, cool, you can see we get a tool call out of the LLM, with the arguments formatted per my schema, which is really cool, and the correct function name, again as mentioned. So this is a basic example of how the LLM takes natural language input, just like we show in our diagram, and outputs the function name and the arguments needed to run that function. Okay, so that's really cool; this is just showcasing general capability.
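A dependency-free approximation of that weather tool: the function name mirrors the demo, but the body and canned values are illustrative assumptions (in the real notebook the schema is bound to the LLM, which then produces these arguments):

```python
VALID_UNITS = {"celsius", "fahrenheit"}

def get_current_weather(location: str, unit: str) -> dict:
    """Toy weather tool. In the real world this would hit an external
    weather API; here it returns a canned response."""
    if unit not in VALID_UNITS:
        raise ValueError(f"unit must be one of {sorted(VALID_UNITS)}")
    # Canned temperature purely for demonstration.
    return {"location": location, "temperature": 18, "unit": unit}

# Dispatch the arguments a tool-calling model would hand back:
args = {"location": "San Francisco", "unit": "celsius"}
result = get_current_weather(**args)
print(result)
```

The validation step matters in practice: the model's arguments arrive as text, so checking them against the schema before dispatch catches malformed calls early.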
Now let's actually show a real-world example of this. So I actually put together this cookbook pretty recently; it's in our LangSmith cookbooks section related to agent evaluation, and we actually have an accompanying video coming out today if you really want to go in depth. What I'll mention very briefly is that this is about evaluating agent performance. This is actually a SQL database agent that we define in this cookbook. I set a few API keys for LangSmith, set my LangSmith key, and here's where I grab the database itself. I've already done this; you would just run this cell, and this is how you test and make sure the database exists. Now what we're going to do is define a SQL agent. The SQL agent will have an LLM, and we're using LangGraph to orchestrate the agent. Our LLM will receive a question and decide which tool to use. We'll have an edge that looks at the output and decides whether to return an answer or call a tool; if it's a tool call from the LLM, we'll go ahead and invoke it, return the result back, and this will repeat in a loop. So that's the architecture of our agent. Don't worry too much about this; the main point is that we want to assess the capability of our new Fireworks LLM in this context.
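The loop just described, the model deciding, an edge routing to either a tool call or a final answer, is what LangGraph wires up in the cookbook. A dependency-free sketch of just the control flow (all names and the stub model are hypothetical, not the cookbook's code):

```python
def run_agent(llm_step, tools, question, max_turns=10):
    """Minimal agent loop: call the model; if it asks for a tool,
    invoke the tool and feed the result back; otherwise stop."""
    messages = [("user", question)]
    for _ in range(max_turns):
        decision = llm_step(messages)           # model node
        if decision.get("tool") is None:        # conditional edge: finish
            return decision["answer"]
        observation = tools[decision["tool"]](**decision["args"])  # tool node
        messages.append(("tool", observation))  # loop back to the model
    raise RuntimeError("agent did not finish")

# Stub model: first query a (fake) SQL tool, then answer from its result.
def stub_llm(messages):
    if messages[-1][0] == "user":
        return {"tool": "run_sql", "args": {"query": "SELECT count(*) FROM tracks"}}
    return {"tool": None, "answer": f"The answer is {messages[-1][1]}"}

tools = {"run_sql": lambda query: 3503}  # canned database result
print(run_agent(stub_llm, tools, "How many tracks are there?"))
```

Swapping the stub for a real tool-calling model (Firefunction-v2 or GPT-4o) is the only change the experiment below makes; the loop itself stays the same.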
So here's where it gets interesting. What I can do is define that LLM we already talked about, this new Fireworks model, and add some metadata for my experiment logging; this is all in the notebook. This is where we define a number of tools relevant to SQL; again, you can look at the notebook to see this. What we really want to look at is testing the agent in a real-world context: does it do well or not? Here we define our SQL agent, basically our full agent, and this is what it's going to look like. So again, we start, it goes to the LLM, in this case ChatFireworks, which makes a decision, potentially to call a tool; the tool gets called, the tool call result gets returned, and this goes back and forth until the agent returns a natural language response with no tool call, and then we end. So that's really it. Now here's what we can do: we can actually evaluate the capability of this agent relative to some references. So again, this is our agent flow: the agent takes a question and returns an answer based upon querying SQL, and we have a set of reference answers that I've already defined, right here. I can build a dataset of question-answer pairs. I'm going to name this "SQL agent response", and this is going to wrap my agent; again, you can look at this notebook, we'll share it. This is setting my evaluators; this will grade my agent responses against the references. So there we go, and we can go ahead and run this. We're going to run it on Fireworks, and we're also going to run it on GPT-4o, so we can look at the differences. We're going to run three replicates of these experiments, which gives us some confidence in the scoring.
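The scoring just described, grading each answer against a reference and averaging over three replicates, reduces to simple arithmetic. A sketch with a made-up exact-match grader (LangSmith's evaluators do this for you, usually with a more forgiving comparison than exact match):

```python
def exact_match(predicted: str, reference: str) -> bool:
    """Toy grader: normalize and compare. A real evaluator is often an
    LLM-as-judge rather than a string comparison."""
    return predicted.strip().lower() == reference.strip().lower()

def aggregate_score(replicates):
    """Mean fraction correct across replicates; each replicate is a
    list of (predicted, reference) pairs over the whole dataset."""
    per_run = [
        sum(exact_match(p, r) for p, r in run) / len(run)
        for run in replicates
    ]
    return sum(per_run) / len(per_run)

# Five questions, three replicates, 3 of 5 correct in each run:
run = [("a", "a"), ("b", "b"), ("c", "c"), ("x", "d"), ("y", "e")]
print(aggregate_score([run, run, run]))  # close to 0.6, i.e. 60% correct
```

That 60% is the same kind of aggregate number the LangSmith experiment view reports for each model below.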
That'll go ahead and run now. So I've run this command on both GPT-4o and Fireworks, and I can go over to my dataset in LangSmith and have a look. Here's my set of experiments against this dataset. If you go back to the notebook, all the way up, here's where I set the experiment prefix; those prefixes are logged right here, so I can see here's GPT-4o, here's Fireworks, and you can see that I ran three repetitions of each. So I've run three replicates of each one of these, and this is the aggregate score: basically it's saying 60% of the questions were answered correctly in both cases, in aggregate. I can click on each of these and run a comparison, and what's neat is this comparison view: you can see these are all of our questions, and each question has a reference answer. If I open this up, I can actually see that: here's the reference input, here's the question, here's the output, and here are my two experiments, this is Fireworks, this is GPT-4o. In this particular case, here's the mean score across those three replicates, and I can look at each one. Here are the actual responses; yeah, they're all correct. I think I lost that view; you can open it back up. So I was looking at this first one, and we were looking at the replicates, so yeah, 14 and 14.
So that's kind of the basic setup here. I can really easily compare agent performance between GPT-4o and Fireworks in this particular test case, and here's what I see if you look at the aggregate score: in this case they both get it correct; in this case they both get it wrong; in this third case it looks like Fireworks got it wrong but GPT-4o got it correct, so that's interesting; in this fourth case it looks like they both got it correct; and in this fifth case it looks like GPT-4o got it wrong and Fireworks got it correct. In any case, look, I get it, this is a small-scale challenge, there are only five questions, but it gives you a sense that Fireworks' function calling does seem to be pretty good. At least in this small-scale test, it looks like it's on par with GPT-4o. Of course, you should run testing yourself in your own application, but I think this presents a very nice option for building agents, or any use case in which you want to use function calling or tool use. It looks pretty promising from what I can see, and I definitely encourage you to play with it. Thanks.