Llama 3 tool calling agents with Firefunction-v2

LangChain
18 Jun 2024 · 09:50

Summary

TLDR Lance from LangChain introduces Firefunction-v2, a new model from the inference engine Fireworks. The model specializes in tool calling and is reported to be faster and more cost-efficient than GPT-4o. Tool use extends a model's capabilities, giving it access to external tools such as APIs. Firefunction-v2, built on an open source base model, aims to stay strong at function calling while retaining the base model's abilities. In a hands-on test, Lance evaluates a SQL database agent, comparing Fireworks against GPT-4o and showing that Fireworks' tool calling holds up very well.

Takeaways

  • 🎉 Fireworks is a popular inference engine among LangChain users, hosting many open source models.
  • 🚀 Fireworks has released a new model, Firefunction-v2, which is reported to excel at tool calling and function calling.
  • 🔍 Firefunction-v2 is reported to match GPT-4o at function calling while being faster and cheaper.
  • 🛠️ Tool use extends a model's capabilities, giving it access to external tools; it is used mostly in combination with agents.
  • 🔑 An LLM can take natural language input and return the function name and arguments needed for an API or function call.
  • 📈 Firefunction-v2 is based on the Llama 3 Instruct model and aims to strengthen function calling while retaining the base model's capabilities.
  • 🧩 Although Firefunction-v2 is tuned for function calling, it is reported to remain strong on general tasks.
  • 📝 The video walks through concrete tool-calling examples with Firefunction-v2, such as fetching weather information.
  • 🔬 The video also presents a test case for evaluating a SQL database agent, comparing Fireworks against GPT-4o.
  • 📊 The test results suggest that Fireworks may perform on par with or better than GPT-4o at function calling.
  • 💡 The video concludes that Fireworks' function calling is very compelling and worth considering for applications that need function calling or tool use.

Q & A

  • What kind of inference engine is Fireworks?

    -Fireworks is a popular inference engine used by many LangChain users, and it hosts a large number of open source models.

  • What are the features of the Firefunction-v2 model?

    -Firefunction-v2 is very good at tool calling and function calling, and is reported to be faster and cheaper than GPT-4o.

  • What is the concept of tool use?

    -Tool use extends a model's capabilities by connecting the model to external tools, and it is almost always used with agents.

  • What does the process of binding tools to an LLM involve?

    -Binding a tool to an LLM means defining the tool's schema and associating the tool with the LLM, so that the LLM can produce the function name and arguments from natural language input.

  • What balance does Firefunction-v2 aim for?

    -Firefunction-v2 aims to retain the base model's capabilities while staying strong at function calling.

  • What is the Llama 3 Instruct model?

    -Llama 3 Instruct is the model Firefunction-v2 was built on; it is widely used and very strong.

  • What care was taken in Firefunction-v2's training process?

    -The training avoided overfitting narrowly to function calling and fine-tuned the model so it still performs well on general tasks.

  • How was Firefunction-v2 evaluated in a real test case?

    -A SQL database agent was used to evaluate Firefunction-v2's performance and compare it against GPT-4o.

  • How did Firefunction-v2 perform in the test case?

    -In a small-scale test, Firefunction-v2 performed on par with or better than GPT-4o.

  • How should users get started with Firefunction-v2?

    -Users can install the package and set an API key, then build agents that use function calling and tool use; a minimal setup sketch follows this list.

  • Where can Fireworks' documentation and setup instructions be found?

    -In the official documentation and in the accompanying notebook.
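
For orientation, here is a minimal setup sketch in Python. It assumes the langchain-fireworks integration package and a Fireworks API key; the model name is the release-candidate string quoted in the video and may differ for the final release.

    # pip install langchain-fireworks
    import getpass
    import os

    from langchain_fireworks import ChatFireworks

    # Authenticate against the Fireworks API; the key is read from the environment.
    if "FIREWORKS_API_KEY" not in os.environ:
        os.environ["FIREWORKS_API_KEY"] = getpass.getpass("Fireworks API key: ")

    # Model name as quoted in the video (a release candidate at the time).
    llm = ChatFireworks(
        model="accounts/fireworks/models/firefunction-v2-rc",
        temperature=0,
    )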

Outlines

00:00

🎉 Introducing Fireworks' new model, Firefunction-v2

Lance from LangChain introduces Fireworks, a popular inference engine among LangChain users, and its newly released model Firefunction-v2. The model specializes in tool calling and function calling, and is reported to be faster and cheaper than GPT-4o. Lance walks through the model's capabilities and explains how he will test it on a small agent evaluation challenge. He also covers the concept of tool use: the model gains access to external tools and infers the API or function name and arguments from natural language input.

05:01

🔧 Testing Firefunction-v2 and evaluating its practicality

Lance tests Firefunction-v2 hands-on and shows how to evaluate agent performance with a SQL database agent. The agent uses an LLM, orchestrated with LangGraph, to call tools in response to questions. Lance runs both Firefunction-v2 and GPT-4o and compares the experimental results: on this small-scale challenge, Firefunction-v2 performed on par with or better than GPT-4o. He concludes that Firefunction-v2 is well suited to function calling and tool use and is useful across a range of use cases.

Keywords

💡Fireworks

Fireworks is a popular inference engine used by many LangChain users. It hosts a large number of open source models, and the video covers its release of the new Firefunction-v2 model. Firefunction-v2 is reported to excel at tool calling and function calling while being faster and more cost-efficient than GPT-4o.

💡Tool Calling

Tool calling is a technique that extends a model's capabilities by connecting it to external tools. The concept is commonly used with agents, letting a model access APIs and other tools based on natural language input. The video explains how Firefunction-v2 implements this capability more efficiently and effectively.

💡LLM (Large Language Model)

An LLM, short for large language model, is an advanced AI model that can understand and respond to natural language. The video explains how an LLM parses natural language input and calls the appropriate tool or function, and in particular how Firefunction-v2 builds on LLM capabilities to improve tool calling.

💡LangChain

LangChain is a framework for building AI agents and is closely tied to Fireworks in this video. The video discusses using Firefunction-v2 through LangChain and shows how the model can improve the performance of LangChain agents.

💡Fine-tuning

Fine-tuning is the process of adapting a model to a specific task or dataset. The video explains how Firefunction-v2 was fine-tuned to strengthen its function calling ability while retaining the capabilities of the base model.

💡Llama 3 Instruct

Llama 3 Instruct is the base model that Firefunction-v2 is built on. It is an instruction-tuned, widely used, and very strong model; the video touches on its strengths and how Firefunction-v2 makes use of them.

💡Natural Language Input

Natural language input is text that reads the way a person naturally speaks or writes. The video explains the process by which an LLM understands natural language input and converts it into tool or function arguments.

💡API Key

An API key is a credential for accessing an external service or API. The video touches on setting an API key when using Fireworks models; the key authenticates your requests to the Fireworks service.

💡Agent Evaluation

Agent evaluation is the process of measuring an AI agent's performance. The video provides a worked example of evaluating a SQL database agent and shows how Firefunction-v2 performs in that evaluation.

💡SQL Database Agent

A SQL database agent is an AI agent that runs SQL queries for a given task. The video explains how this agent uses Firefunction-v2 to generate SQL queries from natural language input and retrieve information from the database.

Highlights

Fireworks is a popular inference engine with many open source models.

Introduction of a new model called Firefunction-v2, designed for tool calling.

Firefunction-v2 is reported to be competitive with GPT-4o in function calling but faster and less expensive.

Tool use expands a model's capabilities by connecting it to external tools.

Tool use is often combined with agents to give an LLM access to various tools and APIs.

The general flow of tool use involves defining a function, binding it to an LLM, and converting natural language requests into function arguments and names.

Firefunction-v2 is built on Llama 3 Instruct, a widely used model.

Previous efforts in fine-tuning for function calling focused on aggressive and narrow fine-tuning, leading to overfitting.

Firefunction-v2 aims to balance strong function calling with retaining the base model's capabilities.

Demonstration of setting up and using Firefunction-v2 with a fresh notebook.

Explanation of defining a tool with expected inputs and binding it to the LLM.

Showcasing the LLM's ability to take natural language input and output the correct function name and arguments.

Introduction of a real-world example using a SQL database agent from the LangSmith cookbooks.

Description of the SQL agent's architecture and its ability to decide which tool to use based on the question.

Evaluation of the SQL agent's performance using a set of reference answers.

Comparison of the agent's performance on Fireworks and GPT-4o, showing similar accuracy in a small-scale challenge.

Encouragement for users to test Fireworks in their applications for function calling and tool use.

Transcripts

Hey, this is Lance from LangChain. Fireworks is a popular inference engine that many users of LangChain have used; it hosts many open source models, and today they're releasing a new model that I'm really excited about, called Firefunction-v2. The main point is that it's a model that's very good at tool calling, or function calling. It's reported to be competitive with GPT-4o in terms of function calling, but faster and less expensive, so that has a lot of appeal. Today we're just going to show how it works and how to use it, and we're going to test it on a small agent evaluation challenge that I've set up.

Maybe first, to set the stage: what is tool use, and why is it interesting? I can zoom in a little bit here. Basically, tool use expands a model's capabilities by connecting it to external tools. It's used often, almost always, with agents: it's the ability to give an LLM access to different tools, like web search or various APIs, and have the LLM return to you both the payload (what actually needs to be passed to that API or function) and the name (which function to use), with the input derived from the user's natural language. So the general flow looks like this: I define some function and bind it to my LLM, then I input natural language. If the natural language is relevant to the function, the LLM, which is aware that the function exists, converts that natural language request into what's needed to run the function, namely the function arguments, as mentioned, and the function name. That's the big idea here.

Now let's talk a little about what they did, which I think is actually pretty notable and interesting. They built on Llama 3 Instruct, which is already a very strong and obviously very widely used model. They made an interesting point in their blog post: prior efforts to fine-tune for function calling focused on aggressive and narrow fine-tuning, and this is a classic problem we see, overfitting to certain benchmarks or challenges. You have a function calling benchmark, you overfit to it very strongly, and you get a model that's very, very good at function calling in that narrow context and may do well on the benchmark, but it doesn't generalize to other tasks. What they tried to do with Firefunction-v2, according to the blog post, is balance these two things: fine-tune it such that it is very strong at function calling but retains the capabilities of the base model, whereas the very aggressive fine-tuning of prior efforts often erases the model's native capabilities. So the aim here is to preserve instruction following as well as function calling, balancing these two worlds, and we'll go ahead and see how good it is. I think what they mentioned is very exciting, but let's go ahead and test it out.

I have a fresh notebook here, and basically all I have to do is this: import ChatFireworks. Of course, pip install the package first and make sure you've got that done. You run that, you can import here, and here's your model name; it's just going to be accounts/fireworks/models/firefunction-v2-rc. And of course, set your API key. These are all the fundamentals, and we'll share the documentation for setting up your Fireworks account, but here's where you set your model. Boom, we've done this, and now I can go down here.

Now let's define some tool. In my case, my tool is just going to be, let's call it a weather tool. Here's the input: I define the schema here, and the tool wants a location and a unit, so it wants a city and it wants the unit in Celsius or Fahrenheit. Those are the expected inputs for my tool, or for my function. Then I'm going to go ahead and define that as a function itself. This is an example of what that function could be; of course, in the real world this could be an API or some external service. So I define my function here, and now I just bind it to my LLM. There we go. Let's go ahead and call it with a query that's relevant: all I have to ask is "What's the weather like in San Francisco, in Celsius?" I make the call, and let's see. Okay, cool: you can see we basically get a tool call out of the LLM, which has the arguments formatted per my schema, and the correct function name, again as mentioned. So this is a basic example of how the LLM does appear to take natural language input, just like we show in our diagram, and output the function name and the arguments needed to run that function. That's really cool, but this is just showcasing general capability.
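
As a concrete sketch of the demo just described, assuming the ChatFireworks setup shown earlier: the tool name and its canned return value are illustrative, not the exact code from the notebook.

    from langchain_core.tools import tool

    @tool
    def get_current_weather(location: str, unit: str) -> str:
        """Get the current weather for a location, with unit 'celsius' or 'fahrenheit'."""
        # In the real world this would call an external weather API.
        return f"It is 20 degrees {unit} in {location}."

    # Bind the tool to the LLM so the model is aware of its schema.
    llm_with_tools = llm.bind_tools([get_current_weather])

    # A natural language query relevant to the tool.
    msg = llm_with_tools.invoke("What's the weather like in San Francisco, in Celsius?")

    # The model returns the function name and arguments formatted per the schema,
    # e.g. [{'name': 'get_current_weather',
    #        'args': {'location': 'San Francisco', 'unit': 'celsius'}, ...}]
    print(msg.tool_calls)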

Now let's actually show a real-world example of this. I put together this cookbook pretty recently; it's in our LangSmith cookbooks section related to agent evaluation, and we actually have an accompanying video coming out today if you really want to go in depth. What I'll mention very briefly is that this is about the ability to evaluate agent performance, and the example is a SQL database agent that we define in this cookbook. I set a few API keys for LangSmith, set my LangSmith key, and here's where I grab the database itself. I've already done this; you would just run this cell, and this is how you test that the database exists. Now we're going to define a SQL agent. The SQL agent will have an LLM, and we're using LangGraph to orchestrate this agent: our LLM receives a question and decides which tool to use; there's an edge that looks at the LLM's output and decides whether to return an answer or call a tool; if the LLM made a tool call, we go ahead and invoke the tool and return the result back, and this repeats in a loop. That's the architecture of our agent. Don't worry too much about the details; the main point is that we want to assess the capability of our new Fireworks LLM in this context.
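
Here is a minimal LangGraph sketch of that loop, assuming the prebuilt MessagesState and ToolNode helpers; the cookbook's real agent uses SQL-specific tools where the weather tool stands in below.

    from langgraph.graph import END, START, MessagesState, StateGraph
    from langgraph.prebuilt import ToolNode

    tools = [get_current_weather]  # stand-in for the cookbook's SQL tools
    model = llm.bind_tools(tools)

    def call_model(state: MessagesState):
        # The LLM receives the conversation so far and may emit a tool call.
        return {"messages": [model.invoke(state["messages"])]}

    def should_continue(state: MessagesState):
        # Edge logic: run the tool if one was called, otherwise return the answer.
        return "tools" if state["messages"][-1].tool_calls else END

    graph = StateGraph(MessagesState)
    graph.add_node("agent", call_model)
    graph.add_node("tools", ToolNode(tools))
    graph.add_edge(START, "agent")
    graph.add_conditional_edges("agent", should_continue)
    graph.add_edge("tools", "agent")  # loop until the LLM stops calling tools
    app = graph.compile()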

Here's where it gets interesting. What I can do is define the LLM we already talked about, this new Fireworks model, and I'm going to add some metadata for my experiment logging. This is all in the notebook. This is where we define a number of tools relevant to SQL; again, you can look at the notebook to see them. What we really want to look at here is testing the agent in a real-world context: does it do well or not? Here we define our SQL agent, our full agent, and this is what it's going to look like. We start, it goes to the LLM, in this case ChatFireworks, which makes a decision, potentially to call a tool; the tool is called, the tool call result gets returned, and this goes back and forth until the agent returns a natural language response with no tool call, and then we end. That's really it.

Now here's what we can do: we can evaluate the capability of this agent relative to some references. Again, this is our agent flow: the agent takes a question and returns an answer based on querying SQL, and we have a set of reference answers that I've already defined, right here. I can build a dataset of question-answer pairs; I'm going to name it "SQL agent response". This function wraps my agent (again, you can look at this notebook; we'll share it), and this is where I set my evaluators, which will grade my agent's responses against the reference. So there we go, and we can go ahead and run this.
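
A hedged sketch of that evaluation run with the LangSmith SDK: the dataset name comes from the video, while the wrapper, the dataset keys, and the exact-match grader below are illustrative placeholders for the versions defined in the cookbook notebook.

    from langsmith.evaluation import evaluate

    def predict_sql_agent_answer(inputs: dict) -> dict:
        """Wrap the compiled agent so dataset inputs map to a final answer."""
        # "input" is an assumed dataset key; the notebook defines the real schema.
        state = app.invoke({"messages": [("user", inputs["input"])]})
        return {"response": state["messages"][-1].content}

    def answer_evaluator(run, example) -> dict:
        # Simplistic stand-in grader: exact match against the reference answer.
        match = run.outputs["response"].strip() == example.outputs["output"].strip()
        return {"key": "answer_correct", "score": int(match)}

    results = evaluate(
        predict_sql_agent_answer,
        data="SQL agent response",        # dataset of question-answer pairs
        evaluators=[answer_evaluator],    # grades responses against the reference
        experiment_prefix="sql-agent-fireworks",
        num_repetitions=3,                # three replicates per experiment
        metadata={"model": "firefunction-v2-rc"},
    )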

We're going to run this on Fireworks, and we're also going to run it on GPT-4o so we can look at the differences. We'll run three replicates of these experiments, which gives us some confidence in the scoring. So I've run this command on both GPT-4o and Fireworks, and I can go over to my dataset in LangSmith and have a look. Here's my set of experiments against this dataset. If you go back to the notebook, here's where I set the experiment prefix; those prefixes are logged right here, so I can see here's GPT-4o, here's Fireworks, and you can see that I ran three repetitions of each. This is the aggregate score: it's saying 60% of the questions were answered correctly, in both cases, in aggregate. I can click on each of these and run a comparison, and what's neat is the comparison view: these are all of our questions, and each question has a reference answer. If I open this up, I can see the reference input, here's the question, here is the output, and here are my two experiments: this is Fireworks, this is GPT-4o. In this particular case, here is the mean score across those three replicates, and I can look at each one; here are the actual responses, and yeah, they're all correct. (I think I lost that view; you can open it back up. I was looking at this first one, and we were looking at the replicates, so yeah, 14 and 14.) That's the basic setup: I can really easily compare agent performance between GPT-4o and Fireworks in this particular test case. Looking at the aggregate scores: in this case they both get it correct; in this case they both get it wrong; in this third case it looks like Fireworks got it wrong but GPT-4o got it correct, which is interesting; in the fourth case it looks like they both got it correct; and in this fifth case it looks like GPT-4o got it wrong and Fireworks got it correct. In any case, I get it, this is a small-scale challenge, there are only five questions, but it gives you a sense that Fireworks' function calling does seem to be pretty good; at least in this small-scale test it looks like it's on par with GPT-4o. Of course, you should run testing yourself in your application, but I think this presents a very nice option for building agents, or for any use cases in which you want to use function calling or tool use. It looks pretty promising from what I can see, and I definitely encourage you to play with it. Thanks.
