New Llama 3 Model BEATS GPT and Claude with Function Calling!?
Summary
TL;DR: In this video, the presenter explores the groundbreaking open-source Llama 3 model developed by Groq, which excels in function calling and challenges proprietary models like GPT. The video details a comparison between GPT and Llama 3 using an AI personal assistant for task management in Asana, demonstrating the impressive speed and accuracy of Llama 3. The presenter highlights the significance of this open-source model in promoting AI transparency and accessibility, marking a significant step forward for the community.
Takeaways
- The first open-source large language model to lead in function calling has been introduced by Groq, challenging proprietary models like GPT and Claude.
- Groq's Llama 3 models have achieved top rankings on the Berkeley Function Calling Leaderboard, with both the 70 billion and 8 billion parameter versions performing exceptionally well.
- The 70 billion parameter Llama 3 model reaches 90% accuracy, ranking it first on the leaderboard, while the 8 billion parameter version is only 1% less accurate, placing it third.
- The benchmarking for function calling is done through the Berkeley Function Calling Leaderboard, which aims to represent real-world use cases for large language models.
- The video demonstrates using Groq's Llama 3 model with an AI personal assistant developed in the AI Master Class series for task management in Asana.
- The video details a comparison between GPT and Llama 3, showing the code changes needed to use the new model for function calling tasks.
- The AI agent is designed to interact with Asana on behalf of the user to manage projects and tasks, using tools defined in the code.
- The video shows that the Llama 3 model generates responses notably faster than GPT, although it may require additional confirmation steps when executing function calls.
- The Llama 3 model successfully replicates the task management operations that GPT performs, including creating tasks, marking them as complete, and deleting tasks.
- The video highlights the potential of using local, open-source models as AI agents in workflows, emphasizing the importance of transparency and accessibility in AI.
- The video concludes by celebrating the success of the open-source Llama 3 model in performing function calling tasks almost as effectively as proprietary models, marking a significant advancement for open-source AI.
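Function calling, the capability being benchmarked throughout, boils down to the model emitting a structured tool call that application code then executes. A minimal self-contained sketch of that round trip (the tool, its name, and the JSON shape are illustrative assumptions, not the video's actual code, which talks to Asana):

```python
import json

# Hypothetical tool the model can call; in the video the tools talk to Asana.
def create_task(name: str, due_on: str) -> str:
    return f"Created task '{name}' due {due_on}"

TOOLS = {"create_task": create_task}

def handle_model_output(raw: str) -> str:
    """Dispatch a model response that encodes a tool call as JSON."""
    call = json.loads(raw)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# A response shaped like what a function-calling model might emit.
reply = '{"name": "create_task", "arguments": {"name": "Define scope", "due_on": "2024-07-26"}}'
print(handle_model_output(reply))
```

The leaderboard essentially scores how reliably a model produces the right call with the right arguments in scenarios like this.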
Q & A
What major milestone has been achieved in the field of AI language models?
-For the first time, the best large language model for function calling is an open-source model that can be run locally, breaking away from proprietary models like GPT or Claude.
Which company has developed their own version of Llama 3 for function calling?
-A company called Groq has developed their own version of Llama 3, specifically designed for high performance in function calling.
How does Groq's Llama 3 model perform on the Berkeley Function Calling Leaderboard?
-Groq's Llama 3 models, both the 70 billion parameter version and the 8 billion parameter version, are ranked highly on the Berkeley Function Calling Leaderboard, with the 70 billion parameter version being number one.
What is the significance of the 70 billion parameter version of Llama 3 being number one on the leaderboard?
-The 70 billion parameter version of Llama 3 achieving a 90% accuracy on the leaderboard is significant as it demonstrates its superior performance in function calling compared to other AI models.
How does the 8 billion parameter version of Llama 3 compare to other models in terms of accuracy?
-The 8 billion parameter version of Llama 3 is only 1% worse in overall accuracy compared to the 70 billion parameter version, making it a more efficient model in terms of size and performance.
What is the Berkeley function calling leaderboard and how is it used to benchmark AI models?
-The Berkeley function calling leaderboard is a tool used to benchmark AI models based on their performance in function calling. It evaluates models based on how they are used in real-world scenarios like agents and enterprise workflows.
What AI personal assistant is being used in the video to test the Llama 3 model?
-The AI personal assistant used in the video is one that the presenter has been developing in their AI Master Class video series, designed to help with task management.
How does the presenter plan to evaluate the effectiveness of the Groq Llama 3 model for function calling?
-The presenter plans to evaluate the Groq Llama 3 model by comparing it to another powerful model, GPT-4o, using the same AI agent for task management and observing their performance.
What tasks does the presenter assign to test the function calling capabilities of the AI models?
-The presenter assigns tasks such as creating a project in Asana, adding steps as tasks with due dates, marking tasks as complete, deleting tasks, and adding new tasks to test the function calling capabilities of the AI models.
What are the key differences in performance between GPT and the Groq Llama 3 model observed in the video?
-GPT handles multi-step tasks more smoothly, understanding and executing them without needing additional prompts. The Groq Llama 3 model generates responses faster but requires extra confirmation (such as an explicit date for "Friday") and occasional correction; it is still able to perform all the tasks, demonstrating its effectiveness as an open-source model.
What is the presenter's final verdict on the Groq Llama 3 models in comparison to GPT?
-While the presenter acknowledges that GPT is slightly better at handling a large number of tokens and executing tasks, they are impressed with the Groq Llama 3 models, especially considering they are open-source and perform almost as well as proprietary models like GPT.
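The Asana operations exercised in the tests above (create a project, add tasks with due dates, mark complete, delete) can be sketched as plain Python tool functions over an in-memory store. The real agent calls the Asana API instead, so everything below is an illustrative assumption, not the repo's code:

```python
# In-memory stand-in for Asana; the real agent would call the Asana API.
projects: dict[str, dict[str, dict]] = {}

def create_project(name: str) -> str:
    projects[name] = {}
    return f"Project '{name}' created"

def add_task(project: str, task: str, due_on: str) -> str:
    projects[project][task] = {"due_on": due_on, "complete": False}
    return f"Task '{task}' added, due {due_on}"

def complete_task(project: str, task: str) -> str:
    projects[project][task]["complete"] = True
    return f"Task '{task}' marked complete"

def delete_task(project: str, task: str) -> str:
    del projects[project][task]
    return f"Task '{task}' deleted"
```

An agent framework like LangChain would wrap each of these as a tool and hand the set to the model, which then decides which one to invoke for each user request.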
Outlines
Introduction to Groq Llama 3: A New Benchmark Leader
The speaker introduces the groundbreaking news that Groq's Llama 3, an open-source model for function calling, has become the best-performing model in this category, surpassing proprietary models like GPT. The blog post from Groq reveals that their 70 billion parameter version of Llama 3 leads the Berkeley Function Calling Leaderboard, with the smaller 8 billion parameter version also performing exceptionally well, coming in third place.
GPT vs Groq Llama 3: A Function Calling Showdown
The speaker sets up an experiment to compare GPT and Groq Llama 3 in task management. Using a task management AI agent, they ask GPT to list ten steps for creating an AI agent application, create a project in Asana, and add tasks for each step. The experiment demonstrates GPT's capability in invoking various tools and handling complex instructions seamlessly, with tasks created and managed efficiently in Asana.
Testing Groq Llama 3: Performance and Capabilities
The speaker transitions to testing the Groq Llama 3 model, noting its impressive response speed compared to GPT, albeit with some limitations. Despite slower tool invocation in certain tasks and needing additional prompts, Groq Llama 3 successfully handles the creation, modification, and deletion of tasks in Asana. This showcases the model's potential as a strong open-source alternative, with some areas for improvement. The speaker highlights the significance of these advancements for open-source AI and encourages viewers to explore using local models in their workflows.
Keywords
Open Source Model
Function Calling
Groq
Benchmarks
AI Personal Assistant
Asana
GPT
Parameter
Accuracy
Local Model
LangChain
Highlights
For the first time, the best large language model for function calling is an open source model.
The open-source model can be run locally, unlike proprietary models like GPT or Claude.
Groq, a company that builds infrastructure for AI, has developed their own version of Llama 3 for function calling.
Llama 3 is outperforming every other AI model in function calling benchmarks.
The 70 billion parameter version of Llama 3 is leading the Berkeley function calling leaderboard with 90% accuracy.
The 8 billion parameter version of Llama 3 is only 1% less accurate and ranks third on the leaderboard.
The Berkeley function calling leaderboard is a benchmark that represents typical use cases for function calling in AI.
The AI personal assistant developed in the AI Master Class video will be used to test the Llama 3 model's function calling capabilities.
The task management agent will be used to manage tasks in Asana, a task management software.
The testing will involve creating a project in Asana and adding tasks based on a list of steps provided by the AI.
GPT-4o will be used as a comparison model to evaluate the effectiveness of the Llama 3 model.
The code for the AI agent is available in a GitHub repo linked in the video description.
GPT successfully created a project in Asana and added tasks based on the provided steps.
The Llama 3 model generated responses faster than GPT, but required manual input for due dates when creating tasks in Asana.
The Llama 3 model successfully added, updated, and deleted tasks in Asana, demonstrating its function calling capabilities.
The Llama 3 model, while not as powerful as GPT, showed significant performance in function calling, especially for an open source model.
The success of the Llama 3 model is a major step forward for open source and local AI models, offering transparency and accessibility.
The demonstration shows that open source models can compete with proprietary models in practical applications like task management.
Transcripts
This week, history has been made: for the very first time ever, the best large language model for function calling is an open-source model that you can run locally. It's no longer a proprietary model like a GPT or Claude. Groq, an AI company that builds infrastructure to help you work with any local model, has recently developed their own version of Llama 3 which is specifically designed for function calling, and this thing is absolutely insane. It's crushing it on the benchmarks, beating every single AI model with function calling. And so today I'm going to show you guys exactly how to use this model, and we're going to do some testing to really see if this thing is as good as the benchmarks say it is.
All right, so here we have the blog post from Groq where they unveiled these Llama 3 models that have specifically been designed for high performance for function calling. Now, the first big question that I had when I heard about this, because honestly it seems too good to be true, is: how can they actually say that their version of the Llama 3 model is the best at function calling? The way that they're benchmarking this is with the Berkeley Function Calling Leaderboard, and we'll dive into this in just a second here. But one thing that I wanted to call out really quickly from this article: first of all, the 70 billion parameter version of their Llama 3 is number one on this leaderboard right now, which is really cool. It's got a 90% accuracy. I mean, that's the big deal right now. But one thing that I find even more interesting, honestly, is that the 8 billion parameter version of their Llama 3 is only 1% worse for overall accuracy, this much smaller model, and is number three on the leaderboard. So it's beating out all GPT models and every single Claude model except 3.5 Sonnet with function calling right now. 3.5 Sonnet, as you can kind of guess from what I just said, is number two on the leaderboard.

So we can actually go over and take a look at this Berkeley Function Calling Leaderboard right now. This is not updated with the Groq Llama 3 models at this point, but we had 3.5 Sonnet in first before the updates, and then GPT-4 and Claude 3 Opus, which is super cool. Now, just looking at this initially, it's a little vague, like, what do these accuracies and rankings really mean? If you want to, though, you can read up on everything that goes into this leaderboard and how they do their benchmarking with function calling. So just a little bit here: they are trying to be very representative of most users' use cases with function calling, and they call out things like agents and enterprise workflows. So they're really trying to model their evaluations based on how people actually use large language models for function calling. And so I've
spent some time diving into this, and it really does seem accurate. But what we're going to do now is actually dive into using this new Groq Llama 3 model for function calling and see how accurate it actually is. And so we're going to use the AI personal assistant that I've been developing in my AI Master Class video, and we're going to use it with this Groq Llama 3 model to see how well it can help me with my task management. So let's go ahead and dive into the comparison, first starting with GPT and then trying out this new Llama 3 model.

So in order to truly evaluate the effectiveness of this new Groq Llama 3 model for function calling, we need to compare it to another powerful model using the same AI agent. And so the model that I'm going with here is GPT-4o, and the agent that I'm going with is this task management agent, like an AI personal assistant, that I've been developing in my AI Agents Master Class series here on my channel. And so this agent helps me manage my tasks in Asana, which is my favorite task management software. There's a UI for this as well with Streamlit, and it uses a lot of cool tools like LangChain to build this up really nicely and easily. And so if you're curious about any of those things, you can check out other videos on my channel or in the Master Class series. But I'm just going to go over this code really quickly here, and then we'll dive into testing it out with GPT; then I'll show you how to change it to use the Groq Llama 3 model, and we'll test it out there as well.

And so really quickly here, the link to this code is in the description of the video in a GitHub repo, so you can check it out if you want, but I'm just going to go over this at a really high level right now. So first of all, we have a section that defines all the tools that we're giving the agent to interact with Asana on my behalf, to manage projects and to manage tasks. And so here are all the tools, and then we get into the next section, which is the function to actually interact with our AI agent. And so I build up the chatbot and bind all the tools to it, and then handle all the prompting here, and also handle any of the tool calling that comes up when the AI wants to invoke a tool as an agent.
this is just where we Define everything
with a streamlet UI so I can interact
act with my AI in the browser and have
it manage tasks just through natural
language that I spit at it uh through
the chat component and so that is
everything for this AI agent now let's
go ahead and see how well it does with
gp4 all right so here we are in the
Streamlight UI for the task management
AI agent that we have running with GPT
right now the way that I ran this script
is I just ran the command streamlet run
in the name of the Python script that I
just showed you you do that in a
terminal and then it'll give you this UI
in the browser for you to interact with
your agent and so what I'm going to do
right now to test how good GPT is with
function calling is I'm going to give it
a very difficult task where it needs to
invoke many different tools to interact
with a SAA to do something rather
complex for me and then we'll test the
exact same thing with the grock Llama 3
model and so I'm going to start out with
a very simple question I'm going to say
give me the 10 steps to create an AI
agent application and so basically I'm
just having GPT start out by doing a
little bit of research for me so it'll
give me the top 10 steps to make an AI
Agent app it's a little vag but we're
just doing this as an example and then
what I'm going to do is I'm going to say
okay
great now create a project in ass sauna
called I'll just say like AI Agent app
and add each step as a task that is due
by Friday all right so now we are
kicking off many different things behind
the scenes where GPT has to know to
invoke the tool to create a project and
then go go into it and create tasks for
every single step so it has to also
understand the due date that I gave and
its previous response to be able to pick
out each of those tasks and turn them
into a nice little title for me for each
task and so it's going to take a little
bit here because it has to invoke every
single one of those tools um but I'm
specifically letting it go here and not
just pausing and coming back when it's
done because I want to show the speed
here and also compare that to the grock
Llama 3 Model so here we go I've created
a project in AA called a Agent app and
I've added each of these tasks and it
gives the links as well so that worked
flawlessly that is awesome and so now
I'm going to do a couple of other little
tests here and then we'll go and
actually check it out in AA so first
I'll say nice I have finished um
defining the purpose and scope I don't
spell it right but that's totally fine
because I wanted to mark this task as
complete all right it has marked it as
complete nice and I'll say I'll just do
another test where I want it to delete a
task I'll say I actually don't want to
test the application I do not recommend
this but this is just a test here
because I want to remove this task uh
there we go it's removed it all right
nice and now I'm going to test adding
another task I'll say instead I want to
hire someone to test my app so I wanted
to add that as a task instead oh nice
okay so before it even adds a task it
asks me for the due date which is really
good so I'll say Saturday all right
added in by
Saturday so now it's thinking here we go
yep hire someone to test the app here we
go all right so now let's going to ASA
and actually check out and make sure
that all these things worked as the bot
told me it did so here we go over to ASA
we've got a new project called AI Agent
app I click into this and then boom here
we go we got a task for every single one
of the steps to build an AI Agent app
toine the purpose and scope is complete
we don't see test the application
anymore and we do have a new task
created that is due by Saturday to hire
someone to test the app and this is new
in two Saturdays from now which is also
nice that it that it determined that so
everything worked great now we're going
to go over to the grock Llama 3 model
and see if it can do this just as well
or maybe even better or faster so let's
go ahead and dive into how we change the
code to do that.

All right, so I'm going to spend just a minute going over the changes that it takes to use the Groq Llama 3 model, and then we'll go ahead and test this one just like we did with GPT to see how it fares with function calling. And so the first thing is I'm going to import a new module from langchain-groq, where it's just ChatGroq, and we'll use this to instantiate a Groq model for our chatbot instead of an OpenAI one. And then for our model that we have defined through the environment variables, we're going to have a default here of the Llama 3 Groq 70 billion parameter version. And so with that, all the tools are going to be exactly the same, so all this code is going to be very, very similar. The only difference here is, instead of using a ChatOpenAI object to instantiate the chatbot, we're going to use ChatGroq, passing in that Groq Llama 3 70 billion parameter model. You could even test this with the 8 billion one as well, because that one is apparently number three on the benchmarks, so that'd be cool to play with too. And that is all the changes you have to make; using LangChain to work with Groq is so, so easy.
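The swap just described amounts to changing one constructor. A minimal sketch, assuming the langchain-openai and langchain-groq packages and Groq's tool-use model names at the time of the video (llama3-groq-70b-8192-tool-use-preview and the 8B variant), which may have changed since; the env-var name and helper are illustrative, not the repo's actual code:

```python
import os

def pick_provider(model_name: str) -> str:
    # GPT models go through OpenAI; everything else here goes through Groq.
    return "openai" if model_name.startswith("gpt") else "groq"

def build_chat_model():
    # Default to Groq's 70B tool-use Llama 3; override via env var, e.g.
    # LLM_MODEL=llama3-groq-8b-8192-tool-use-preview for the 8B version.
    model_name = os.getenv("LLM_MODEL", "llama3-groq-70b-8192-tool-use-preview")
    if pick_provider(model_name) == "openai":
        from langchain_openai import ChatOpenAI  # imported lazily
        return ChatOpenAI(model=model_name)
    from langchain_groq import ChatGroq
    return ChatGroq(model=model_name)
```

The rest of the agent is untouched; binding the tools works the same on either chat model object.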
They have documentation for how you can use Groq without LangChain, but this just makes it so simple. So with that, let's go ahead and test out this new Groq Llama 3 model.

All right, so here we are in the Streamlit UI for the task management AI agent again, but this time powered by the Groq Llama 3 for function calling. And so I'm just going to go ahead and go through the exact same process as before. Right off the bat, you can see this thing is so freaking fast compared to GPT, which is so cool. It doesn't have the streaming effect, like the typewriter effect that GPT has, but I still appreciate the speed a ton. And so now I'm just going to go ahead and give it a request to do all the things in Asana like we did with GPT. And right off the bat, it's asking us to confirm the exact date for Friday. Okay, so that's a little weird, and I think it's just because Llama 3 isn't as powerful as GPT, but I'll say, "Friday is..." and then I'll actually check my calendar really quickly here. "Friday is the 26th." All right, so let's see if it can take this and run with it, to add the due dates and add all these tasks into the new AI Agent App project.

So it's going to take a little bit, because even though Groq is really fast, I think there's a little bit of rate limiting because I'm using the free tier. And so it'll make one task, and then it'll prompt itself again to make the next task, and that starts to kind of rate limit itself, so I'm going to come back when this is done. Oh, actually, never mind, there we go: "All the tasks for AI Agent App have been added successfully and are due by Friday." That is perfect. Okay, so it took a little bit to get there; I had to give it a date when I didn't have to give that to GPT, but this is still pretty cool. The fact that a local model, an open-source model, can do this is freaking insane.

And so now I'm going to give it another request. I'll say, "I have finished..." let's see, I'm going to say, "I have finished choosing a programming language and dev environment," because I want it to actually mark this task as complete. That is interesting: it seems there is an error updating the task, "the task ID you provided is not valid." I'll say, "No, you need to look up the task IDs." I don't want to have to give that to the model; it needs to be able to determine that itself, just like GPT did. Okay, "'Define the problem' has been updated successfully." I don't even know if that's the right one, so it's not doing the best here, but I'll test it out a little bit more. "Create a new task to hire out a dev." Let's see if it can make a new task in this project to hire out a developer for me; hopefully it can do this one fine, we'll see what happens. It's taking its sweet time here, not really sure why, this one should be pretty quick. It seems like GPT is actually faster at invoking tools somehow, but here we go: "The task hire developer has been created successfully." Let's do one more test here, where I'll delete the task "test the AI model." I don't want this anymore; let's see if we can get rid of it fine, and then I'll go over to Asana after this and verify that everything actually looks the way it should, based on what Llama 3 told me in this conversation. So I'll just give it a little bit of time to delete that task, and then we'll swap over to Asana.
All right, so it has successfully deleted the task for me, and now let's go over to Asana and check this out. So I deleted the AI Agent App project from GPT; now this is the only one, the one that was created by Llama 3, so I'll click into it and we'll see how it looks. Okay, so "hire developer" has been added, all the other tasks are here, it has checked off the "define the problem" task, and I don't see "test the AI app" anymore. So there we go: it successfully did everything that GPT did. It took a little bit more to get it there, but it did work, and so that is a huge victory for open-source and local models.

So I honestly can't say I'm 100% impressed with these Groq Llama 3 models for function calling, because they're not quite as good as GPT, I think mostly just because GPT is able to handle a bunch of tokens a lot better. But still, it's insane how well this model is doing compared to other local and open-source models. I didn't even want to compare it to a base Llama 3 or Microsoft Phi, for example, because those fall apart so badly it wouldn't even be a good demonstration. So that's why I compared it to GPT, and it was almost as good, which is a huge victory for open-source models. If you're an advocate for transparency in AI, or for making AI accessible to anybody, then this is what you want to be rooting for: these models getting almost as good as proprietary ones is a big step forward. So I'm stoked with this. I hope that you can take this knowledge that I've given you and apply it to add local models as AI agents in your workflows. If you found this useful in any way, I'd really appreciate a like and a subscribe, and with that, I will see you in the next one.
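The free-tier rate limiting the presenter runs into during the Groq test is a common snag for agent loops that fire many requests in a row, and a simple retry-with-backoff wrapper around the chat call is the usual fix. A minimal self-contained sketch (the exception type and delays here are illustrative assumptions, not part of the video's code):

```python
import time

class RateLimitError(Exception):
    """Stand-in for the 429 error a rate-limited API client would raise."""

def with_backoff(fn, retries: int = 3, base_delay: float = 1.0):
    """Call fn, retrying with exponential backoff on rate-limit errors."""
    for attempt in range(retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Example: a call that fails once, then succeeds.
attempts = {"n": 0}
def flaky_call():
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RateLimitError()
    return "ok"

print(with_backoff(flaky_call, base_delay=0.01))
```

In the agent, the wrapped `fn` would be the chat-model invocation, so each tool-calling round trip tolerates a transient 429 instead of stalling the run.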