The HARD Truth About Hosting Your Own LLMs
Summary
TL;DR: This video discusses the rising trend of running local large language models (LLMs) to gain flexibility, privacy, and cost efficiency when scaling AI applications. While hosting your own LLMs avoids paying per token, it requires powerful hardware and a high upfront cost. The presenter introduces a hybrid strategy: start with an affordable pay-per-token service like Groq, then transition to hosting the models yourself once that becomes more cost-effective. The video shows how easy Groq is to integrate, highlights its speed and pricing, and provides a formula for determining the optimal time to switch to self-hosting.
Takeaways
- 💻 Running your own large language models (LLMs) locally is gaining popularity, offering advantages like no per-token costs and keeping your data private.
- 🔋 Running powerful LLMs locally, like Llama 3.1, requires extremely powerful and expensive hardware, often costing at least $2,000 for GPUs.
- ⚡ Local LLMs become the most cost-effective option once you scale, but the upfront hardware investment and ongoing electricity expenses make the initial setup very expensive.
- 🚀 An alternative to running local LLMs is to pay per token with a cloud service like Groq, which offers much cheaper and faster AI inference without any hardware costs.
- 🛠️ Groq lets you pay per token for open models like Llama 3.1 70B, and the platform is easy to integrate into existing systems with minimal code changes.
- 🌐 Groq is not fully private since the company hosts the model for you, so for highly sensitive data, users should plan to move to self-hosting as they scale.
- 💡 The strategy outlined suggests starting with Groq's pay-per-token model, then transitioning to hosting the same model yourself when scaling, to save on long-term costs.
- 💰 Groq offers highly competitive pricing, charging around 59 cents per million tokens for Llama 3.1 70B, making it affordable compared to closed-source models.
- 📊 A cost-benefit analysis shows that once a business reaches a certain number of prompts per day (around 3,000), it becomes more cost-effective to self-host LLMs.
- 🌥️ Hosting your own LLM in the cloud using a service like RunPod is recommended over buying hardware for flexibility, but it comes at a recurring cost (about $280 per month for an A40 GPU with 48GB of VRAM).
Q & A
What are the main advantages of running your own local large language models (LLMs)?
-The main advantages include increased flexibility, better privacy, no need to pay per token, and the ability to scale without sending data to external companies. Local LLMs allow businesses to keep their data protected and potentially lower costs as they scale.
What are the challenges associated with running powerful LLMs locally?
-Running powerful LLMs locally requires expensive hardware, such as GPUs that cost at least $2,000, and the electricity costs can be high when running them 24/7. Additionally, setting up and maintaining the models can be time-consuming and complex.
How does the cost of running local LLMs compare to cloud-hosted models?
-Running LLMs locally requires a significant upfront investment in hardware and ongoing electricity costs. On the other hand, using cloud-hosted GPU machines can cost more than $1 per hour, which adds up quickly. However, local LLMs become more cost-effective once a business scales.
What is Groq, and why is it recommended in the video?
-Groq is an AI service that lets users pay per token for LLM inference at a very low cost, and it even has an indefinite free tier for light usage. It offers speed and affordability, making it a great option for businesses before they scale to the point where hosting their own models is more cost-effective.
What is the suggested strategy for businesses wanting to use LLMs without a large upfront investment?
-The suggested strategy is to start by paying per token with a service like Groq, which is affordable and easy to integrate. Then, once the business scales to the point where paying per token becomes more expensive, it can switch to hosting its own LLMs.
When does it make sense to switch from paying per token to hosting your own LLMs?
-The decision to switch depends on the scale of the business and the number of LLM prompts per day. For example, once you generate around 3,000 prompts per day, it becomes more cost-effective to host the LLM yourself rather than paying per token with a service like Groq (see the worked formula below).
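Stated as a formula (a restatement of the video's napkin math, using its example figures of 1.69M tokens per $1, 5,000 tokens per prompt, and a roughly $280/month GPU):

```latex
\text{break-even prompts per month}
  = \frac{\text{tokens per \$1}}{\text{tokens per prompt}} \times \text{GPU cost per month}
  \approx \frac{1{,}690{,}000}{5{,}000} \times 280
  \approx 94{,}600 \quad\Rightarrow\quad \approx 3{,}150 \text{ prompts per day}
```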
What are the hardware requirements for running Llama 3.1 70B locally?
-To run Llama 3.1 70B locally, you need powerful hardware such as a GPU with at least 48GB of VRAM, like an A40 instance, which costs around 39 cents per hour in the cloud.
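As a rough sanity check on that hardware claim (a back-of-envelope estimate, not a figure from the video), weight memory scales with parameter count times bytes per parameter, which is why a 48GB card only fits a 70B model in quantized form:

```python
# Rough VRAM needed just for the weights of a 70B-parameter model.
# Back-of-envelope only: the KV cache and activations add more on top.
PARAMS = 70e9
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

for precision, nbytes in BYTES_PER_PARAM.items():
    gb = PARAMS * nbytes / 1e9
    verdict = "fits" if gb <= 48 else "does not fit"
    print(f"{precision}: ~{gb:.0f} GB of weights -> {verdict} on a 48 GB A40")
```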
How easy is it to integrate Groq into existing AI workflows?
-Integrating Groq into existing AI workflows is simple. Users only need to point the base URL of their OpenAI client at Groq's API and add their API key. For LangChain users, it's even easier, with a pre-built package for Groq integration (a minimal sketch follows).
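A minimal sketch of that base-URL swap, assuming the standard `openai` Python package and Groq's OpenAI-compatible endpoint; the model ID shown is an example, so confirm the current name in the Groq console:

```python
import os
from openai import OpenAI  # pip install openai

# Point the standard OpenAI client at Groq's OpenAI-compatible API.
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

response = client.chat.completions.create(
    model="llama-3.1-70b-versatile",  # example model ID; check Groq's model list
    messages=[{"role": "user", "content": "Why is self-hosting LLMs expensive up front?"}],
)
print(response.choices[0].message.content)
```

Everything else in an existing OpenAI-based code path stays the same, which is what the video means by minimal code changes.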
What are some of the concerns when using Groq for sensitive data?
-While Groq offers better data privacy than proprietary models like GPT or Claude, it is still a hosted service. It is therefore recommended to use mock data when developing applications that handle highly sensitive information. Full data privacy is only achieved once the LLMs are hosted on your own infrastructure.
What are the potential long-term benefits of switching to local LLM hosting?
-Once a business scales and begins generating a large number of prompts, hosting LLMs locally can save thousands of dollars compared to paying per token. Additionally, businesses gain full control over their data and can avoid reliance on external services.
Outlines
💻 The Benefits and Challenges of Running Local LLMs
Running local large language models (LLMs) is becoming increasingly popular due to the cost savings and enhanced privacy they offer. By hosting your own models, you avoid paying per token and sharing data with third parties, making it a scalable solution. However, running powerful models like LLaMA 3.1 requires extremely expensive hardware, with GPUs costing thousands of dollars, and substantial energy consumption. Cloud-based GPU instances are an alternative but can also become expensive over time. These challenges make local hosting impractical for many businesses without significant upfront investments.
🔄 A Two-Stage Strategy for Using LLMs Cost-Effectively
The strategy proposed involves initially paying per token to use models through third-party services, rather than committing to costly local hosting. This allows businesses to scale gradually without large upfront costs. Once the operation reaches a certain scale where the costs of paying per token exceed the cost of owning hardware, the business can then transition to local hosting. This approach provides flexibility, maintaining the use of the same model without the disruptions of switching services, and is designed to mitigate the financial burden while benefiting from both local and cloud-based solutions.
⚙️ Introduction to Groq: A Cost-Effective AI Service
Groq is introduced as a highly affordable service for accessing LLMs, offering pay-per-token pricing with exceptional speed and a free tier for light use cases. Setting Groq up with the OpenAI client or LangChain is simple, requiring minimal code changes. Groq's speed is a standout feature: its homepage demo reports a processing rate of roughly 1,200 tokens per second. While Groq is still a third-party service, it is presented as a better option than closed-source models like GPT or Claude in terms of both cost and data privacy. Users are still advised to handle sensitive data with care when using any hosted service during early-stage development.
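For the LangChain route mentioned above, a minimal sketch might look like this (assuming the `langchain-groq` package and a `GROQ_API_KEY` environment variable; the model ID is an example, so verify it against Groq's current model list):

```python
from langchain_groq import ChatGroq  # pip install langchain-groq

# ChatGroq reads GROQ_API_KEY from the environment by default.
llm = ChatGroq(
    model="llama-3.1-70b-versatile",  # example model ID; check Groq's model list
    temperature=0,
)

# Same interface as any other LangChain chat model: invoke, stream, and so on.
print(llm.invoke("Give me one sentence on Groq's inference speed.").content)
```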
💸 Cost Breakdown: When to Switch from Groq to Self-Hosting
The decision to switch from Groq to self-hosting can be determined with simple calculations. For example, a cloud-based A40 GPU with 48 GB of VRAM costs about $280 per month. By comparing this to Groq's pricing (1.69 million tokens per $1), businesses can calculate the tipping point where paying per token becomes more expensive than self-hosting. For a typical prompt of 5,000 tokens, a business could process roughly 94,000 prompts per month with Groq for that same $280, beyond which it becomes more cost-effective to host the model itself. This approach helps businesses manage costs as they scale.
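A small sketch of that napkin math, using the figures quoted above; the variable values (especially the 5,000-token average) are illustrative, so plug in the numbers you actually measure for your application:

```python
# Break-even estimate: when does renting a GPU beat paying Groq per token?
TOKENS_PER_DOLLAR = 1_690_000   # Groq: ~1.69M tokens per $1 for Llama 3.1 70B
TOKENS_PER_PROMPT = 5_000       # measure your real average per request
GPU_COST_PER_HOUR = 0.39        # RunPod A40 (48 GB VRAM), secure cloud
HOURS_PER_MONTH = 24 * 30

gpu_cost_per_month = GPU_COST_PER_HOUR * HOURS_PER_MONTH        # ~$280
prompts_per_dollar = TOKENS_PER_DOLLAR / TOKENS_PER_PROMPT      # ~338
breakeven_per_month = prompts_per_dollar * gpu_cost_per_month   # ~94,900
breakeven_per_day = breakeven_per_month / 30                    # ~3,160

print(f"GPU rental: ~${gpu_cost_per_month:.2f}/month")
print(f"Break-even: ~{breakeven_per_month:,.0f} prompts/month "
      f"(~{breakeven_per_day:,.0f}/day) before self-hosting wins")
```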
📈 Strategic Considerations for Scaling with Local LLMs
As businesses grow, the strategy balances the convenience of Groq with the long-term savings of self-hosting. While Groq offers competitive pricing and requires no maintenance, a self-hosted LLM eventually becomes cheaper as the number of prompts increases. The presenter recommends renting GPUs from services like RunPod or DigitalOcean rather than buying hardware, since owning machines adds maintenance, electricity, and upgrade-cycle burdens. Despite the higher recurring cost, this strategy offers significant savings in the long run for high-demand applications.
💡 Conclusion: Maximizing Efficiency with a Two-Stage Approach
The video concludes by reiterating the two-stage strategy for leveraging LLMs efficiently. Initially, businesses can use a cost-effective pay-per-token service like Groq, avoiding large upfront investments. As their AI usage scales, they can switch to self-hosting and save thousands in the long run. The presenter emphasizes tracking usage metrics (average tokens per prompt, prompts per day) to determine the optimal moment for the transition, helping businesses balance short-term costs and long-term savings. The strategy is positioned as a practical roadmap for companies looking to integrate AI without overwhelming upfront expenses.
Keywords
💡Local LLMs (Large Language Models)
💡Hardware Requirements
💡Cost of Scaling
💡Per Token Pricing
💡GPU Machines in the Cloud
💡Groq
💡Llama 3.1
💡LangChain
💡Switching from Hosted to Local LLMs
💡Data Privacy
Highlights
Running local large language models (LLMs) is gaining popularity due to cost savings and data privacy.
Local LLMs eliminate the need for per-token fees and ensure your data stays private, crucial for scaling business applications.
Despite the advantages, running powerful LLMs like LLaMA 3.1 requires expensive hardware, such as $2,000 GPUs, and can incur high electricity costs.
Cloud GPU machines provide an alternative but can still be costly, charging upwards of $1 per hour, which adds up quickly.
Initial setup and use of local LLMs may lead to slower response times or even timeouts if the hardware isn't powerful enough.
A strategic solution is to begin by paying per token for hosted open models through Groq, which offers fast, cost-efficient LLM inference.
Groq's free tier can be suitable for light usage, offering flexibility before scaling to self-hosted models.
Integrating Groq into existing applications is simple, requiring minimal code changes, such as updating the base URL and adding an API key.
Groq provides extremely affordable rates ($0.59 per million tokens for Llama 3.1 70B), making it a viable option for early-stage applications.
While Groq still involves some data privacy concerns since the model is hosted by a third party, it remains a better option than closed-source models like GPT.
The strategy is to use Groq's service while scaling and switch to hosting LLMs yourself once the cost-benefit shifts in favor of self-hosting.
Napkin math calculations help determine the exact point at which it becomes more cost-effective to self-host LLMs based on token usage and cloud GPU costs.
Using an A40 GPU from RunPod, the monthly cost is approximately $280, making it a useful benchmark when considering the transition from Groq.
With an average prompt consuming 5,000 tokens, the switch from Groq to self-hosting makes sense once you handle roughly 3,000 prompts per day.
While Groq remains affordable and highly performant for most applications, businesses will eventually save money by switching to self-hosted LLMs once they scale.
Transcripts
running your own large language models
locally is all the rage right now there
are dozens of incredibly powerful llms
that you can download and self-host now
and running your own llms means that you
don't have to pay per token as you are
using your AI and you don't have to send
your data to another company they are
fantastic for scaling your business and
keeping your data protected and while
they can't compete with the best closed
source models like o1 and Claude 3.5 Sonnet
they still absolutely kick butt with
the right setup however there are some
hard truths that you and I have to face
that you will quickly realize when you
try to use the more powerful local llms
like llama 3.1 on your own Hardware I
can guarantee that the first time you
try to use a local llm it is going to
take you much longer than you thought to
get a response and sometimes you won't
even get a response because you'll get a
timeout and in that case your computer
straight up is not good enough to use
the model at all and that my friend
reveals the hard truth running the more
powerful local llms requires insanely
powerful Hardware I'm talking gpus that
cost at least two grand just to be able
to run the weakest version of llama 3.1
and that's not even to mention all the
electricity costs that you're going to
have to pay to run this thing 24/7 you
could also go with GPU machines running
in the cloud but those can cost more
than a dollar per hour that adds up
really really quick and that's even on
the low end and this is a big problem
because on one hand you and I we want to
use local llms they give us the most
flexibility privacy and ability to scale
because we aren't paying per token but
on the other hand running the more
powerful llms locally can cost hundreds
or thousands of dollars up front and
local llms are actually the most
affordable when you really start to
scale but getting to that point with
them is absolutely painful I have seen
this problem this contention play out
with many businesses as they start to
integrate AI so I have developed a
simple but effective strategy for this
that I want to reveal to you now here is
the premise of the strategy unless
you're willing to put in a large initial
investment running local AI right off
the bat is not realistic that's what you
and I just covered but what you can do
is pay per token with these same self-
hostable models in an incredibly cheap
way and then later on when the price is
right scale to hosting these exact same
models yourself on your own hardware
that way you're only paying the big
bucks when it makes sense and you don't
even have to switch your llm which that
can have a lot of unintended
consequences for your application so in
this video allow me to show you both
sides of this strategy here first
starting with paying per token and then
going to hosting your llm yourself and
there is an exact calculable time when
it makes sense to make the switch and I
will cover that as well the way that you
start this strategy so you aren't paying
thousands Upfront for local llms is with
one of my favorite AI services on the
entire planet Groq Groq is your way to pay
per token for super fast openly
available llm inference and it is
insanely cheap oftentimes it's actually
free if your requirements are light
enough because Groq has an awesome and
indefinite free tier and I also just
want to say that I'm not sponsored by
Groq in any way their platform is just
the best for speed and price when it
comes to this kind of thing and so
that's why I'm including Groq
specifically in this strategy instead of
another service or being more generic
all right so now we get to the fun part
because I'm going to very quickly show
you around Groq I'm going to show you how
easy it is to use how powerful it is and
also how affordable it is as well and
then we'll dive into what it looks like
for you to make that decision to
eventually switch from Groq to hosting
your own llms what that looks like and
when exactly you would make that choice
so I'm going to give a quick overview
here just give you a lot of value very
very quickly so I'm here on the Groq
website it's just groq.com and they boast
their speeds immediately because they
literally have this chat window on their
homepage where you can talk to Groq I'm
not sure which llm this uses under the
hood but just see how fast this is I'll
enter in just a random prompt that I
have here it spits out the answer really
fast and then tells you exactly how
quick it is
1,200 tokens per second that is insane
just for reference one word is about
1.25 tokens
and so this is about 1,000 words per
second roughly that is incredibly fast
much much faster than most llms out
there and it's also insanely easy to use
so if I scroll down on their homepage
here and I go down yep right here it
just shows you how all you have to do to
work with Groq is you have to replace
the base URL within your OpenAI
instance with the Groq API like it says
right here and then add in your Groq
API key and then you just switch out
your model right here it is just so so
easy like just 10 lines of code they
show you right here all you have to do
is pip install openai and then boom
this is your code and you're now working
with Groq and with LangChain it's even
easier because you have to just pip
install this langchain-groq package
set your Groq API key in the
environment variable and then you have
access to this ChatGroq instance that
you can import from langchain-groq you
can define your model any other
parameters like the temperature and now
you can do an llm.invoke or .stream
whatever you'd usually do in LangChain
so it's just so easy to get started with
Groq here as well and then I also love
n8n I've put out a lot of content on it so
I'll show you that really quick as well
so in your n8n workflow you'll typically
have a tools agent node when you're
working with large language models in n8n
and then for the chat model here I'll
just click on this you can see that
Groq is one of the options supported so
no custom integration you can use Groq
right off the bat with n8n all you have
to do for your credentials here is put
in your Groq API key just like we saw
with code and then boom you have access
to all of the models within Groq super
super nice and easy and so with that I
wanted to talk about the price for Groq
as well so we'll go over here and look
at this for llama
3.1 70b it is 59 cents per million tokens
and so what that equates to is for every
$1 you get
1.69 million tokens that is actually
extremely affordable compared to any
closed-source model really all the
powerful ones like GPT-4o and Claude 3.5
Sonnet and then even comparing to other
services that offer llama 3.1 like
together for example they have their
light version of llama 3.1 70b that
actually is technically a little bit
cheaper but if you want to go turbo to
even come close to matching the speeds
of llama 3.1 with Groq then you're
going to be paying more and so Together
AI it's a fine service I've used it
before I like it but it just shows
how insanely affordable Groq is it is
just amazing so there we go in just a
few minutes you now know how to use
Groq what the pricing looks like and
also how fast it really is one thing
that I do want to mention here with Groq
is that it still is another company that is
hosting the llm for you so it's not
quite the same as local AI in terms of
your data privacy but it is still a lot
better than sending your data to a closed
source model like GPT or Claude that's
going to train the model on your
data and eventually theoretically
regurgitate it back out to other people
so it's still much better to use
something like Groq with llama but keep
that in mind that if you're developing a
proof of concept using Groq because you
don't want to pay for self-hosting your
llm you might want to use mock data for
things that's really really private just
as you are building out your application
and then once you scale and you're
running things locally then you can work
with your private data and not have to
worry about anything so with that we can
now dive into the next step of this
strategy all right so I have my
calculator up right now because we are
going to be doing some napkin math to
figure out exactly when it makes sense
for you to switch from paying per token
with a service like Groq to hosting
your own llm because eventually it will
get to the point when you scale your
business that paying per token is more
expensive than just paying for the
hardware now one thing that I want to
mention is that I'm going to be covering
what it looks like to pay for a GPU
machine in the cloud to run your llm
because you can technically build a
computer and have it in your house or a
data center but you have to deal with
maintenance paying for electricity it's
harder to upgrade when the next level of
gpus come out in a half a year or a year
and so generally it's a lot more
flexible to just pay for something in
the cloud and that's generally what I
would recommend and so that's what I'm
going to be focusing on right now there
are a lot of services out there to get
GPU machines a couple that I want to
highlight here is RunPod which is what
I'm on right now and then also
DigitalOcean I'm not affiliated with either
of them but I use both of them and so
that's kind of my recommendation of just
where you'd want to start when looking
for somewhere to host your llms in the
cloud now DigitalOcean is actually
pretty pricey and they don't have a ton
of options I know that they are
expanding this in the future just
because I've done some research since I
like DigitalOcean so much but in the
meantime RunPod actually has much more
competitive pricing you can just kind of
look at some comparisons here for like
an H100 just like they have right here
it's definitely a lot cheaper and also if
you want to run something like llama 3.1
70b which is a really classic local llm
you typically would only need something
like an A40 right here so this is
actually the model or the machine that I
want to use for my example here for my
napkin math so an A40 it is 39 cents in
the secure cloud to run this with 48 GB
of vram definitely good enough for llama
3.1
70b and so what we're going to do right
now is I'm going to walk you through
with this calculator step by step how
you can determine exactly when it
becomes more expensive to pay per token
with Groq and so let's go over to the
calculator here so first of all what we
want to do is compute how much it costs
per month to use that A40 instance with
RunPod so I'm just going to take
0.39 because it is 39 cents an hour
multiply it by 24 because there's 24
hours in a day and then multiply by 30
CU there's roughly 30 days in a month
and that gives us a grand total of
$280 a month so you pay a pretty penny a
few hundred a month for this instance
but let's do the math and figure out
when it makes sense to actually do this
instead of go with grock so I'm going to
go back to the grock pricing here and
we'll take a look at this so for grock
for llama 3.1 70b you get
1.69 million tokens for every $1 that
you spend so let's plug that into the
calculator here I'm going to clear it
completely and I'll add in this
number here so
1,690,000 tokens per $1 now there are a
lot of tokens in your typical prompt
especially if you have complex
instructions for your llm or long chat
histories or you have rag where you're
retrieving multiple big chunks from
documents and dumping those into your
prompt and so the average prompt is
actually like a good solid few thousand
tokens even up to like 10 or 20,000
tokens I have seen before and so this is
a really rough estimate but in your use
case you will know the average amount of
tokens that you have per prompt and
that's something that you can actually
compute so you can track that in the
back end as you're developing your proof
of concept and starting to use your
application and bring users on and
figure out the average tokens so that
you can plug your exact number into this
to have a better idea when exactly you
would switch from Groq to hosting your
local llm so I'm going to go with 5,000
as just a good example here and so
what I'll do is I'm going to divide this
number by
5,000 because that is going to tell me
the number of prompts that I can make to
my llm per $1 you see that here because
it's
1,690,000 tokens per $1 when I divide
by the number of tokens per prompt this
is the number of prompts that I can make
to my llm per $1 and so now what we're
going to do is multiply that by the
price of the RunPod instance because
what this is going to give us is it is
going to give us the number of prompts
that we can make per month until it gets
to the point that RunPod is cheaper so
this is the magic number here about
94,900 prompts per month for the same price as
the RunPod instance to host our llama
3.1 70b as long as that instance can keep
up with the demand which I'm pretty sure
um 48 GB of vram and I think it had like
80 GB of RAM or something it could
probably keep up with the demand of this
many prompts per month because if we
divide by 30 that is going to give us
the number of prompts that we would have
per day so if you get to like 3,000
prompts per day with your application you
would want to self-host your llm instead
and if you think about it let me get
back to that number here I'll divide by
30 again I'm not sure why I went away if
you think about it that's not actually
that many prompts per day if you have
3,000 users on your platform in a
day they could each make one prompt one
call to your llm and then it would
already be worth it to switch over to
hosting your own llm so I hope that
makes sense here I know there's a couple
of rough numbers that I put in here for
examples but you can make this very very
accurate to your use case figure out the
instance you want with RunPod the
exact pricing for your model in Groq
how many tokens you have on average per
prompt and this can come down to an
exact science when you'd make that
switch so there you have it that is my
grand strategy for how you can work with
local llms effectively to not spend a
ton of money up front but eventually
scale where you're hosting everything
completely and saving thousands and
thousands of dollars once you scale your
application to thousands and thousands
of users now Groq pricing is pretty
competitive you can see from my math
there that you might want to use it for
a very very long time and you might even
want to use it for longer just because
of the convenience of not having to
maintain your own llms but eventually it
does get to the point where it just
makes so much sense and you save so much
money so I hope that you found this
strategy and all the logic behind it
valuable if you did I would really
appreciate a like and a subscribe and
with that I will see you in the next
video