The HARD Truth About Hosting Your Own LLMs

Cole Medin
25 Sept 2024 · 14:43

Summary

TLDR: This video covers the rising trend of running local large language models (LLMs) for flexibility, privacy, and cost efficiency when scaling AI applications. Hosting your own LLMs avoids paying per token, but it requires powerful hardware and a high upfront cost. The presenter introduces a hybrid strategy: start with an affordable pay-per-token service like Groq, then transition to hosting your own models once that becomes more cost-effective. The video shows how easy Groq is to integrate, highlights its speed and pricing, and provides a simple formula for determining the optimal time to switch to self-hosting.

Takeaways

  • 💻 Running your own large language models (LLMs) locally is gaining popularity, offering advantages like no per-token costs and keeping your data private.
  • 🔋 Running powerful LLMs locally, like Llama 3.1, requires extremely powerful and expensive hardware, often starting at around $2,000 just for the GPU.
  • ⚡ Local LLMs pay off as your business scales, but the upfront hardware cost and ongoing electricity bills make the initial setup very expensive.
  • 🚀 An alternative to running LLMs locally is to pay per token through a service like Groq, which offers cheap, fast AI inference with no hardware costs.
  • 🛠️ Groq lets you pay per token for LLMs like Llama 3.1 70B, and it integrates into existing systems with minimal code changes.
  • 🌐 Groq is not fully private since the company hosts the model, so for highly sensitive data, plan to move to self-hosting as you scale.
  • 💡 The strategy outlined is to start with Groq's pay-per-token model, then transition to hosting the same model yourself when the scale justifies it.
  • 💰 Groq's pricing is highly competitive, around 59 cents per million tokens for Llama 3.1 70B, making it cheap compared to closed-source models.
  • 📊 A simple cost comparison shows that once you reach roughly 3,000 prompts per day, it becomes more cost-effective to self-host the LLM.
  • 🌥️ Hosting your own LLM on a cloud GPU service like RunPod is recommended for flexibility, but it is a recurring cost (about $280 per month for an A40).

Q & A

  • What are the main advantages of running your own local large language models (LLMs)?

    -The main advantages include increased flexibility, better privacy, no need to pay per token, and the ability to scale without sending data to external companies. Local LLMs allow businesses to keep their data protected and potentially lower costs as they scale.

  • What are the challenges associated with running powerful LLMs locally?

    -Running powerful LLMs locally requires expensive hardware, such as GPUs that cost at least $2,000, and the electricity costs can be high when running them 24/7. Additionally, setting up and maintaining the models can be time-consuming and complex.

  • How does the cost of running local LLMs compare to cloud-hosted models?

    -Running LLMs locally requires a significant upfront investment in hardware and ongoing electricity costs. On the other hand, using cloud-hosted GPU machines can cost more than $1 per hour, which adds up quickly. However, local LLMs become more cost-effective once a business scales.

  • What is Groq, and why is it recommended in the video?

    -Groq is an AI inference service that lets users pay per token for open LLMs at very low cost, and it can even be free for light usage thanks to Groq's free tier. It offers exceptional speed and affordability, making it a great option for businesses before they scale to the point where hosting their own models is more cost-effective.

  • What is the suggested strategy for businesses wanting to use LLMs without a large upfront investment?

    -The suggested strategy is to start by paying per token with a service like Groq, which is affordable and easy to integrate. Then, once the business scales to the point where paying per token becomes the more expensive option, it can switch to hosting the same LLM itself.

  • When does it make sense to switch from paying per token to hosting your own LLMs?

    -The decision depends on the scale of the business and the number of LLM prompts per day. For example, once you are handling around 3,000 prompts per day, it becomes more cost-effective to host the LLM yourself (for instance on a rented cloud GPU) than to keep paying per token with a service like Groq.

  • What are the hardware requirements for running Llama 3.1 70B locally?

    -To run Llama 3.1 70B yourself, you need powerful hardware such as a GPU with at least 48GB of VRAM, like an A40, which rents for around 39 cents per hour in the cloud.

  • How easy is it to integrate Groq into existing AI workflows?

    -Integrating Groq into existing AI workflows is simple. Users only need to point the base URL of their OpenAI client at Groq's API and add their Groq API key. For LangChain users it's even easier, thanks to the pre-built langchain-groq package.
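
    A minimal sketch of that base-URL swap using the OpenAI Python SDK; the endpoint and model ID shown here are assumptions to verify against Groq's documentation:

        import os
        from openai import OpenAI

        # Point the standard OpenAI client at Groq's OpenAI-compatible endpoint.
        client = OpenAI(
            base_url="https://api.groq.com/openai/v1",   # assumed endpoint; confirm in Groq's docs
            api_key=os.environ["GROQ_API_KEY"],
        )

        response = client.chat.completions.create(
            model="llama-3.1-70b-versatile",             # assumed model ID; pick one from Groq's model list
            messages=[{"role": "user", "content": "Say hello in five words."}],
        )
        print(response.choices[0].message.content)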

  • What are some of the concerns when using Groq for sensitive data?

    -While Groq offers better data-privacy characteristics than proprietary models like GPT or Claude, it is still a hosted service. It is therefore recommended to use mock data when developing applications that handle highly sensitive information. Full data privacy only comes once the LLM is hosted yourself.

  • What are the potential long-term benefits of switching to local LLM hosting?

    -Once a business scales and begins generating a large number of prompts, hosting LLMs locally can save thousands of dollars compared to paying per token. Additionally, businesses gain full control over their data and can avoid reliance on external services.

Outlines

00:00

💻 The Benefits and Challenges of Running Local LLMs

Running local large language models (LLMs) is becoming increasingly popular due to the cost savings and enhanced privacy they offer. By hosting your own models, you avoid paying per token and sharing data with third parties, making it a scalable solution. However, running powerful models like LLaMA 3.1 requires extremely expensive hardware, with GPUs costing thousands of dollars, and substantial energy consumption. Cloud-based GPU instances are an alternative but can also become expensive over time. These challenges make local hosting impractical for many businesses without significant upfront investments.

05:02

🔄 A Two-Stage Strategy for Using LLMs Cost-Effectively

The strategy proposed involves initially paying per token to use models through third-party services, rather than committing to costly local hosting. This allows businesses to scale gradually without large upfront costs. Once the operation reaches a certain scale where the costs of paying per token exceed the cost of owning hardware, the business can then transition to local hosting. This approach provides flexibility, maintaining the use of the same model without the disruptions of switching services, and is designed to mitigate the financial burden while benefiting from both local and cloud-based solutions.

10:03

⚙️ Introduction to Groq: A Cost-Effective AI Service

Groq is introduced as a highly affordable service for accessing LLMs, offering pay-per-token pricing, exceptional speed, and a free tier for light use cases. Setting up Groq with the OpenAI SDK or LangChain is simple and requires minimal code changes. Groq's speed is a standout feature, with output rates around 1,200 tokens per second. While Groq is still a third-party service, it is presented as a better option than closed-source models like GPT or Claude in terms of both cost and data privacy. Users are advised to handle sensitive data with care when using any hosted service during early-stage development.

💸 Cost Breakdown: When to Switch from Groq to Self-Hosting

The decision to switch from Groq to self-hosting can be determined with a few simple calculations. For example, a cloud A40 GPU with 48 GB of VRAM costs about $280 per month. Comparing this to Groq's pricing (about 1.69 million tokens per $1), businesses can calculate the tipping point where paying per token becomes more expensive than self-hosting. For a typical prompt of 5,000 tokens, that works out to roughly 94,000 prompts per month on Groq before self-hosting the model becomes cheaper. This approach helps businesses manage costs as they scale.
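
A quick back-of-the-envelope version of that calculation, using the example figures quoted in the video (your own GPU rate and average prompt size will vary):

    A40 on RunPod:            $0.39/hr × 24 hr × 30 days = $280.80/month
    Groq, Llama 3.1 70B:      about 1,690,000 tokens per $1
    Prompts per $1:           1,690,000 ÷ 5,000 tokens/prompt = 338
    Break-even prompts/month: 338 × $280.80 ≈ 94,900
    Break-even prompts/day:   94,900 ÷ 30 ≈ 3,160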

📈 Strategic Considerations for Scaling with Local LLMs

As businesses grow, the strategy balances the convenience of Groq against the long-term savings of hosting the model yourself. While Groq offers competitive pricing and convenience, a self-hosted LLM eventually becomes cheaper as the number of prompts increases. The video recommends renting GPU machines from services like RunPod or DigitalOcean rather than buying hardware, since owning a machine means dealing with maintenance, electricity, and GPU upgrade cycles. Despite the recurring cost, self-hosting offers significant savings in the long run for high-demand applications.

💡 Conclusion: Maximizing Efficiency with a Two-Stage Approach

The video concludes by reiterating the two-stage strategy for leveraging LLMs efficiently. Initially, businesses can use a cost-effective service like Groq to pay per token, avoiding large upfront investments. As their AI usage scales, they can switch to self-hosting and save thousands in the long run. The presenter emphasizes tracking usage metrics, especially average tokens per prompt and prompts per day, to determine the optimal moment for this transition, balancing short-term costs and long-term savings. The strategy is positioned as a practical roadmap for companies looking to integrate AI without overwhelming upfront expenses.

Keywords

💡Local LLMs (Large Language Models)

Local LLMs refer to large language models that are hosted and run on personal or business hardware rather than through cloud services. The video highlights the benefits of using local LLMs, such as increased privacy, flexibility, and scalability, but also discusses the challenges, like requiring expensive hardware and significant electricity costs. Local LLMs are ideal for businesses that prioritize data protection and don't want to rely on third-party servers.

💡Hardware Requirements

Running advanced local LLMs like Llama 3.1 requires powerful and expensive hardware, such as GPUs costing thousands of dollars. The video emphasizes this as one of the major obstacles to hosting LLMs locally, warning viewers that underpowered hardware will struggle or fail to process the large models, resulting in timeouts or performance issues.

💡Cost of Scaling

Scaling refers to increasing the use of LLMs in applications as the number of users or interactions grows. The video explains that local LLMs can be more cost-effective in the long run as usage scales up, but there is a large initial investment required for hardware. By contrast, paying per token on hosted services is affordable at smaller scales but can become more expensive when scaling to thousands of users.

💡Per Token Pricing

This pricing model refers to paying based on the number of tokens (words or pieces of words) processed by the LLM, rather than a flat fee for usage. The video suggests this as a more affordable option for businesses starting out, especially with services like Groq, which offers competitive token-based pricing before users decide to invest in hosting their own models.

💡GPU Machines in the Cloud

Instead of buying physical hardware, businesses can rent GPU-powered cloud machines to run LLMs, offering more flexibility and avoiding upfront costs. The video discusses this as an alternative to running LLMs locally, highlighting services like RunPod or DigitalOcean that offer these cloud machines at a lower cost than building a local setup.

💡Groq

Groq is an AI service that provides fast and affordable LLM inference, meaning it processes language model requests at high speed with minimal latency. The video praises Groq for its low-cost, pay-per-token pricing and its ease of integration with the OpenAI SDK and LangChain, positioning it as a practical solution for developers who want fast LLM performance without the high costs of local hosting.

💡Llama 3.1

Llama 3.1 is a powerful open-source LLM used throughout the video as the example of a model people can host themselves. It illustrates the kind of large model that requires substantial hardware resources to run, but it is also available through services like Groq, where users can access it at low cost without hosting it themselves.

💡LangChain

LangChain is a framework that makes it easier to build applications with LLMs by handling the chain of prompts and responses in a conversational AI setup. The video mentions LangChain as a useful library that simplifies integrating models like those offered by Groq into applications, letting developers set up AI workflows with minimal code.

💡Switching from Hosted to Local LLMs

The video presents a strategy for transitioning from using hosted LLMs (like Groq) to locally hosting LLMs when the business scales to a point where running models locally becomes more cost-effective. This switch is based on the number of prompts processed per day, and the video walks through the calculation to determine the optimal time to make the transition.

💡Data Privacy

Data privacy is a core concern in the video, especially when comparing hosted LLMs and local LLMs. While hosted models like Groq provide flexibility and affordability, they still require sending data to external servers, which could lead to privacy risks. In contrast, local LLMs keep all data in-house, providing better control and security, making them a preferred option for businesses handling sensitive information.

Highlights

Running local large language models (LLMs) is gaining popularity due to cost savings and data privacy.

Local LLMs eliminate per-token fees and keep your data in-house, which matters when scaling business applications.

Despite the advantages, running powerful LLMs like Llama 3.1 requires expensive hardware, such as GPUs costing $2,000 or more, and can incur high electricity costs.

Cloud GPU machines are an alternative but can still be costly, often charging upwards of $1 per hour, which adds up quickly.

Underpowered hardware leads to slow responses or outright timeouts when you first try to run a local LLM.

A strategic alternative is to begin by paying per token for hosted open models through Groq, which offers fast, cost-efficient LLM inference.

Groq's free tier can be suitable for light usage, offering flexibility before scaling to self-hosted models.

Integrating Groq into existing applications is simple, requiring minimal code changes such as updating the base URL and adding an API key.

Groq's rates are extremely affordable at $0.59 per million tokens for Llama 3.1 70B, making it a viable option for early-stage applications.

While Groq still involves some data-privacy trade-offs since it is hosted by a third party, it remains a more private option than closed-source models like GPT.

The strategy is to use Groq while scaling, then switch to hosting the LLM yourself when the cost-benefit shifts in favor of self-hosting.

Napkin math determines the point at which self-hosting becomes cheaper, based on average token usage and cloud GPU costs.

An A40 GPU from RunPod costs approximately $280 per month, a useful benchmark when considering the transition away from Groq.

With average prompts of 5,000 tokens, the switch from Groq to self-hosting makes sense at roughly 3,000 prompts per day.

Groq remains affordable and highly performant for most applications, but businesses will eventually save money by switching to self-hosted LLMs once they scale.

Transcripts

00:00

Running your own large language models locally is all the rage right now. There are dozens of incredibly powerful LLMs that you can download and self-host, and running your own LLMs means that you don't have to pay per token as you use your AI, and you don't have to send your data to another company. They are fantastic for scaling your business and keeping your data protected, and while they can't compete with the best closed-source models like o1 and Claude 3.5 Sonnet, they still absolutely kick butt with the right setup. However, there are some hard truths that you and I have to face, truths you will quickly realize when you try to use the more powerful local LLMs like Llama 3.1 on your own hardware. I can guarantee that the first time you try to use a local LLM, it is going to take much longer than you thought to get a response, and sometimes you won't get a response at all because you'll hit a timeout. In that case your computer straight up is not good enough to run the model. And that, my friend, reveals the hard truth: running the more powerful local LLMs requires insanely powerful hardware. I'm talking GPUs that cost at least two grand just to be able to run the weakest version of Llama 3.1, and that's not even to mention all the electricity costs you're going to pay to run this thing 24/7. You could also go with GPU machines running in the cloud, but those can cost more than a dollar per hour, and that adds up really, really quickly, even on the low end.

01:32

And this is a big problem, because on one hand you and I want to use local LLMs: they give us the most flexibility, privacy, and ability to scale because we aren't paying per token. But on the other hand, running the more powerful LLMs locally can cost hundreds or thousands of dollars up front. Local LLMs are actually the most affordable once you really start to scale, but getting to that point with them is absolutely painful. I have seen this problem, this contention, play out with many businesses as they start to integrate AI, so I have developed a simple but effective strategy for this that I want to reveal to you now.

02:09

Here is the premise of the strategy. Unless you're willing to put in a large initial investment, running local AI right off the bat is not realistic; that's what you and I just covered. But what you can do is pay per token for these same self-hostable models in an incredibly cheap way, and then later on, when the price is right, scale to hosting these exact same models yourself on your own hardware. That way you're only paying the big bucks when it makes sense, and you don't even have to switch your LLM, which can have a lot of unintended consequences for your application. So in this video, allow me to show you both sides of this strategy, starting with paying per token and then moving to hosting your LLM yourself. There is an exact, calculable time when it makes sense to make the switch, and I will cover that as well.

02:59

The way you start this strategy, so you aren't paying thousands up front for local LLMs, is with one of my favorite AI services on the entire planet: Groq. Groq is your way to pay per token for super fast, openly available LLM inference, and it is insanely cheap. Oftentimes it's actually free if your requirements are light enough, because Groq has an awesome and indefinite free tier. I also just want to say that I'm not sponsored by Groq in any way; their platform is simply the best for speed and price when it comes to this kind of thing, and that's why I'm including Groq specifically in this strategy instead of another service or being more generic.

03:35

All right, now we get to the fun part, because I'm going to very quickly show you around Groq: how easy it is to use, how powerful it is, and how affordable it is as well. Then we'll dive into what it looks like to make that decision to eventually switch from Groq to hosting your own LLMs, what that looks like and when exactly you would make that choice. So I'm going to give a quick overview here and give you a lot of value very, very quickly.

03:59

I'm here on the Groq website, groq.com, and they boast their speeds immediately, because they literally have a chat window on their homepage where you can talk to Groq. I'm not sure which LLM it uses under the hood, but just see how fast this is. I'll enter a random prompt that I have here; it spits out the answer really fast and then tells you exactly how quick it was: 1,200 tokens per second. That is insane. Just for reference, one word is about 1.25 tokens, so this is roughly 1,000 words per second. That is incredibly fast, much faster than most LLMs out there.

04:38

It's also insanely easy to use. If I scroll down on their homepage, right here it shows that all you have to do to work with Groq is replace the base URL in your OpenAI instance with the Groq API, add in your Groq API key, and then switch out your model. It is just so, so easy: about 10 lines of code, which they show you right here. All you have to do is pip install openai, and boom, this is your code and you're now working with Groq. And with LangChain it's even easier, because you just pip install the langchain-groq package, set your Groq API key as an environment variable, and then you have access to the ChatGroq class that you can import from langchain_groq. You define your model and any other parameters like the temperature, and then you can call llm.invoke, .stream, whatever you'd usually do in LangChain. So it's just so easy to get started with Groq here as well.
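
A minimal sketch of the LangChain route described above, assuming the langchain-groq package and a GROQ_API_KEY environment variable; the model ID is an example to check against Groq's current model list:

    # pip install langchain-groq   (reads GROQ_API_KEY from the environment)
    from langchain_groq import ChatGroq

    # Model ID is an assumption; pick whichever Llama variant Groq currently lists.
    llm = ChatGroq(model="llama-3.1-70b-versatile", temperature=0)

    reply = llm.invoke("In one sentence, when should I self-host my LLM?")
    print(reply.content)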

05:35

I also love n8n, and I put out a lot of content on it, so I'll show you that really quickly as well. In your n8n workflow you'll typically have a Tools Agent node when you're working with large language models, and for the chat model, if I click on this, you can see that Groq is one of the supported options. So no custom integration needed: you can use Groq right out of the box with n8n. All you have to do for your credentials is put in your Groq API key, just like we saw with code, and boom, you have access to all of the models within Groq. Super nice and easy.

06:10

With that, I wanted to talk about the price for Groq as well. Looking at Llama 3.1 70B, it is 59 cents per million tokens, which means that for every $1 you get 1.69 million tokens. That is extremely affordable compared to any closed-source model, really all the powerful ones like GPT-4o and Claude 3.5 Sonnet, and even compared to other services that offer Llama 3.1. Together AI, for example, has a lite version of Llama 3.1 70B that is technically a little bit cheaper, but if you want their turbo version to even come close to matching Groq's speeds, you're going to be paying more. Together AI is a fine service, I've used it before and I like it, but it just shows how insanely affordable Groq is. So there we go: in just a few minutes you now know how to use Groq, what the pricing looks like, and how fast it really is.

07:13

One thing I do want to mention with Groq is that it is still another company hosting the LLM for you, so it's not quite the same as local AI in terms of data privacy. It is still a lot better than sending your data to a closed-source model like GPT or Claude that's going to train on your data and, theoretically, eventually regurgitate it back out to other people. So it's still much better to use something like Groq with Llama, but keep in mind that if you're developing a proof of concept on Groq because you don't want to pay for self-hosting your LLM, you might want to use mock data for anything that's really private while you build out your application. Then, once you scale and you're running things locally, you can work with your private data and not have to worry about anything. So with that, we can now dive into the next step of this strategy.

08:01

All right, I have my calculator up right now, because we are going to do some napkin math to figure out exactly when it makes sense for you to switch from paying per token with a service like Groq to hosting your own LLM, because when you scale your business it will eventually get to the point where paying per token is more expensive than just paying for the hardware. One thing I want to mention is that I'm going to cover what it looks like to pay for a GPU machine in the cloud to run your LLM. You can technically build a computer and keep it in your house or a data center, but then you have to deal with maintenance, pay for electricity, and it's harder to upgrade when the next generation of GPUs comes out in half a year or a year. Generally it's a lot more flexible to just pay for something in the cloud, that's what I would usually recommend, and that's what I'm going to focus on right now.

08:49

There are a lot of services out there for GPU machines. A couple I want to highlight are RunPod, which is what I'm on right now, and DigitalOcean. I'm not affiliated with either of them, but I use both, so that's my recommendation for where to start when looking for somewhere to host your LLMs in the cloud. Now, DigitalOcean is actually pretty pricey and doesn't have a ton of options. I know they are expanding this in the future, since I've done some research because I like DigitalOcean so much, but in the meantime RunPod has much more competitive pricing. You can look at some comparisons, for an H100 for example, and it's definitely a lot cheaper. Also, if you want to run something like Llama 3.1 70B, which is a really classic local LLM, you typically only need something like an A40. That is the machine I want to use for my example, for my napkin math: an A40 is 39 cents per hour in the secure cloud, with 48 GB of VRAM, definitely good enough for Llama 3.1 70B.

09:55

So what we're going to do right now is walk through, with this calculator, step by step, how you can determine exactly when it becomes more expensive to pay per token with Groq. First of all, we want to compute how much it costs per month to use that A40 instance with RunPod. I'm going to take 0.39, because it is 39 cents an hour, multiply it by 24 because there are 24 hours in a day, and then multiply by 30 because there are roughly 30 days in a month, and that gives us a grand total of about $280 a month. So you pay a pretty penny, a few hundred a month, for this instance, but let's do the math and figure out when it makes sense to actually do this instead of going with Groq.

10:41

Going back to the Groq pricing: for Llama 3.1 70B you get 1.69 million tokens for every $1 you spend, so let's plug that into the calculator. I'll clear it completely and add in this number, 1,690,000 tokens per $1. Now, there are a lot of tokens in your typical prompt, especially if you have complex instructions for your LLM, long chat histories, or RAG where you're retrieving multiple big chunks from documents and dumping them into your prompt. The average prompt is actually a good solid few thousand tokens, even up to 10 or 20,000 tokens in cases I have seen. So this is a really rough estimate, but for your use case you will know the average number of tokens per prompt, and that's something you can actually compute: track it in the back end as you develop your proof of concept, start using your application, and bring users on, so you can plug your exact number in and have a better idea of exactly when you would switch from Groq to hosting your own LLM.

11:56

I'm going to go with 5,000 as a good example here. So I'll divide this number by 5,000, because that tells me the number of prompts I can make to my LLM per $1: since it's 1,690,000 tokens per $1, dividing by the number of tokens per prompt gives the number of prompts per $1, which is 338. Now we multiply that by the monthly price of the RunPod instance, because that gives us the number of prompts per month at which RunPod becomes the cheaper option. So this is the magic number here: roughly 94,900 prompts per month before it's worth paying for that RunPod instance to host Llama 3.1 70B, as long as that instance can keep up with the demand, which I'm pretty sure it could with 48 GB of VRAM and, I think, around 80 GB of RAM. And if we divide by 30, that gives us the number of prompts per day. So if you get to around 3,000 prompts per day with your application, you would want to self-host your LLM instead.

13:20

If you think about it (let me get back to that number here, I'll divide by 30 again, I'm not sure why it went away), that's not actually that many prompts per day. If you have 3,000 users on your platform in a day, they could each make one prompt, one call to your LLM, and it would already be worth it to switch over to hosting your own LLM. So I hope that makes sense. I know there are a couple of rough numbers I put in here as examples, but you can make this very accurate to your use case: figure out the instance you want with RunPod, the exact pricing for your model on Groq, and how many tokens you have on average per prompt, and this can come down to an exact science for when you'd make that switch.
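
As a rough aid for that calculation, here is a small Python sketch of the napkin math above; the default numbers (A40 at $0.39/hr, 1.69M tokens per $1, 5,000 tokens per prompt) are the video's examples, not fixed quantities:

    def breakeven_prompts_per_day(gpu_cost_per_hour, tokens_per_dollar, tokens_per_prompt):
        """Prompts per day at which a rented GPU beats paying per token."""
        gpu_cost_per_month = gpu_cost_per_hour * 24 * 30            # A40: 0.39 * 720 ~= $280.80
        prompts_per_dollar = tokens_per_dollar / tokens_per_prompt  # 1,690,000 / 5,000 = 338
        prompts_per_month = prompts_per_dollar * gpu_cost_per_month
        return prompts_per_month / 30

    # Video's example numbers: prints roughly 3164 prompts per day.
    print(round(breakeven_prompts_per_day(0.39, 1_690_000, 5_000)))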

13:58

So there you have it: that is my grand strategy for how you can work with local LLMs effectively, not spending a ton of money up front but eventually scaling to where you're hosting everything yourself and saving thousands and thousands of dollars once your application grows to thousands and thousands of users. Now, Groq pricing is pretty competitive, and you can see from my math that you might want to use it for a very long time, maybe even longer just for the convenience of not having to maintain your own LLMs, but eventually it does get to the point where switching makes so much sense and saves so much money. So I hope you found this strategy and all the logic behind it valuable. If you did, I would really appreciate a like and a subscribe, and with that, I will see you in the next video.


Related Tags
Local LLMs, AI Strategy, Self-hosting, Token Costs, LLM Scaling, Cost Efficiency, Cloud GPUs, Data Privacy, AI Applications, Tech Setup