Using Ollama to Run Local LLMs on the Raspberry Pi 5

Ian Wootten
17 Jan 2024 · 09:29

Summary

TL;DR: The video demonstrates how to use an 8 GB Raspberry Pi 5 to run open-source large language models (LLMs) on a local network. The creator installs and tests models like Tiny Llama and Llama 2, comparing their performance to a MacBook Pro. Tiny Llama runs efficiently, but larger models, like the 7 billion-parameter Llama 2, perform much slower on the Raspberry Pi. The video also showcases image recognition capabilities, albeit at a slow speed. Overall, it highlights the Pi's potential in running LLMs despite its limitations in processing power and speed.

Takeaways

  • 🖥️ The Raspberry Pi 5, with 8 GB of RAM, costs £80 in the UK or $80 in the US and is great for running open-source projects.
  • ⚙️ The video demonstrates running a large language model (LLM) on the Raspberry Pi 5 and compares its performance to a MacBook Pro.
  • 💡 The creator installed and tested Tiny LLaMA, an open-source LLM, on the Raspberry Pi using simple commands.
  • 🌐 Tiny LLaMA was able to process questions and generate text, though its phrasing was different from larger models due to its size.
  • ⚖️ Performance comparison: the Raspberry Pi 5 generated responses at about half the speed of an M1 Pro MacBook Pro, with an eval rate of 12.9 tokens per second.
  • 🚀 The larger LLaMA 2 model was significantly slower on the Raspberry Pi, with an eval rate of 1.78 tokens per second, demonstrating the impact of model size.
  • 🔒 LLaMA 2 uncensored version was used to bypass overzealous filtering found in the default LLaMA 2 model.
  • 🔍 The creator tested the LLaVA model's ability to recognize images, such as a Raspberry Pi board, which it processed successfully but took over five minutes.
  • 🛠️ Smaller models like Tiny LLaMA are recommended for faster performance on the Raspberry Pi, whereas larger models like LLaMA 2 are too slow.
  • 🎬 The video emphasizes the usefulness of the Raspberry Pi 5 for experimenting with LLMs but highlights the need to choose models wisely based on speed and capability.

Q & A

  • What is a Raspberry Pi 5?

    -The Raspberry Pi 5 is a small, affordable computer designed for educational use and loved by makers. The model mentioned in the script has 8 GB of RAM and costs around £80 in the UK or $80 in the US.

  • What is the main purpose of the video described in the script?

    -The video's main purpose is to demonstrate how to use the 8 GB Raspberry Pi 5 to run an open-source large language model (LLM) on a local network and compare its performance to other devices like the MacBook Pro.

  • What open-source large language models are mentioned in the script?

    -The script mentions several open-source models, including Mixtral, LLaMA 2, Tiny LLaMA, Code LLaMA, and Mistral, as well as the LLaVA multimodal model.

  • How does the Tiny LLaMA model perform on the Raspberry Pi 5?

    -The Tiny LLaMA model was successfully installed and tested on the Raspberry Pi 5. It generated output at an evaluation rate of 12.9 tokens per second, about half the rate the presenter achieved on a MacBook Pro with an M1 Pro processor.
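
    For context, Ollama reports these statistics when a model is run with the --verbose flag. A minimal sketch of the commands behind this benchmark, using the model name as listed in the Ollama library:

        # Run Tiny Llama interactively; timing stats (eval rate etc.) print after each reply
        ollama run tinyllama --verbose
        >>> Why is the sky blue?

    The eval rate line at the end of each response is the tokens-per-second figure quoted throughout the video.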

  • What are the performance benchmarks for the Raspberry Pi 5 running larger models like LLaMA 2?

    -When running the LLaMA 2 uncensored model (7 billion parameters), the performance was slower, with an evaluation rate of 1.78 tokens per second. This is significantly slower compared to Tiny LLaMA due to the larger model size.

  • Why did the presenter choose the uncensored version of LLaMA 2?

    -The presenter chose the uncensored version of LLaMA 2 because the standard version applies more restrictions, which may prevent the model from providing certain information, such as regular expressions (regex) in Python.
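
    A sketch of how this step might be reproduced, assuming the llama2-uncensored model name as published in the Ollama library:

        # Pull and run the uncensored 7B Llama 2 variant, with timing stats enabled
        ollama run llama2-uncensored --verbose
        >>> Can you write a regular expression to match email addresses?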

  • What was the presenter’s experience with running image interpretation on the Raspberry Pi 5?

    -The presenter tested the LLaVA model's ability to interpret an image of a Raspberry Pi, which worked but was very slow, taking over 5 minutes to generate a response. The model described the image accurately without relying on external services.
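
    With Ollama's multimodal models such as LLaVA, an image is passed by including its path in the prompt. A minimal sketch (the path and file name below are illustrative, not the exact ones from the video):

        # Run the LLaVA multimodal model; an image path in the prompt is sent to the model
        # (the path below is illustrative)
        ollama run llava --verbose
        >>> What's in this picture? /home/pi/Downloads/image.jpg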

  • What challenges did the presenter face when running large models on the Raspberry Pi 5?

    -The primary challenges were related to the slower processing speed and the high memory requirements of larger models like LLaMA 2, resulting in much slower token generation rates compared to smaller models like Tiny LLaMA.

  • What are the presenter's recommendations for running LLMs on a Raspberry Pi 5?

    -The presenter recommends using smaller models, such as Tiny LLaMA or Mistral, due to their faster performance on the Raspberry Pi 5, which has limited hardware capabilities compared to more powerful machines like the MacBook Pro.

  • What additional features of LLaMA does the presenter mention?

    -The presenter briefly mentions Ollama's API functionality, which can be explored further in other videos or tutorials.
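
    The API referred to here is Ollama's local REST API, which listens on port 11434. A minimal sketch of a generate request (the hostname is an assumption; Ollama also binds to localhost only by default, so exposing it on the network requires setting OLLAMA_HOST):

        # Query the Ollama REST API running on the Pi
        # (replace raspberrypi.local with the Pi's actual hostname or IP;
        #  set OLLAMA_HOST=0.0.0.0 on the Pi to accept connections from the LAN)
        curl http://raspberrypi.local:11434/api/generate -d '{
          "model": "tinyllama",
          "prompt": "Why is the sky blue?",
          "stream": false
        }'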

Outlines

00:00

🖥️ Introduction to Raspberry Pi 5 and its Capabilities

The Raspberry Pi 5, released a few months ago, is a tiny computer designed for schools and makers. This version comes with 8 GB of RAM and costs around £80 in the UK or $80 in the US. The video focuses on using this computer to run a large language model (LLM) on a local network. The speaker compares the performance of the Pi 5 against a MacBook Pro, beginning with installing and running a Tiny LLaMA model. The model installs quickly, and the Pi 5 processes tasks at a respectable rate, although its performance is lower compared to the MacBook Pro. The fan activates during high CPU usage, indicating the computer's processing demands. The Tiny LLaMA runs smoothly, responding to prompts with decent results, despite the model's smaller size.
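
For concreteness, the installation shown in the video boils down to Ollama's published one-line install script followed by a model run; a minimal sketch:

    # Install Ollama on Raspberry Pi OS using the official install script
    curl -fsSL https://ollama.com/install.sh | sh

    # Pull the 1.1B-parameter Tiny Llama model and start an interactive chat
    ollama run tinyllama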

05:03

💡 Testing LLaMA 2 Uncensored and Performance Observations

The speaker moves on to testing the LLaMA 2 uncensored model, which has a larger 7-billion parameter size, requiring the full 8 GB of RAM on the Raspberry Pi 5. The model runs significantly slower compared to the smaller Tiny LLaMA, confirming that it’s not well-suited for this setup. The speaker emphasizes that the uncensored version of LLaMA 2 allows more flexibility in responses, avoiding overly restrictive system prompts. They prompt the model to generate a regular expression (regex) for matching email addresses, and while it does respond, the slower speed highlights the limitations of running larger models on the Pi.
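
To check that a 7B model fits in the Pi's 8 GB, and to watch the CPU load that makes the fan kick in, the stock Raspberry Pi OS tools are enough; a small sketch of the sort of checks that apply here:

    # Show total and available memory before loading the 7B model
    free -h

    # Watch per-core CPU load while the model generates (install with: sudo apt install htop)
    htop

    # Read the SoC temperature via the Raspberry Pi firmware tool
    vcgencmd measure_temp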

Keywords

💡Raspberry Pi

The Raspberry Pi is a small, affordable computer used for educational purposes and by hobbyists for a wide range of projects. In the video, the Raspberry Pi 5 with 8 GB of RAM is highlighted as the device being used to run an open-source large language model. The Raspberry Pi is important because it allows users to perform complex tasks, such as running AI models, on a budget-friendly platform.

💡Raspberry Pi 5

Raspberry Pi 5 is the latest version of the Raspberry Pi computer, offering improved performance and specifications, such as 8 GB of RAM. In the video, the creator uses the Raspberry Pi 5 to install and run large language models, showcasing its ability to handle computationally demanding tasks despite its small size and low cost.

💡Llama

Llama is an open-source large language model mentioned in the video. The user installs and runs a smaller version called Tiny Llama on the Raspberry Pi 5. Llama models are used for natural language processing tasks such as answering questions and generating text, and the video demonstrates how even a small version can function on a compact device like the Raspberry Pi.

💡Tiny Llama

Tiny Llama is a smaller version of the Llama language model that can be run on less powerful hardware, such as the Raspberry Pi. In the video, the creator successfully runs Tiny Llama on their Raspberry Pi 5 to test its performance and responsiveness, illustrating its utility for lightweight applications where resource availability is limited.

💡MacBook Pro

The MacBook Pro is used as a point of comparison in the video to demonstrate the performance differences between running large language models on a high-end laptop versus the Raspberry Pi. The creator notes that the Raspberry Pi's performance is slower than the MacBook Pro’s, but still impressively capable given the Pi’s lower cost and size.

💡Tokens per second

Tokens per second is a performance metric used to measure how quickly a language model generates output. In the video, the Raspberry Pi is shown to handle around 12.9 tokens per second for Tiny Llama, compared to higher speeds on the MacBook Pro. This metric helps to assess the efficiency of the device when processing text.
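
For a sense of scale: at 12.9 tokens per second, a 200-token answer arrives in roughly 16 seconds (200 / 12.9 ≈ 15.5 s), whereas at Llama 2's 1.78 tokens per second the same answer takes nearly two minutes (200 / 1.78 ≈ 112 s).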

💡Llama 2

Llama 2 is a larger and more advanced version of the Llama language model. The video creator installs and runs the uncensored version of Llama 2 on the Raspberry Pi 5. However, due to its larger size and memory requirements, the model runs much slower than Tiny Llama, showcasing the limitations of running such large models on a small device.

💡Uncensored model

The uncensored version of Llama 2 removes some of the restrictions typically placed on language models, allowing the user to access content that might otherwise be blocked. In the video, the creator uses the uncensored version because it provides more flexibility when generating responses, such as writing regular expressions that the restricted version might block.

💡Regular expression (regex)

A regular expression is a sequence of characters that define a search pattern, often used for pattern matching in strings. In the video, the creator asks Llama 2 to generate a regular expression to match email addresses. This showcases the model’s ability to perform specific programming-related tasks, even though the larger model runs slower on the Raspberry Pi.
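
As an illustration only (the video does not show the model's full pattern), a typical email-matching regex can be sanity-checked from the shell:

    # Test a common email-matching pattern using grep's extended regex syntax;
    # the address is printed if it matches, nothing otherwise
    echo "user@example.com" | grep -E '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'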

💡Image interpretation

Image interpretation refers to the ability of a model to analyze and describe the content of an image. In the video, the creator tests the Raspberry Pi's ability to run a model that interprets an image of a Raspberry Pi. Although the process is slow, the model successfully provides a detailed description of the image, demonstrating the potential for running more complex tasks on this hardware.

Highlights

Introduction of Raspberry Pi 5 with 8GB RAM, available for £80 in the UK or $80 in the US.

Demonstration of running an open-source large language model (LLM) on Raspberry Pi 5.

Successfully installed and ran the Tiny Llama model on the Raspberry Pi 5.

Tested Tiny Llama with the classic question 'Why is the sky blue?', which it answered successfully, if oddly phrased.

Tiny Llama performance evaluation: achieved a 12.9 tokens per second generation rate, about half the speed of an M1 Pro MacBook Pro.

Experiment with running the larger Llama 2 uncensored model, highlighting the significant difference in performance.

Llama 2 uncensored model was much slower compared to Tiny Llama, processing at only 1.78 tokens per second.

Llama 2 model required 8GB RAM, matching Raspberry Pi 5's capacity but showed performance limitations.

Installed and tested the LLaVA model for image interpretation, analyzing a picture of a Raspberry Pi.

Successfully interpreted the image of a Raspberry Pi, recognizing the circuit board and its components.

LLaVA image processing took over 5 minutes, demonstrating slow performance on Raspberry Pi 5.

Conclusion: Tiny Llama is a more practical option for Raspberry Pi 5 compared to larger models like Llama 2.

Discussion on how using smaller models like Tiny Llama or Mistral improves performance on limited hardware.

Mention of Raspberry Pi 5's fan kicking in during LLM processing, indicating significant CPU usage.

Final takeaway: Running LLMs on Raspberry Pi is feasible but larger models are significantly slower due to hardware constraints.

Transcripts

00:00

This tiny computer is a Raspberry Pi. It's made for schools and loved by makers. More specifically, this is the Raspberry Pi 5, which was released a few months ago. This version is the 8 GB of RAM model and costs just £80 in the UK or $80 in the US, if you're lucky enough to be able to get hold of one. This tiny computer can be used for many things, but specifically in this video I want to show you how you can use that 8 GB of RAM for running an open-source large language model on your own network, and what sort of benchmarks we can get versus, say, something like the MacBook Pro that I've used Ollama on in the past. So, that all said, let's get started.

00:42

I'm on my Pi 5, and I'm going to try and install Ollama on it and see how it goes. I should be able to run the Ollama instructions and just see how they pan out. So if we just copy that curl command and paste it... see how that does. Okay, cool, that seems to have just gone in and installed straight away. If you're not familiar with Ollama, you can go and pick up any of the models it's got listed here: we've got Mixtral, we've got Llama 2, Tiny Llama, Code Llama. I'm just going to try and run Tiny Llama at this point; we can just run it and it will pull down that model, so let's see how we do. I've never run Tiny Llama before, so this is going to be a new one for me. I'm running Raspberry Pi OS, as you can see, and I've updated everything and installed all the latest packages. I haven't installed anything else; I literally just installed Ollama there.

01:45

Okay, cool, so it's pulled down everything. Let's ask it a question, the classic: why is the sky blue? See how that does. "The sky blue is a natural color. Why is the sky blue?" Oh, that's interesting, the way it's phrased that. I'm guessing this is basically down to it being Tiny Llama, which is not as big a model as the other options. Okay, so it actually works, which is superb. I'm pretty surprised that it got installed so quickly and was so easy. I'm going to try out a few things. The fan did kick in on the heatsink there when I was trying things, so it is obviously using the CPU a bit. We can run this with the verbose flag: if I do "ollama run tinyllama --verbose", I think it is, then ask the same thing, we should get some stats out in terms of how fast it's generating those responses. Now, when I was doing this on my M1 Pro, on Llama (not Tiny Llama), we were getting about 20 a second, I think, and on my M1 I think it was like 17 or something like that. So: an eval rate of 12.9 tokens a second. That is not bad. That's the prompt eval; the eval rate is 10 tokens a second, so roughly half what I was getting on the M1 Pro, which is not too shabby.

03:28

We could actually do a better comparison if we pull down the other model. So we type /bye to come out, and then do "ollama run llama2". I'm actually going to pull down the uncensored one, because Llama 2 is pretty restrictive; it's pretty aggressive with the restrictions it applies. I think in my other video I asked it for regexes in Python and it wouldn't give me the answer to those regexes, because it felt that they were inappropriate and that I might be trying to do nefarious things with them. This is saying it's going to take about 10 minutes, so that's obviously a 4 gig model. Now just wait a second and let that pull down.

04:22

Okay, cool, that's all finished downloading. As well as the Llama 2 uncensored model, I've pulled down LLaVA too, because I wanted to check how well it copes with doing image interpretation. So let's first run the Llama 2 uncensored model and see how that fares, and in fact let's do that with the verbose command again. I'm going to prompt it with "Can you write a regular expression to match email addresses?" When I did this in a previous video, this is the reason for using the uncensored version: then this doesn't get caught. Like I said, it's a little overzealous, and that generally is to do with the initial system prompt. You can see that this is much slower than the Tiny Llama that we were running.

05:33

Okay, so it's doing it in JavaScript. I didn't actually specify that, or Python, but there we go, that's fine. I have no idea if that's going to match an email address well. This is really slow in comparison, so you'd probably want to be using one of those smaller models. This is the 7 billion parameter model; I didn't state that, but it says on the Ollama website, under the Llama 2 uncensored model, that 7 billion parameter models generally require 8 gig of RAM, which is what we've got here, but you can see that it's not fast. Okay, so you can see there we've got an eval rate of 1.78, tiny in comparison to what we had just now with Tiny Llama; obviously the model is that much bigger. I think Tiny Llama is a 3 billion parameter model; let me have a squiz at the website. In fact, no, it's a 1.1 billion parameter model, which is obviously a lot smaller. We're going from 1.1 billion to 7 billion parameters and getting a much slower eval rate. So this is probably not the way you want to go; you probably want to be using something like Mistral on this, or in fact Tiny Llama is a good option, because it seemed to be going pretty fast.

07:05

I'm going to try this image as well. I've downloaded an image into Downloads; it's a picture of the Raspberry Pi. Let me see if I can get it to understand that, because it would be pretty awesome to know that it can do that as well. So let's run LLaVA, and we'll run that verbose as well. (Man, I've got an absolute tweet storm going on in a tree in my garden; this happens all the time.) Okay, so: let's see what's in this picture, /home/…/Downloads/image.jpg, I think that's what it was called. Okay, let's go. Wow, this is slow, and you get no feedback, which is the other thing here; we're not seeing anything aside from a spinner.

08:16

And it's finally responding with an answer. Here we go: "The image features a close-up of the back of a computer circuit board. The green and yellow computer board has many screws on it attaching various components. The detailed view showcases the inner workings of electronic devices such as laptops or computers." So it's obviously looked at that image and it understands it, and it's done all that locally, which is really impressive. It hasn't gone out to a third-party service in order to do that, and it hasn't been able to pick anything out from the image file name, because I've made sure that it's not identifiable from what I've named the file. So that's really impressive, but it's incredibly slow. How long did that take? Total duration: 5 minutes 33 seconds. So, a long time.

09:01

We've obviously got all of the features that Ollama has as well, such as the API stuff. You can go and check my previous videos if you want to see how to do that. But yeah, I hope you found this useful. Let me know if you're going to be trying it out on your own Raspberry Pi. I'll speak to you soon in a new video, and check out one of my other videos on Ollama; there'll be one popping up in a minute, probably. Okay, bye for now. Bye.


Related Tags
Raspberry Pi, Llama 2, AI models, Open-source, Tiny Llama, Performance comparison, MacBook Pro, Tech tutorial, AI benchmarks, Local processing