Ollama.ai: A Developer's Quick Start Guide!

Maple Arcade
1 Feb 2024 · 26:31

Summary

TLDR: This video offers a developer's perspective on the Ollama interface, explaining where it fits among AI development tools and where it may be heading. It discusses the evolution from API-based interactions with large language models to the need for on-device processing driven by latency limits and legal restrictions. The video explores solutions like WebML and introduces Ollama, which allows running large models on consumer GPUs for real-time inference. It also covers various models, including Llama 2, Mistral, and LLaVA, demonstrating their capabilities and use cases, such as summarizing URLs and analyzing images. The script concludes by showing how to access these models via the REST API, highlighting the potential of locally hosted AI for user experience and performance.

Takeaways

  • 🌐 The video discusses the evolution and current state of AI development tools, particularly focusing on the limitations of cloud-hosted large language models and the benefits of on-device AI models.
  • 🔌 Large language models traditionally required API calls and cloud infrastructure, but this approach has limitations in latency, data sensitivity, and real-time processing needs.
  • 🏥 In sensitive industries like healthcare and finance, there are legal restrictions on sending patient or financial data to cloud-based models due to privacy concerns.
  • 🎥 Use cases such as live streaming or video calling require real-time inference capabilities, which are not feasible with cloud-based models that introduce latency.
  • 🌐 WebML offers a solution for client-side inference through libraries like TensorFlow.js and Hugging Face's Transformers.js, allowing models to run directly in the browser.
  • 💾 WebML allows quantized versions of models to be stored in the browser cache for real-time inference, but it is limited to web applications and constrained by model loading times.
  • 🖥️ The video introduces Ollama, an interface that enables fetching and running large language models on consumer GPUs, providing more flexibility for desktop applications.
  • 📚 The script covers various models like Llama 2, Mistral, and LLaVA, highlighting their capabilities, sizes, and use cases, including summarizing URLs and multimodal tasks.
  • 🔍 The video demonstrates how to interact with these models using both the command-line interface (CLI) and REST API calls, showcasing the versatility of on-device AI models.
  • 🔄 The process of pulling and running models locally includes downloading the model, spinning up an instance, and interacting with it to perform tasks like summarization or image analysis.
  • 📝 The script also touches on the philosophical and ethical considerations of open-source AI models, discussing the importance of avoiding cultural biases and maintaining model openness.

Q & A

  • What is the main focus of the video?

    -The video provides a developer's perspective on the Ollama interface, discussing how it fits into the landscape of AI development tools and its potential future impact.

  • Why were large language models initially limited to running on big organizations' infrastructures?

    -Large language models were initially limited to big organizations' infrastructures because they required significant computational resources and were typically accessed via API calls, which had limitations in terms of latency and data privacy.

  • What are some limitations of using API calls to interact with large language models?

    -API calls have limitations such as potential delays in response times, which can be problematic for real-time applications, and privacy concerns when dealing with sensitive information that cannot be sent to cloud-hosted models.

  • How does using WebML with libraries like TensorFlow.js or Hugging Face Transformers.js address some of the limitations of API calls?

    -WebML allows developers to fetch quantized versions of models that are smaller in size and run them in the browser cache, enabling real-time inferences without the need to send data to a backend server.

  • What is the significance of client-side rendering for certain applications like live streaming or video calling apps?

    -Client-side rendering is crucial for applications that require real-time processing, such as live streaming or video calling apps, where waiting for a response from a backend API would not provide a seamless user experience.

  • What is the promise of Ollama and how does it differ from WebML?

    -Ollama is an interface that allows large language models to be fetched and run in the client environment, including on consumer GPUs. Unlike WebML, which is limited to web browsers, Ollama can be used for desktop applications and other environments that require local model inference.

  • What are some popular models that can be fetched and run locally using Ollama?

    -Some popular models include Llama 2, developed by Meta; Mistral, which is gaining popularity for its performance; and LLaVA, a multimodal model that can process both images and text.

  • What are the system requirements for running the 7B and 70B versions of the Llama 2 model?

    -The 7B version of Llama 2 requires 8GB of RAM, while the 70B version requires 64GB of RAM, indicating that larger models need more substantial system resources to run effectively.

  • How does the video demonstrate the use of locally hosted models for summarizing URLs?

    -The video shows how, by using a locally hosted model like Mistral, a URL can be summarized on-device without needing to send the request to a remote server, which can be more efficient and privacy-preserving.

  • What is the philosophical argument made by the creators of the uncensored Llama 2 models regarding alignment in AI models?

    -The creators argue that truly open large language models should not have alignment built into them, as it can be influenced by popular culture and may not represent diverse perspectives. Instead, they advocate for models that remain unbiased and open to various cultural influences.

  • How can developers interact with locally hosted models using REST API calls?

    -Developers can send REST API calls to a locally hosted web API, specifying the model name and other parameters in the request body. The API returns the inference results in a JSON object, which the developer can format as needed. A rough sketch of the request and response shape follows this list.
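As a rough sketch of the request and response shape described in the last answer, the TypeScript below posts to the /api/generate endpoint on port 11434 that the video demonstrates; any fields beyond model, response, and done are assumptions and may differ between Ollama versions.

    // Hypothetical response shape for a non-streaming /api/generate call;
    // only model, response and done are confirmed by the video.
    interface GenerateResponse {
      model: string;       // e.g. "mistral"
      created_at?: string; // assumed metadata field
      response: string;    // the full inference text when stream is false
      done: boolean;
    }

    async function capitalDemo(): Promise<void> {
      const res = await fetch('http://localhost:11434/api/generate', {
        method: 'POST',
        body: JSON.stringify({
          model: 'mistral',
          prompt: 'What is the capital of India?',
          stream: false,   // one JSON object instead of a token stream
        }),
      });
      const reply = (await res.json()) as GenerateResponse;
      console.log(reply.response);
    }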

Outlines

00:00

🤖 Large Language Models and Their Evolution

The video discusses the shift from API-based interaction with large language models to on-device deployment. Initially, these models ran on large organizations' infrastructure and were accessed via APIs, which brought limitations such as rate limits and latency. The video highlights the need for on-device models in sensitive industries like healthcare and finance, as well as for real-time applications like live captioning. It introduces the concept of WebML and libraries like TensorFlow.js and Hugging Face's Transformers.js, which allow smaller, quantized models to run on the client side for real-time inference.
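As a minimal sketch of that browser-side approach (not something shown in the video), the snippet below uses Hugging Face's Transformers.js via the @xenova/transformers package to pull a small quantized summarization model into the browser cache and run it on-device; the model name is only an example.

    // Client-side summarization with Transformers.js; the quantized weights
    // are downloaded on first use and then served from the browser cache.
    import { pipeline } from '@xenova/transformers';

    export async function summarizeOnDevice(text: string): Promise<string> {
      const summarizer = await pipeline('summarization', 'Xenova/distilbart-cnn-6-6');
      const output = await summarizer(text, { max_new_tokens: 128 });
      return (output as any)[0].summary_text;
    }

The same pipeline() call covers other tasks such as translation or object detection, which is what makes this approach attractive for quick client-side features.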

05:02

🚀 Introduction to Ollama and Its Benefits

This section introduces Ollama, an interface that enables developers to fetch large language models into the client environment and run them on consumer GPUs. It emphasizes the advantages of running models locally, such as avoiding legal restrictions on data transfer and reducing latency. The video provides a walkthrough of how to download and set up Ollama, and how to choose different models and their versions, including Llama 2 and Mistral, which are highlighted for their popularity and performance.
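The walkthrough drives Ollama through its CLI (ollama run llama2, ollama run mistral); as a hedged sketch, the pull step can also be scripted against the local REST service on port 11434. The /api/tags and /api/pull endpoints below come from Ollama's API documentation rather than this video, so treat the exact request and response shapes as assumptions.

    // Make sure a model is available locally before spinning it up.
    const OLLAMA = 'http://localhost:11434';

    async function ensureModel(name: string): Promise<void> {
      // /api/tags lists the models already pulled to this machine (assumed shape).
      const tags = await fetch(`${OLLAMA}/api/tags`).then(r => r.json());
      const installed = (tags.models ?? []).some((m: { name: string }) => m.name.startsWith(name));
      if (!installed) {
        // Roughly equivalent to `ollama pull <name>` on the CLI.
        await fetch(`${OLLAMA}/api/pull`, {
          method: 'POST',
          body: JSON.stringify({ name, stream: false }),
        });
      }
    }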

10:03

🔍 Exploring Multimodal Models and Their Capabilities

The script delves into multimodal models like LLaVA, which can process both images and text. It demonstrates the process of fetching and interacting with LLaVA, showcasing its ability to analyze images and provide detailed responses based on the visual content. The video also touches on other models like Code Llama, designed for developers, and discusses the philosophical aspects of open-source models, including the importance of avoiding cultural biases and the concept of alignment in AI models.
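In the video the image path is typed straight into the CLI prompt; a programmatic equivalent might look like the sketch below, assuming Ollama's generate endpoint accepts base64-encoded images for LLaVA-style models and that a Node 18+ runtime provides the global fetch.

    import { readFileSync } from 'node:fs';

    // Ask LLaVA to describe a local image by sending it as a base64 payload.
    async function describeImage(path: string): Promise<string> {
      const image = readFileSync(path).toString('base64');
      const res = await fetch('http://localhost:11434/api/generate', {
        method: 'POST',
        body: JSON.stringify({
          model: 'llava',
          prompt: 'What is this image about?',
          images: [image],   // assumed field: raw base64, no data: URL prefix
          stream: false,
        }),
      });
      return (await res.json()).response;
    }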

15:06

📚 Hands-On with Local Model Inference and REST API Integration

The speaker provides a practical demonstration of using locally hosted models for inference, showing how to pull models like Llama 2 and Mistral and interact with them via the command-line interface. It also covers summarizing URLs and using multimodal models to analyze images. Additionally, the video shows how to send REST API calls to a locally hosted model, Mistral, to get inference responses, and how to format these responses as JSON, highlighting the flexibility of on-device AI models.
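A minimal sketch of that REST interaction, assuming a Mistral model has already been pulled and that Ollama is listening on the default port 11434 shown in the video:

    // POST /api/generate with stream disabled so the whole answer arrives
    // as a single JSON object instead of token-by-token chunks.
    async function ask(prompt: string, model = 'mistral'): Promise<string> {
      const res = await fetch('http://localhost:11434/api/generate', {
        method: 'POST',
        body: JSON.stringify({ model, prompt, stream: false }),
      });
      const data = await res.json();
      return data.response;
    }

    // The prompt-engineering trick from the video: ask the model to fill a
    // JSON template, then cut the JSON back out of the returned string.
    // const raw = await ask('Populate the following JSON: {"capital": ""} for India');
    // const json = JSON.parse(raw.slice(raw.indexOf('{'), raw.lastIndexOf('}') + 1));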

20:08

🌐 Philosophical Considerations and Open-Source Models

The video script touches on the philosophical debate around open-source AI models, referencing an article by George Sun and Jared H that argues for the importance of keeping models truly open and unbiased by any single culture. It mentions the uncensored versions of Llama 2 models as examples of this philosophy in action, emphasizing the ethical considerations in AI development and the potential societal impacts.

25:08

🔗 Wrapping Up with Local Model Deployment and API Interaction

The final part of the script wraps up the discussion by summarizing the capabilities of local model deployment and the ease of interacting with these models via REST API calls. It reiterates the benefits of reduced latency and the ability to handle sensitive data locally, and encourages viewers to explore the use of these technologies in their own projects.

Keywords

💡Developer's Perspective

This term refers to the viewpoint or approach taken by software developers when dealing with a particular subject or technology. In the video, the developer's perspective is essential for understanding how to interact with AI models and the challenges faced in their integration and deployment. It is used to frame the discussion on how developers can utilize AI tools effectively.

💡API Calls

API stands for Application Programming Interface, which is a set of rules and protocols for building software applications. In the context of the video, API calls are used to interact with large language models hosted on cloud infrastructure. The script discusses the limitations of relying on API calls for real-time applications and the need for on-device processing.

💡WebML

WebML refers to machine learning models that run in a web browser environment. The video mentions WebML in the context of using libraries like TensorFlow.js or Hugging Face Transformers.js to enable real-time inferences on the client side, such as in a browser, without the need to send data to a server.

💡Quantized Models

Quantization in machine learning involves reducing the precision of the numbers used in a model to make it smaller and faster. In the script, quantized models are smaller versions of large language models that can be stored in a browser cache and used for real-time inference on the client side.
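As a toy illustration of the idea (not any particular library's scheme), the snippet below maps 32-bit float weights onto 8-bit integers plus a single scale factor, which is roughly why a quantized model lands at around a quarter of the original size:

    // Symmetric 8-bit quantization: store int8 values plus one float scale.
    function quantize(weights: Float32Array): { q: Int8Array; scale: number } {
      const maxAbs = weights.reduce((m, w) => Math.max(m, Math.abs(w)), 0) || 1;
      const scale = maxAbs / 127;
      const q = new Int8Array(weights.length);
      for (let i = 0; i < weights.length; i++) {
        q[i] = Math.round(weights[i] / scale);
      }
      return { q, scale };
    }

    // Approximate reconstruction used at inference time.
    function dequantize(q: Int8Array, scale: number): Float32Array {
      return Float32Array.from(q, v => v * scale);
    }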

💡Client-Side Rendering

Client-side rendering refers to the process of generating a graphical user interface on the client's device rather than on the server. The video discusses the necessity of running large language models on the client side for applications like live streaming or video calling, where real-time processing is crucial.

💡Llama 2

Llama 2 is a large language model developed by Meta (formerly Facebook). The video script discusses Llama 2 as one of the models that can be fetched and run in a local environment using the Ollama interface, highlighting its capabilities and the different versions available for different use cases.

💡Mistral

Mistral is another large language model gaining popularity, as mentioned in the script. It is noted for outperforming Llama 2 in certain benchmarks, making it a preferred choice for some developers looking to integrate AI models for tasks like summarizing text or understanding content.

💡Multimodal Model

A multimodal model is capable of processing and understanding multiple types of data, such as text, images, and audio. The video introduces LLaVA as a multimodal model that can take images and text as input and generate responses based on the context it sees, showcasing the advancement in AI's ability to understand and process different forms of data.

💡Local Inference

Local inference refers to the process of running AI models on a local device rather than relying on cloud-based services. The script emphasizes the benefits of local inference, such as reduced latency and the ability to process sensitive data without sending it over the internet.

💡ONNX Runtime

ONNX Runtime is an open-source engine for running machine learning models that conform to the Open Neural Network Exchange (ONNX) format. While not explicitly mentioned in the script, the concept of ONNX Runtime is relevant to the discussion of running models locally, as it is a tool that can be used to execute these models efficiently on various platforms.

💡Ethical Considerations

The video touches on the ethical implications of AI models, particularly the philosophical debate around whether open-source models should have alignment built in. The script mentions the uncensored variants of Llama 2, which raise questions about the influence of popular culture on AI and the importance of maintaining an unbiased and open AI model.

💡REST API

REST API, or Representational State Transfer API, is a set of guidelines for implementing networked applications that use HTTP. The script demonstrates how to interact with locally hosted AI models via REST API calls, which allows developers to send requests and receive responses in a standardized way, facilitating integration with other software components.

Highlights

Developer's perspective on Ollama and its interface explained.

Ollama's role in the schema of AI development tools and its future implications.

Historical context of large language models running on big organizations' infrastructures and limitations of API-based interaction.

Challenges with real-time responses and legal restrictions on sending sensitive data to cloud-hosted models.

Introduction to WebML and TensorFlow.js as a solution for client-side rendering.

Demonstration of running TensorFlow.js for real-time object detection from webcam feed.

Limitations of WebML in terms of model loading times and user experience.

The need for large language models on desktop for certain applications, such as live captioning plugins.

Overview of Ollama as an interface to fetch and run large language models on client environments.

Instructions on setting up Ollama, including downloading and the model fetching process.

Details on different models available, such as Llama 2, Mistral, and their respective requirements.

Demonstration of summarizing a URL using on-device large language models.

Introduction to LLaVA, a multimodal model capable of processing images and text.

Live demonstration of LLaVA's inference capabilities on provided images.

Discussion on the philosophical aspect of open-source models and whether alignment should be built into them.

Accessing large language models via REST API and formatting the response as needed.

Practical examples of interacting with locally hosted models using API calls.

Transcripts

play00:00

so in this video I'm going to give you a

play00:01

developers perspective on AMA and I'm

play00:04

going to tell you everything that you

play00:06

need to know for this interface and I'm

play00:09

going to give you how it fits in the

play00:11

overall schema of all the AI development

play00:14

tools that we have as developers and how

play00:17

I see it panning out in the future I'm

play00:19

going to tell you everything you need to

play00:20

know about this in this video so when uh

play00:23

large language models were introduced

play00:25

they used to run in these big

play00:27

organizations infrastructures and we

play00:29

used to be able to interact with them

play00:32

using API calls for example if you were

play00:34

a python developer you would probably

play00:36

install a python package if you were a

play00:39

Node.js developer you would install an npm

play00:41

package and you would use that package

play00:44

as an interface to send API calls to

play00:47

these backend Cloud hosted large

play00:49

language models and get Json response

play00:52

back over the years we found out that

play00:54

that approach has its limitations not

play00:56

only the fact that we would get

play00:58

eventually rate limited but also in a

play01:01

lot of cases sending a query back to the

play01:04

API and waiting for 5 seconds to get the

play01:06

response back is not just going to cut

play01:08

it if I take an example of let's say if

play01:11

you are building a solution for a

play01:12

healthcare client then you might be

play01:15

legally restricted from sending

play01:18

sensitive patient information back to

play01:20

this Cloud hosted large language models

play01:22

or if you're working for a financial

play01:24

industry client then also you might not

play01:26

be able to send sensitive information

play01:28

back to this large language models that

play01:30

are hosted in the cloud also there are

play01:32

certain other use cases so if you're

play01:33

building a automatic captioning plug-in

play01:36

for a live streaming app or a video

play01:39

calling app then you are not going to

play01:41

send a request to this backend API and

play01:44

wait for 5 seconds right you'll have to

play01:46

run this large language models on the

play01:49

client so basically you need client side

play01:51

rendering in such cases and there are

play01:53

two ways the development Community has

play01:55

tried to solve it one is webml so we're

play01:59

talking about the the tensorflow JS

play02:00

Library which is very popular uh or

play02:03

hugging faces Transformers JS Library

play02:05

all these libraries allow you to fetch a

play02:08

quantized version of these models which

play02:10

is much smaller in size probably 100 MB

play02:12

or so and we get to fetch this small

play02:15

quantized models and store them in browser

play02:18

cache and run inferences based upon that

play02:21

so you'll be able to run real-time

play02:23

inferences I have some projects that I

play02:25

worked on and I have made a video where

play02:28

I am using the TensorFlow.js COCO-SSD

play02:31

model I'm fetching the quantized

play02:34

version of it and which is about 100 MB

play02:36

in size and I running real-time

play02:39

inferences uh to detect objects from the

play02:42

webcam feed and whenever a person is

play02:45

detected I'm recording a 30 second clip

play02:48

and saving that video in the downloads

play02:50

folder so for this kind of use cases you

play02:52

are not going to send a request to the

play02:54

back end backend API frame by frame that

play02:57

you are getting from the webcam feed

play02:59

rather that you want to run realtime

play03:01

inference that is running on device but

play03:03

as you can see for webml you are limited

play03:06

by the fact that whenever you you'll be

play03:08

loading this web page the model also has

play03:10

to load obviously the model has to load

play03:12

only once until or unless you clear the

play03:13

browser cache but still you are limited

play03:16

by the fact that you don't want this

play03:18

model to load for a very long time

play03:20

because that is not going to be the

play03:22

right user experience on top of that

play03:24

there are certain other apps that are

play03:27

also running on a desktop that might

play03:30

also want to use this large language

play03:31

model inferences right so let's say for

play03:33

example I build this app as a web app

play03:37

and what if I want to package this and

play03:39

share it as a desktop app right now with

play03:42

webml I'm only bound within the web

play03:45

browser so if I take the example of live

play03:48

captioning where I'm building a plugin

play03:50

for let's say for example Zoom where I

play03:53

want to create live captioning for this

play03:56

video call then webml is just not going

play03:58

to cut it I need this l large language

play04:00

models to be running in the desktop or

play04:02

if I take another example so Adobe has a

play04:04

nice tool if I go to Adobe speech

play04:07

enhancer so in this you get to upload a

play04:10

very long audio and so as you can see it

play04:13

is about 43 minutes of audio you get to

play04:15

upload this audio and it enhances the

play04:18

audio and makes the studio quality so it

play04:20

not only suppresses the noise it also

play04:22

applies some artificial intelligence Bas

play04:25

based settings onto the audio so that it

play04:28

becomes very high quality you might want

play04:30

to try it this is really good but the

play04:32

problem is that let's say for example

play04:34

you are working in DaVinci Resolve and you

play04:36

want the audio quality to be enhanced by

play04:39

this AI you'll have to export

play04:42

that as a audio file you'll have to

play04:45

upload it here then again align the

play04:47

audio with the video and then do it for

play04:50

all the audio tracks that you have

play04:51

recorded think about all the B-rolls that

play04:54

you might have recorded and aligning all

play04:56

of them and doing it by uploading it to

play04:59

this web app if you had this AI

play05:02

models running in the desktop and if you

play05:04

had created a resolve plug-in you'd be

play05:07

able to utilize that locally hosted

play05:10

artificial intelligence model and you

play05:13

would probably be able to do it within

play05:14

your desktop environment within the

play05:16

resolve without leaving the environment

play05:18

of the D resolve so that is the

play05:21

basically the promise of Ollama so this

play05:24

is nothing but a interface that allows

play05:27

you to fetch this large language models

play05:30

onto the client environment so they are

play05:32

saying we allow you to fetch the large

play05:35

fetch and run these large language

play05:37

models on consumer GPU so if you have a

play05:41

dedicated GPU or some extra Ram that is

play05:44

going to be more power to you you'll be

play05:46

able to fetch and run even larger models

play05:49

but on basic average consumer GPU this

play05:52

allows you to fetch these models and

play05:55

run to set up Ollama you'll have to go to

play05:58

this ollama.ai and click on this download

play06:01

but before I click on this download I'm

play06:03

going to go to the models page where

play06:05

you'll be able to see that the list of

play06:07

models is quite large and Lama 2 is one

play06:11

of the most popular models that is

play06:13

developed by meta and if I go to this

play06:16

tags section you'll be able to see that

play06:18

you can run or rather pull this model

play06:21

using this command so whenever when you

play06:24

will run this command for the first time

play06:26

this model is going to be pulled onto

play06:28

your desktop and it is going to spin up

play06:30

an instance of that specific large

play06:32

language model and as you can see the

play06:34

size is 3.8 GB this is a default model

play06:37

that gets pulled uh using this command

play06:39

but then again there are so many other

play06:42

versions so the default version is

play06:44

basically the 7B version and the chat

play06:48

fine-tuned versions of it so if you read

play06:50

the documentation so as they mention it

play06:53

here and each models are different

play06:55

obviously this Lama 2 model is a large

play06:58

language model there are so many other

play06:59

large language models like like for

play07:00

example Mistral which is a more popular

play07:03

model at this moment so they mention

play07:05

that the 7B model requires 8GB of RAM

play07:08

and the ram requirements and everything

play07:10

and 70b requires 64 GB of RAM and they

play07:13

also mention details about the model

play07:15

variants they mentioned that the chat

play07:17

version is basically the chat uh the

play07:21

finetune version for for chat use cases

play07:25

and the text version so if you use the

play07:28

tag text you get the model without the

play07:31

chat fine tuning so if I go back to the

play07:33

tag section this is a default version

play07:35

which is the 7B chat version that you

play07:37

get but then you get the 13B version as

play07:40

well and obviously the text and chat

play07:42

versions of those models as well and if

play07:45

you scroll down to the bottom you will

play07:47

see that you get to pull the actual

play07:50

model in itself that is a 70 billion

play07:52

parameter version with floating Point

play07:55

Precision of 16 and that is how meta

play07:57

train this model and as you can see that

play07:59

that is of size 138 GB and you can pull

play08:02

it using this specific tag so if you

play08:04

have the capability ideally you get to

play08:06

pull the full size model and run it

play08:08

locally and run inferences on that so

play08:11

let me show you some other models so

play08:13

Mistral is gaining a lot of popularity

play08:15

these days due to the fact that as they

play08:18

have mentioned that it outperforms the

play08:20

Llama 2 13 billion parameter version the

play08:22

7B Mistral uh actually outperforms the 13

play08:26

billion parameter version of Lama 2 so

play08:29

if I go to the tag section the the

play08:32

default version is the 7 billion version

play08:34

7 billion parameter version which is

play08:36

about size 4.1 GB and if you remember

play08:40

the 13 billion version of Lama 2 was

play08:42

about 8 GB in size so it's about half

play08:45

the size which has outperformed in all

play08:47

the benchmarks and that is why Mistral is one

play08:49

of the most popular large language

play08:51

models at this moment and if you scroll

play08:53

to the bottom you'll see the floating

play08:55

Point 16 Precision 7 billion parameter

play08:58

version is only about size 14 GB much

play09:01

smaller much smaller uh large language

play09:04

model and you get to pull it using this

play09:06

specific command if you wish to do so so

play09:08

let me show you some other models so

play09:10

there is this model called LLaVA which is

play09:13

a multimodal model and as you know GPT-4

play09:17

is also a multimodal model and a lot of

play09:19

people are saying that 2024 is going to

play09:21

be the year of multimodal AI as you

play09:24

know that Apple has launched it Vision

play09:26

Pro and meta has launched with the

play09:28

collaboration of Ray-Ban its Meta smart

play09:31

glasses so you might have some seen some

play09:33

videos of the multimodal applications

play09:36

and LLaVA also you can say is a

play09:40

completely free open-source alternative

play09:43

to GPT-4 and it does not generate any

play09:46

image but it can take as input an image

play09:51

or multiple images and text and it can

play09:54

respond based on the context that it

play09:56

sees in the image as well as in the text

play09:59

then there are some other models that

play10:00

are like Code Llama which is also very

play10:03

popular and this is also developed by

play10:06

meta and if you want to generate some

play10:07

kind of solution for the coders then

play10:10

this specific llm can be of use to you

play10:13

so with that let me go ahead and

play10:15

download this so as you can see it is

play10:18

not yet available for Windows if you go

play10:21

to the GitHub and do some digging you

play10:23

will find that there are some

play10:24

workarounds that you can use to run it

play10:27

on Windows but right now the official

play10:29

support is for Mac OS and

play10:32

Linux so as you can see this is only 160

play10:35

MB in size so this is basically

play10:37

downloading the interface using which

play10:39

you'll be able to interact with these

play10:41

backend large language models by that I

play10:44

mean you'll be able to pull those models

play10:46

onto your desktop you'll be able to run

play10:48

those models pin up instances of those

play10:50

models and you'll be able to also

play10:52

interact with these models using by

play10:55

writing some kind of text and get

play10:57

inferences back as response on top of

play11:00

that it also spins up uh web API in your

play11:04

local host and you get to interact with

play11:06

it by sending a post request to it so if

play11:09

I go to any of the models let's say I'm

play11:11

going to the Mr model so there are two

play11:14

ways of interacting with it one is using

play11:16

the CLI and another is using the API

play11:18

call so you get to send a post request

play11:21

to a locally hosted web API and that is

play11:24

exposed from this 11434 port and you get

play11:28

to send uh post request and you get a

play11:30

response back so I'm going to demo that

play11:33

uh just in a bit so after that is

play11:35

downloaded I'll go ahead and install it

play11:38

I'll click on open and it says that it

play11:40

works best if you move it to

play11:42

Applications directory this is very

play11:44

straightforward installation for mac and

play11:47

it is moved to applications and now I

play11:49

can search for Ollama and as you can see it

play11:52

just comes up in the taskbar that means

play11:54

Ollama is running in the back end the

play11:56

interface is running in the back end it

play11:57

hasn't yet spun up any instance of any

play12:01

large language model in fact we haven't

play12:03

even fetched any large language model onto

play12:04

our desktop so for that we'll have to

play12:07

open the terminal so I'm going to bring

play12:08

up the terminal and we can run any model

play12:11

here so let's say for example I want to

play12:13

first fetch this Llama 2 model and I can

play12:17

go to the tag section and choose a

play12:20

specific model that can be like the text

play12:22

model which is without any fine tuning

play12:25

or we can get the default chat model or

play12:27

I can get the 13 billion parameter

play12:29

version and whatever I want here but I'm

play12:32

just going to go with the default one

play12:34

which is ollama run llama2 which is also the

play12:38

command we see in the overview so I'm

play12:41

going to run this command here in the

play12:43

terminal and it is going to pull the

play12:46

model first so as you can see the 3.8gb

play12:49

sized model is being pulled onto our

play12:51

desktop and as soon as it is pulled it

play12:54

is going to spin up a instance of that

play12:55

model and will allow us to interact with

play12:58

it using this command line interface

play13:00

itself so this is going to take some

play13:02

time um depending upon upon your

play13:04

internet connection so I'm going to come

play13:07

back when this is done so as you can see

play13:09

it has downloaded the model first and

play13:11

then some additional small files and it

play13:14

has spun up a instance of this Lama 2

play13:17

model and it allows us to interact with

play13:20

it via the command line interface so I'm

play13:23

going to type a simple message

play13:27

here okay so it says that why don't

play13:30

scientist trust atoms because they make

play13:32

up everything I find that pretty amazing

play13:35

it's not too bad uh so let me stop this

play13:37

and open a new model um so you can hit

play13:42

controll D and that is going to stop the

play13:45

instance of this llama 2 and that is

play13:47

going to free up any resources that it's

play13:49

it's using but it is going to keep this

play13:51

model so next time if you want to run

play13:53

you just run the same command and it is

play13:55

going to just spin up the instance of

play13:57

this large language model model now let

play13:59

me bring up the Mistral model so I have

play14:02

gone ahead and fetched this Mistral model

play14:05

using ollama run mistral and it has also spun

play14:08

up instance of the of the same model

play14:11

using the Mistral model we can also do large

play14:14

language model tasks so let's use it for

play14:17

uh summarizing a URL earlier when chat

play14:20

GPT were launched that time we had the

play14:22

ability to paste in a URL and get a

play14:25

summary of that specific web page but I

play14:28

don't know why they have stopped that

play14:30

feature but now uh we can do that using

play14:33

this uh on device large language models

play14:36

so I'll go and fetch uh an Aeon essay

play14:41

this is one of the websites that are

play14:42

really good for uh reading essays and

play14:46

these are usually very very long essays

play14:48

and I'm just going to copy this URL and

play14:51

paste it before that I'm going to say

play14:53

that please summarize this URL and this

play14:56

is going to start summarizing this is

play14:58

all all running on device and I really

play15:01

don't know about the content of this

play15:02

essay so I think it is doing a decent

play15:05

job as it picks up the ears and

play15:09

everything it's a very long essay and

play15:11

the summary is also pretty long so I

play15:13

think this is doing a really good job of

play15:16

summarizing it so this is one of the llm

play15:18

task of summarizing URLs that you can

play15:20

run now on device by fetching these

play15:23

models so let me go ahead and fetch a

play15:25

multimodal model and for that I'm going

play15:28

to use LLaVA which is a completely

play15:31

free open source alternative to GPT-4 so

play15:33

you can pull this using this command I

play15:36

have already pulled it so if I run this

play15:39

command this is just going to spin up

play15:41

the instance so I'm going to stop this

play15:43

stop the instance of Mistral by hitting

play15:45

control D and command K will clear the

play15:49

terminal window and then run the command

play15:51

to spin up an instance of

play15:53

lava so what I've done is that I have

play15:56

gone ahead and saved a few images on my

play15:59

desktop and I'm going to first pass

play16:01

along these images and then ask

play16:03

questions based on this that image so I

play16:06

can I can get the path of that image by

play16:08

hitting command I as you can say see

play16:10

this is a coffee Shop's image and let's

play16:13

see how this multimodal model does with

play16:15

this image I'm going to copy the path

play16:17

here and I'm going to write what is this

play16:20

image about and I'm going to say 1. jpg

play16:24

because that is the name of the image so

play16:26

it says that it has added the image the

play16:28

context and then trying to generate

play16:35

response and that's really really a lot

play16:38

of details so if I open the image up you

play16:41

can see the inference that it has

play16:43

generated is really good um it says that

play16:46

um it not only detects that there are two chairs and

play16:49

it is placed in front of a door It also

play16:51

says that there's a coffee cup

play16:53

indicating someone has been enjoying a hot

play16:56

beverage and they they the Handbags and

play16:59

everything it can also detect the cars I

play17:01

think these are cars covered I'm not

play17:03

sure but it says it also detects the

play17:06

objects that are outdoors and I'm not

play17:09

sure if it is actually correct but it is

play17:12

also saying that it is suggesting that

play17:14

this is uh that this is close to a

play17:17

roadway I think that is really really

play17:19

good so let let's try it with another

play17:21

image so I'm just going to change the

play17:23

name here because I have named them as 1

play17:26

two three let me bring up that image

play17:28

well I think there is a lot of details

play17:30

and it also says that possibly it's a

play17:33

promotional photo so it is not saying

play17:35

that this is a promotional photo it says

play17:36

that there is a likelihood that this is

play17:38

a promotional photo and I think this is

play17:40

indeed a promotional photo and that that

play17:42

is really good inference in my opinion

play17:44

so let's try it with another the third

play17:46

image and I'll bring up the image here

play17:49

so this is this says uh I like the fact

play17:51

that this actually talks about the

play17:53

possibilities and as it seems as though

play17:56

and what could have been uh the case and

play17:59

those kind of things and it also talks

play18:00

about the mood that it sees in the

play18:02

picture that it seems as though it had

play18:04

been it has been left behind unattended

play18:07

for quite some time and this is really

play18:09

good in in my opinion I'd like to know

play18:12

what you guys think about it and uh the

play18:14

environment appears to be deserted so it

play18:17

talks about that mood and it also talks

play18:19

about the what the what might be the

play18:21

possibilities and please note that this

play18:23

is all running on device and I'm using a

play18:26

M1 Mac which is very old uh so if you

play18:29

have M3 Mac or a dedicated GPU then you

play18:33

might actually have a much better

play18:35

experience and you might as well pull in

play18:37

even a larger model and get much more

play18:40

detail inference out of it so let's uh

play18:43

throw in a little bit different task at

play18:46

it so I have pulled in a econ economics

play18:49

image so this is basically the 2,000

play18:51

years of economic history in one chart

play18:54

this this shows the GDP share of

play18:56

different countries over the last 2,000

play18:58

years so let's see how this multimodal

play19:01

model does for this specific image so

play19:04

I'm going to write what do you infer

play19:07

from this image well I think that would

play19:10

that is a very generic inference uh I

play19:13

mean it's a infographic and it talks

play19:16

about the gdps of different countries

play19:18

and that is very generic so let me try

play19:20

with a different uh prompt and see how

play19:23

that goes with it also it did did detect

play19:26

the year as a

play19:28

and when I prompted it using so when I

play19:31

prompted it uh with the trend keyword it

play19:35

detected year as the axis but uh

play19:38

unfortunately it did not do a pretty

play19:41

good job I mean the spike that it is

play19:43

talking about is probably this one 19th

play19:46

mid 19th century so that means it would

play19:49

be 1850s so this is probably the spike

play19:52

that it is talking about um where there

play19:54

is an increase in the um in the GDP

play19:58

but that is actually not the case

play20:00

because this is 100% because this is a

play20:03

100% uh in total so some countries have

play20:07

grown and that has come at a cost of

play20:10

other countries so it has not been able

play20:12

to detect that probably because the

play20:14

country names are on the right side and

play20:16

on the left side there is this

play20:18

percentage chart and I would I would

play20:19

agree that this is probably one of the

play20:21

uh not a very easy chart to read for a

play20:24

machine learning model but uh I would

play20:27

like to test it with other charts or I

play20:30

would like to test this image with gp4

play20:33

and see how that performs so these are

play20:36

three very popular models that we tried

play20:38

also there are so many models you also

play20:40

get some uh really open source

play20:43

completely open uncensored models so

play20:45

these are llama 2's uncensored models so

play20:49

Creator George sun and Jared H they have

play20:52

written a very nice article if you are

play20:54

interested into the philosophical aspect

play20:57

of whether whether open source model

play20:59

should be uh should have alignment they

play21:02

are talking about what are the alignment

play21:04

issues and what the philosophical

play21:06

aspects of having alignment built into

play21:09

open source models this does show you

play21:11

that how this Ollama uh takes up the

play21:14

concept of truly open so if you read

play21:17

through this article they are making a

play21:18

very strong argument that uh there

play21:20

should not be any alignment built into

play21:23

the truly open large language models and

play21:26

the way this large language model are

play21:28

trained does not have the censoring or

play21:31

alignment built into it these are built

play21:33

on after this training is done these are

play21:36

built on top of it and these alignments

play21:38

are usually influenced by pop cultures

play21:41

and they're arguing that there should

play21:43

not be any one single popular culture

play21:46

that this model should follow and it

play21:48

should be truly open that is the strong

play21:50

argument that they are making in this

play21:52

article and based on that ethos the Llama 2

play21:55

uncensored model is built and you you

play21:58

can download and run it locally using

play22:00

this Ollama just by running this command and

play22:03

if you are into the philosophical aspect

play22:06

of artificial intelligence and its

play22:08

impact on society then this article is a

play22:11

must read I'm going to put a link of

play22:13

this in the description but again you

play22:15

can just come to this llama 2 uncensored

play22:18

models page and at the bottom you get

play22:20

this link with that there is one last

play22:23

thing that I'd like to show which is the

play22:25

accessing this large language models via

play22:28

this rest API so we have already

play22:30

installed the Mistral large language model

play22:33

and let's use the API route to send rest

play22:37

API calls to this locally hosted large

play22:40

language models so I'm going to open vs

play22:42

code and I'm going to hit command shift

play22:44

R to bring up the Thunder Client and if this

play22:47

is not installed you might want to use

play22:50

Postman or if you want to use this

play22:53

Thunder Client only then you can hit command

play22:55

shift X to bring up the extension panel

play22:58

and search for Thunder client and you

play23:00

can install this so I'm just going to go

play23:02

to this Thunder Client which is

play23:05

already installed and I'm going to click

play23:07

on new request I'm to close this sidebar

play23:10

and send a post request to this Local

play23:13

Host so before I do that if I go to this

play23:16

uh URL which is if I go to this specific

play23:19

Port it says that Ollama is running so

play23:21

we'll be able to send uh API call is

play23:24

running because we can see that this is

play23:26

coming in the taskbar so let's send a

play23:29

API request so here I'll be pasting that

play23:32

URL API generate and I'll be sending a

play23:35

post request in the body I will mention

play23:38

the model name and there is one thing I

play23:40

I'll also add which is the stream I'll

play23:43

set it as false so whenever a locally

play23:47

hosted model runs it sets the stream as

play23:50

true that is how you can see each of

play23:52

those token printed one after the other

play23:54

but since we are running sending it as a

play23:57

rest API call I would like to get the

play24:00

all the response at one time in one

play24:03

single Json object and how I know that

play24:05

is actually they haven't mentioned

play24:08

this API documentation link here but I

play24:10

think they need to update that link uh

play24:13

in in this model page but if you go to

play24:15

any other models page you get to you get

play24:18

this link API documentation just under

play24:20

this API and here this page actually

play24:24

serves uh for all the models so you see

play24:27

Llama 2 here you also see Mistral you see

play24:32

the LLaVA you see if you search for Mistral

play24:35

you'll get Mistral as well here so this

play24:37

page actually serves the purpose for all

play24:39

the all the models and this link should

play24:41

also be there in the Mistral uh API

play24:44

documentation so as you can see by

play24:46

default it says that a stream Json

play24:49

object is returned so basically you get

play24:51

a stream object but you can set the

play24:54

stream to false so true is the default

play24:58

option and when you set it as false you

play25:00

get the whole response in one JSON

play25:03

object so that is what I'm going to do

play25:05

here and hit send so I'm sending a post

play25:08

request to a locally hosted web API with

play25:10

the body which mentions the model name

play25:12

as Mistral and I'm just asking for

play25:15

probably uh okay so let me ask something

play25:17

else let me ask like what is the capital

play25:21

of India and send and we get the

play25:23

response the capital of India is New

play25:25

Delhi and it does give a lot of other

play25:28

responses uh designed and built by

play25:30

prominent British Architects and

play25:33

everything a lot of details are given

play25:35

you can format it in Json format so in

play25:37

the prompt you can say that please

play25:40

populate the following Json which says

play25:43

let's see how how this goes well it does

play25:46

I mean okay so at least you can extract

play25:49

the Json so if you add a filter between

play25:51

this uh just to extract a Json from the

play25:54

string and then you do get the Json uh

play25:57

and the answer in a nicely formatted way

play26:00

also it does add another new line and it

play26:03

says New Delhi is the capital city of

play26:06

India so you get to format the response

play26:09

in a way that you want by just doing

play26:13

some kind of prompt engineering but the

play26:15

point is that all of these are running

play26:17

in a locally hosted environment and you

play26:19

are able to send a API request to this

play26:22

Local Host port and get inference

play26:25

response

play26:26

back


Related Tags
AI Development, Local Hosting, Large Models, Real-time Inference, WebML, Client-Side, TensorFlow JS, Transformers JS, Healthcare Compliance, Financial Data, Multimodal AI