Ollama.ai: A Developer's Quick Start Guide!
Summary
TLDR: This video offers a developer's perspective on the Ollama interface, explaining its role among AI development tools and its future prospects. It discusses the evolution from API-based interactions with large language models to the need for on-device processing, driven by rate limits, latency, and legal restrictions. The video explores solutions like WebML and introduces Ollama, which allows running large models on consumer GPUs for real-time inference. It also covers various models, including Llama 2, Mistral, and LLaVA, demonstrating their capabilities and use cases, such as summarizing URLs and analyzing images. The video concludes by showing how to access these models via a REST API, highlighting the potential for locally hosted AI to improve user experience and performance.
Takeaways
- The video discusses the evolution and current state of AI development tools, focusing on the limitations of cloud-hosted large language models and the benefits of on-device AI models.
- Large language models traditionally required API calls and cloud infrastructure, but this approach has limitations in latency, data sensitivity, and real-time processing needs.
- In sensitive industries like healthcare and finance, there are legal restrictions on sending patient or financial data to cloud-based models due to privacy concerns.
- Use cases such as live streaming or video calling require real-time inference capabilities, which are not feasible with cloud-based models that introduce latency.
- WebML offers a solution for client-side rendering through libraries like TensorFlow.js and Hugging Face's Transformers.js, allowing models to run directly in the browser.
- WebML allows quantized versions of models to be stored in the browser cache for real-time inference, but it is limited to web applications and can have loading constraints.
- The video introduces Ollama, an interface that enables fetching and running large language models on consumer GPUs, providing more flexibility for desktop applications.
- The video covers models like Llama 2, Mistral, and LLaVA, highlighting their capabilities, sizes, and use cases, including summarizing URLs and multimodal tasks.
- The video demonstrates how to interact with these models using both the command-line interface (CLI) and REST API calls, showcasing the versatility of on-device AI models.
- Pulling and running models locally involves downloading the model, spinning up an instance, and interacting with it to perform tasks like summarization or image analysis.
- The video also touches on the philosophical and ethical considerations of open-source AI models, discussing the importance of avoiding cultural biases and keeping models truly open.
Q & A
What is the main focus of the video?
-The video provides a developer's perspective on the Ollama interface, discussing how it fits into the landscape of AI development tools and its potential future impact.
Why were large language models initially limited to running on big organizations' infrastructures?
-Large language models were initially limited to big organizations' infrastructures because they required significant computational resources and were typically accessed via API calls, which had limitations in terms of latency and data privacy.
What are some limitations of using API calls to interact with large language models?
-API calls have limitations such as potential delays in response times, which can be problematic for real-time applications, and privacy concerns when dealing with sensitive information that cannot be sent to cloud-hosted models.
How does using WebML with libraries like TensorFlow.js or Hugging Face Transformers.js address some of the limitations of API calls?
-WebML allows developers to fetch quantized versions of models that are smaller in size and run them in the browser cache, enabling real-time inferences without the need to send data to a backend server.
What is the significance of client-side rendering for certain applications like live streaming or video calling apps?
-Client-side rendering is crucial for applications that require real-time processing, such as live streaming or video calling apps, where waiting for a response from a backend API would not provide a seamless user experience.
What is the promise of Ollama and how does it differ from WebML?
-Ollama is an interface that allows large language models to be fetched and run in the client environment, including on consumer GPUs. Unlike WebML, which is limited to web browsers, Ollama can serve desktop applications and other environments that require local model inference.
What are some popular models that can be fetched and run locally using Ollama?
-Some popular models include Llama 2, developed by Meta; Mistral, which is gaining popularity for its performance; and LLaVA, a multimodal model that can process both images and text.
What are the system requirements for running the 7B and 70B versions of the Llama 2 model?
-The 7B version of Llama 2 requires 8GB of RAM, while the 70B version requires 64GB of RAM, indicating that larger models need more substantial system resources to run effectively.
How does the video demonstrate the use of locally hosted models for summarizing URLs?
-The video shows how, by using a locally hosted model like Mistral, a URL can be summarized on-device without needing to send the request to a remote server, which can be more efficient and privacy-preserving.
What is the philosophical argument made by the creators of the uncensored Llama 2 models regarding alignment in AI models?
-The creators argue that truly open large language models should not have alignment built into them, as it can be influenced by popular culture and may not represent diverse perspectives. Instead, they advocate for models that remain unbiased and open to various cultural influences.
How can developers interact with locally hosted models using REST API calls?
-Developers can send REST API calls to a locally hosted web API, specifying the model name and other parameters in the request body. The API will return the inference results in a JSON object, which can be formatted as needed by the developer.
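A minimal sketch of such a call (endpoint and fields as demonstrated in the video; the prompt is illustrative):

```bash
# POST to the locally hosted Ollama API on port 11434
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
# With "stream": false, the whole inference arrives as one JSON object
# whose "response" field holds the generated text
```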
Outlines
Large Language Models and Their Evolution
The video discusses the shift from API-based interaction with large language models to on-device deployment. Initially, these models ran within large organizations and were accessed via APIs, which had limitations such as rate limits and latency. The video highlights the need for on-device models in sensitive industries like healthcare and finance, as well as for real-time applications like live captioning. It introduces the concept of WebML and libraries like TensorFlow.js and Hugging Face's Transformers.js, which allow smaller, quantized models to run on the client side for real-time inference.
Introduction to Ollama and Its Benefits
This section introduces Ollama, an interface that enables developers to fetch large language models into the client environment and run them on consumer GPUs. It emphasizes the advantages of running models locally, such as avoiding legal restrictions on data transfer and reducing latency. The video walks through downloading and setting up Ollama and choosing between different models and their versions, including Llama 2 and Mistral, which are highlighted for their popularity and performance.
Exploring Multimodal Models and Their Capabilities
The video delves into multimodal models like LLaVA, which can process both images and text. It demonstrates the process of fetching and interacting with LLaVA, showcasing its ability to analyze images and provide detailed responses based on visual content. The video also touches on other models like Code Llama, designed for developers, and discusses the philosophical aspects of open-source models, including the importance of avoiding cultural biases and the concept of alignment in AI models.
Hands-On with Local Model Inference and REST API Integration
The speaker provides a practical demonstration of local model inference, showing how to pull models like Llama 2 and Mistral and interact with them via the command-line interface, including summarizing URLs and using multimodal models to analyze images. The video also shows how to send REST API calls to a locally hosted model, Mistral, get inference responses, and format those responses as JSON, highlighting the flexibility and power of on-device AI models.
Philosophical Considerations and Open-Source Models
The video touches on the philosophical debate around open-source AI models, referencing an article by George Sung and Jarrad Hope that argues for keeping models truly open and unbiased by any single culture. It mentions the uncensored versions of Llama 2 as examples of this philosophy in action, emphasizing the ethical considerations in AI development and their potential societal impacts.
Wrapping Up with Local Model Deployment and API Interaction
The final part of the script wraps up the discussion by summarizing the capabilities of local model deployment and the ease of interacting with these models via REST API calls. It reiterates the benefits of reduced latency and the ability to handle sensitive data locally, and encourages viewers to explore the use of these technologies in their own projects.
Keywords
- Developer's Perspective
- API Calls
- WebML
- Quantized Models
- Client-Side Rendering
- Llama 2
- Mistral
- Multimodal Model
- Local Inference
- ONNX Runtime
- Ethical Considerations
- REST API
Highlights
Developer's perspective on Ollama and its interface explained.
Ollama's role in the schema of AI development tools and its future implications.
Historical context of large language models running on big organizations' infrastructures and limitations of API-based interaction.
Challenges with real-time responses and legal restrictions on sending sensitive data to cloud-hosted models.
Introduction to WebML and TensorFlow.js as a solution for client-side rendering.
Demonstration of running TensorFlow.js for real-time object detection from webcam feed.
Limitations of WebML in terms of model loading times and user experience.
The need for large language models on desktop for certain applications, such as live captioning plugins.
Overview of Ollama as an interface to fetch and run large language models in client environments.
Instructions on setting up Ollama, including the download and model-fetching process.
Details on different models available, such as Llama 2, Mistral, and their respective requirements.
Demonstration of summarizing a URL using on-device large language models.
Introduction to LLaVA, a multimodal model capable of processing images and text.
Live demonstration of LLaVA's inference capabilities on provided images.
Discussion of the philosophical aspects of open-source models and the question of alignment.
Accessing large language models via REST API and formatting the response as needed.
Practical examples of interacting with locally hosted models using API calls.
Transcripts
So in this video I'm going to give you a developer's perspective on Ollama. I'm going to tell you everything you need to know about this interface, how it fits into the overall schema of all the AI development tools we have as developers, and how I see it panning out in the future. When large language models were introduced, they used to run on big organizations' infrastructure, and we used to interact with them using API calls. For example, if you were a Python developer you would probably install a Python package; if you were a Node.js developer you would install an npm package; and you would use that package as an interface to send API calls to these backend, cloud-hosted large language models and get a JSON response back. Over the years we found out that this approach has its limitations: not only the fact that we would eventually get rate limited, but also that in a lot of cases, sending a query to the API and waiting five seconds to get the response back is just not going to cut it.
Take an example: if you are building a solution for a healthcare client, you might be legally restricted from sending sensitive patient information to these cloud-hosted large language models, and if you're working for a financial industry client, you likewise might not be able to send sensitive information to models hosted in the cloud. There are also other use cases: if you're building an automatic captioning plugin for a live streaming app or a video calling app, you are not going to send a request to a backend API and wait five seconds, right? You'll have to run these large language models on the client. So you basically need client-side rendering in such cases, and there are two ways the development community has tried to solve it.
One is WebML. We're talking about the TensorFlow.js library, which is very popular, or Hugging Face's Transformers.js library. These libraries allow you to fetch a quantized version of these models, which is much smaller in size, probably 100 MB or so; you fetch these small quantized models, store them in the browser cache, and run inferences on top of that, so you're able to run real-time inferences. I have some projects I've worked on, and I have made a video where I use the TensorFlow.js COCO-SSD model: I fetch the quantized version of it, which is about 100 MB in size, and run real-time inferences to detect objects from the webcam feed; whenever a person is detected, I record a 30-second clip and save that video to the Downloads folder. For this kind of use case, you are not going to send every frame you get from the webcam feed to a backend API; you want to run real-time inference on-device.
But as you can see, with WebML you are limited by the fact that whenever you load the web page, the model also has to load. Obviously the model only has to load once, until or unless you clear the browser cache, but you are still limited by the fact that you don't want the model to take a very long time to load, because that is not going to be the right user experience. On top of that, there are other apps running on the desktop that might also want to use these large language model inferences. Let's say, for example, I build this app as a web app: what if I want to package it and share it as a desktop app? With WebML I'm bound to the web browser. So if I take the example of live captioning, where I'm building a plugin for, say, Zoom to create live captions for a video call, WebML is just not going to cut it; I need these large language models running on the desktop.
Or take another example: Adobe has a nice tool, Adobe Speech Enhancer. You upload a very long audio file, in this case about 43 minutes of audio, and it enhances the audio and makes it studio quality. It not only suppresses the noise, it also applies some AI-based settings to the audio so that it becomes very high quality; you might want to try it, it's really good. But the problem is, let's say you are working in DaVinci Resolve and you want the audio quality enhanced by this AI: you'd have to export the audio file, upload it here, then align the audio with the video again, and do that for all the audio tracks you have recorded. Think about all the B-roll you might have recorded, aligning all of it, and doing this by uploading it to this web app. If you had these AI models running on the desktop, and if someone had created a Resolve plugin, you'd be able to utilize that locally hosted AI model and do it all within your desktop environment, inside Resolve, without leaving DaVinci Resolve at all.
That is basically the promise of Ollama. It is nothing but an interface that allows you to fetch these large language models onto the client environment. As they put it, it allows you to fetch and run these large language models on consumer GPUs. If you have a dedicated GPU or some extra RAM, more power to you: you'll be able to fetch and run even larger models. But even on a basic, average consumer GPU, it allows you to fetch these models and run them.
To set up Ollama, you'll have to go to ollama.ai and click Download. But before I click Download, I'm going to go to the models page, where you'll be able to see that the list of models is quite large. Llama 2 is one of the most popular models, developed by Meta, and if I go to its Tags section you'll see that you can run, or rather pull, this model using a single command. When you run that command for the first time, the model is pulled onto your desktop and an instance of that specific large language model is spun up. As you can see, the size is 3.8 GB; that is the default model that gets pulled by this command, but there are many other versions. The default is basically the 7B version with chat fine-tuning.
If you read the documentation, as they mention there, each model is different. This Llama 2 model is one large language model among many others, like Mistral, which is a more popular model at the moment. They mention that the 7B model requires 8 GB of RAM and the 70B model requires 64 GB of RAM, and they also give details about the model variants: the chat version is fine-tuned for chat use cases, and the text version, which you get by using the text tag, is the model without the chat fine-tuning. Going back to the Tags section, the default is the 7B chat version, but you also get the 13B version, along with the text and chat variants of those models. And if you scroll down to the bottom, you'll see that you can pull the actual model itself: the 70-billion-parameter version at floating-point 16 precision, which is how Meta trained the model. As you can see, that one is 138 GB in size, and you can pull it using its specific tag. So if you have the capability, ideally you can pull the full-size model, run it locally, and run inferences on it.
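As a rough sketch of what those tag-based pulls look like in the terminal (the tag names are illustrative; the Tags page on ollama.ai lists the exact ones):

```bash
# Default pull: the 7B chat-tuned Llama 2 (~3.8 GB)
ollama pull llama2

# Variant pulls by tag (check the Tags page for exact tag names)
ollama pull llama2:13b              # 13-billion-parameter version
ollama pull llama2:7b-text          # 7B without chat fine-tuning
ollama pull llama2:70b-chat-fp16    # full-precision 70B (~138 GB)
```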
Let me show you some other models. Mistral is gaining a lot of popularity these days due to the fact that, as they have mentioned, it outperforms the Llama 2 13-billion-parameter version: the 7B Mistral actually beats the 13B version of Llama 2. If I go to the Tags section, the default version is the 7-billion-parameter version, which is about 4.1 GB in size. If you remember, the 13-billion-parameter version of Llama 2 was about 8 GB, so Mistral is about half the size and has outperformed it on all the benchmarks; that is why Mistral is one of the most popular large language models at this moment. And if you scroll to the bottom, you'll see the floating-point-16-precision, 7-billion-parameter version is only about 14 GB, a much smaller large language model, and you can pull it with its specific command if you wish to do so.
Let me show you a couple more. There is this model called LLaVA, which is a multimodal model. As you know, GPT-4 is also a multimodal model, and a lot of people are saying that 2024 is going to be the year of multimodal AI: Apple has launched its Vision Pro, and Meta has launched its smart glasses in collaboration with Ray-Ban, so you might have seen some videos of multimodal applications. LLaVA, you could say, is a completely free, open-source alternative to GPT-4. It does not generate any images, but it can take an image, or multiple images, plus text as input, and it can respond based on the context it sees in the image as well as in the text. Then there are some other models, like Code Llama, which is also very popular and also developed by Meta; if you want to build some kind of solution for coders, that specific LLM can be of use to you.
So with that, let me go ahead and download this. As you can see, it is not yet available for Windows; if you go to the GitHub repo and do some digging, you will find some workarounds you can use to run it on Windows, but right now the official support is for macOS and Linux.
As you can see, the download is only 160 MB in size, because this is basically just the interface through which you'll interact with these backend large language models. By that I mean you'll be able to pull those models onto your desktop, run them, spin up instances of them, and interact with them by writing text and getting inferences back as responses. On top of that, it also spins up a web API on your localhost, and you can interact with it by sending POST requests. If I go to any of the models, say the Mistral model, there are two ways of interacting with it: one is using the CLI, and the other is the API call, where you send a POST request to a locally hosted web API exposed on port 11434 and get a response back. I'm going to demo that in just a bit.
After it's downloaded, I go ahead and install it. I click Open, and it says it works best if you move it to the Applications directory; it's a very straightforward installation on a Mac. Once it's moved to Applications, I can search for Ollama, and as you can see it just comes up in the menu bar. That means Ollama is running in the background; the interface is running, but it hasn't spun up an instance of any large language model yet. In fact, we haven't even fetched any large language model onto our desktop. For that, we'll have to open the terminal.
I bring up the terminal, where we can run any model. Let's say I want to fetch this Llama 2 model first. I could go to the Tags section and choose a specific variant, like the text model without any fine-tuning, the default chat model, or the 13-billion-parameter version, whatever I want, but I'm just going to go with the default one: ollama run llama2, which is also the command we saw on the overview page. I run this command in the terminal, and it pulls the model first; as you can see, the 3.8 GB model is being pulled onto our desktop, and as soon as it's pulled, it spins up an instance of that model and lets us interact with it through this command-line interface itself. This is going to take some time depending on your internet connection, so I'm going to come back when it's done.
As you can see, it has downloaded the model first, then some additional small files, and it has spun up an instance of this Llama 2 model, and it allows us to interact with it via the command-line interface. So I'm going to type a simple message here. Okay, it says: why don't scientists trust atoms? Because they make up everything. I find that pretty amazing; it's not too bad. Let me stop this and open a new model. You can hit Ctrl+D, and that is going to stop the instance of this Llama 2 and free up any resources it was using, but it keeps the model, so the next time you want to run it, you just run the same command and it simply spins up an instance of the large language model again.
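Here is a minimal sketch of that pull-run-stop cycle in the terminal (the prompt text is illustrative):

```bash
ollama run llama2   # first run pulls ~3.8 GB, then drops you into an interactive prompt
# >>> tell me a joke
# ... the model streams its answer token by token ...
# Ctrl+D stops the instance and frees its resources; the model stays on disk,
# so the next `ollama run llama2` spins the instance right back up
```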
Now let me bring up the Mistral model. I have gone ahead and fetched Mistral using ollama run mistral, and it has spun up an instance of that model. Using the Mistral model we can also do classic large language model tasks, so let's use it for summarizing a URL. Earlier, when ChatGPT launched, we had the ability to paste in a URL and get a summary of that specific web page; I don't know why they have dropped that feature, but now we can do it using these on-device large language models. I'll go and fetch an Aeon essay; Aeon is one of those websites that are really good for reading essays, and these are usually very, very long essays. I'm just going to copy the URL and paste it, prefixed with "please summarize this URL", and it starts summarizing, all running on-device. I really don't know the content of this essay, but I think it is doing a decent job, as it picks up the years and everything; it's a very long essay, and the summary is also pretty long, so I think it is doing a really good job of summarizing it. That is one of the LLM tasks, summarizing URLs, that you can now run on-device by fetching these models.
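A quick sketch of that interaction in the CLI (the URL shown is a placeholder, not the essay from the video):

```bash
ollama run mistral
# >>> please summarize this URL: https://example.com/a-long-essay
# ... Mistral prints a multi-paragraph summary, all on-device ...
```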
So let me go ahead and fetch a multimodal model, and for that I'm going to use LLaVA, the completely free, open-source alternative to GPT-4. You can pull it using its command; I have already pulled it, so running the command just spins up the instance. I'm going to stop the instance of Mistral by hitting Ctrl+D, clear the terminal window with Cmd+K, and then run the command to spin up an instance of LLaVA. What I've done is gone ahead and saved a few images on my desktop; I'm going to first pass along these images and then ask questions based on each image.
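A rough sketch of how the image gets passed in the CLI session (the file path and prompt are illustrative):

```bash
ollama run llava
# >>> what is this image about? /Users/me/Desktop/1.jpg
# LLaVA notes that it has added the image to the context,
# then describes the scene it sees in the picture
```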
I can get the path of an image by hitting Cmd+I. As you can see, this is a coffee shop image, so let's see how this multimodal model does with it. I copy the path, write "what is this image about", and reference 1.jpg, because that is the name of the image. It says that it has added the image to the context, then it generates a response, and that is really a lot of detail. If I open the image up, you can see that the inference it has generated is really good. It not only detects that there are two chairs placed in front of a door, it also says there is a coffee cup, indicating that someone has been enjoying a hot beverage, and it catches the handbags and everything. It can also detect the cars; I think these are covered cars, I'm not sure, but it does detect the objects that are outdoors, and, I'm not sure if this is actually correct, it also suggests that this is close to a roadway. I think that is really good. So let's try it with another image.
I'm just going to change the name here, because I have named the images 1, 2, and 3. Let me bring up that image. Well, I think there is a lot of detail, and it also says that this is possibly a promotional photo; it is not claiming that this is a promotional photo, it says there is a likelihood that it is, and I think this is indeed a promotional photo. That is really good inference in my opinion.
So let's try it with the third image; I'll bring up the image here. I like the fact that this one actually talks about possibilities, what "seems as though" and what "could have been" the case, and it also talks about the mood it sees in the picture: it seems as though it has been left behind, unattended, for quite some time, and the environment appears to be deserted. That is really good in my opinion; I'd like to know what you think about it. It talks about the mood, and it talks about what the possibilities might be. Please note that this is all running on-device, and I'm using an M1 Mac, which is fairly old; if you have an M3 Mac or a dedicated GPU, you might have a much better experience, and you might as well pull in an even larger model and get much more detailed inference out of it.
So let's throw a slightly different task at it. I have pulled in an economics image: this is basically "2,000 years of economic history in one chart", which shows the GDP share of different countries over the last 2,000 years. Let's see how this multimodal model does with this specific image. I write, "what do you infer from this image?" Well, I think that is a very generic inference: it says it's an infographic and that it talks about the GDPs of different countries, which is very generic. So let me try a different prompt and see how that goes. When I prompted it with the "trend" keyword, it did detect the year as an axis, but unfortunately it did not do a very good job. The spike it is talking about is probably this one in the mid-19th century, so that would be the 1850s; this is probably the spike it is referring to, where there is an increase in GDP. But that is actually not the case, because this chart totals 100 percent: some countries have grown, and that growth has come at the cost of other countries. It has not been able to detect that, probably because the country names are on the right side and the percentage scale is on the left side. I would agree that this is probably not a very easy chart for a machine learning model to read, but I would like to test it with other charts, or test this image with GPT-4, and see how that performs.
Those are three very popular models that we tried, and there are many more; you also get some truly open, completely uncensored models. These are Llama 2's uncensored models, and the creators, George Sung and Jarrad Hope, have written a very nice article if you are interested in the philosophical aspect of whether open-source models should have alignment: they discuss the alignment issues and the philosophical aspects of having alignment built into open-source models, and it shows you how Ollama takes up the concept of being truly open. If you read through the article, they make a very strong argument that there should not be any alignment built into truly open large language models. The way these large language models are trained does not have the censoring or alignment built in; those are added on top after the training is done, and these alignments are usually influenced by pop culture. They argue that there should not be any one single popular culture that the model follows, and that it should be truly open: that is the strong argument they make in the article, and based on that ethos the Llama 2 uncensored model is built. You can download and run it locally with Ollama just by running its command. If you are into the philosophical aspects of artificial intelligence and its impact on society, this article is a must-read; I'm going to put a link in the description, but you can also just come to the llama2-uncensored models page, and at the bottom you'll find the link.
With that, there is one last thing I'd like to show, which is accessing these large language models via the REST API. We have already installed the Mistral large language model, so let's use the API route to send REST API calls to this locally hosted large language model. I'm going to open VS Code and hit Cmd+Shift+R to bring up Thunder Client. If it is not installed, you might want to use Postman; or, if you want to use Thunder Client, you can hit Cmd+Shift+X to bring up the extensions panel, search for Thunder Client, and install it. I'm just going to go to Thunder Client, which is already installed, click on New Request, close the sidebar, and send a POST request to localhost. Before I do that, if I go to that specific port in the browser, it says "Ollama is running", so we'll be able to send API calls; we already know it is running because we can see it in the menu bar. So let's send an API request.
Here I'll paste the URL, ending in /api/generate, and send a POST request. In the body I mention the model name, and there is one more thing I'll add, which is stream: I set it to false. Whenever a locally hosted model runs, it sets stream to true; that is how you see each of those tokens printed one after the other. But since we are sending this as a REST API call, I would like to get the whole response at once, in one single JSON object. How do I know to do that? Well, they haven't mentioned the API documentation link here, and I think they need to add that link to this model page, but if you go to any other model's page you get this "API documentation" link just under the API example. That page actually serves all the models: you see Llama 2 there, you see LLaVA, and if you search for Mistral you'll find Mistral as well, so the same page serves the purpose for all the models, and the link should also be there in the Mistral API documentation. As you can see, by default it says a stream of JSON objects is returned; you basically get a stream object, but you can set stream to false. So true is the default option, and when you set it to false you get the whole response in one JSON object. That is what I'm going to do here.
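A minimal sketch of the same request the video builds in Thunder Client, as a curl call (the prompt is illustrative, and the response is abridged):

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "What is the capital of India?",
  "stream": false
}'
# Abridged response (field values illustrative):
# {"model":"mistral","response":"The capital of India is New Delhi. ...","done":true}
```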
I hit Send. So I'm sending a POST request to a locally hosted web API, with a body that mentions the model name as mistral. Okay, let me ask something simple: what is the capital of India? I hit Send, and we get the response: the capital of India is New Delhi. It also gives a lot of other detail, "designed and built by prominent British architects" and so on. You can also format the output as JSON: in the prompt you can say "please populate the following JSON" and see how that goes. Well, it more or less works; at least you can extract the JSON, so if you add a filter to pull the JSON out of the string, you do get the JSON with the answer in a nicely formatted way. It does add another line saying "New Delhi is the capital city of India", but you can format the response the way you want just by doing some prompt engineering. The point is that all of this is running in a locally hosted environment, and you are able to send an API request to this localhost port and get the inference response back.
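As a sketch of that prompt-engineering trick (the JSON template and prompt wording are illustrative, not taken from the video):

```bash
# Hypothetical JSON template; inner quotes are escaped inside the JSON string
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Please populate the following JSON and return only the JSON: {\"country\": \"India\", \"capital\": \"\"}",
  "stream": false
}'
# The model may still wrap the JSON in extra text, so filter it out of
# the "response" string before parsing
```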