Building a RAG application using open-source models (Asking questions from a PDF using Llama2)
Summary
TLDR This video offers a detailed guide to running local LLMs on your computer, focusing on open-source models as cheaper, more private alternatives to GPT. The presenter explains why understanding and using local LLMs matters, highlighting their accessibility, lower cost, and privacy. He also demonstrates how to build a retrieval-augmented generation (RAG) system that uses local models to answer questions from a PDF, emphasizing the reasoning behind the code more than the code itself. This content is valuable for anyone looking to deploy AI models in offline scenarios or as a backup for models like OpenAI's.
Takeaways
- 🌟 To run a local LLM on your computer, you use open-source models, which are accessible and low cost.
- 🛠️ Open-source models matter for privacy, letting companies keep their data in house without connecting to external APIs.
- 🔄 The key point is understanding the reasoning behind the code, not just the code itself, which is what the video aims to convey.
- 📚 The process starts by exploring Ollama, a tool that serves as a common wrapper around different models.
- 🔗 Models such as Llama 2, Mistral, and Mixtral can be downloaded through Ollama.
- 📋 Installing Ollama is simple, with versions available for Mac, Linux, and Windows.
- 📈 LLMs are like gigantic mathematical formulas made up of weight and bias parameters.
- 💻 Downloading a model means downloading all of those values and storing them on your hard drive.
- 🔍 LangChain is used to build a simple RAG system from scratch, using the local model to answer questions from a PDF.
- 🔖 A virtual environment is created in Visual Studio Code so the required libraries can be installed without affecting the main system.
- 📊 The video shows how to get answers from a local model and how to process a PDF's content to answer specific questions.
Q & A
Why is it important to know how to use a local LLM?
-It matters for several reasons: open-source models are getting better, they are cheaper than the GPT models for certain use cases, they offer privacy advantages since you don't need to connect to an external API, and they are useful in offline scenarios such as robotics or edge devices.
What benefits do open-source models offer compared to the GPT models?
-Open-source models are cheaper and solve certain use cases effectively without requiring the full power of the GPT models, which makes them especially valuable for privacy-conscious companies or for applications in environments without internet access.
How can an open-source model be used as a backup for the OpenAI models?
-You can configure an open-source model to act as a backup for the OpenAI models, so that if the OpenAI API goes down you can switch immediately to the open-source model and keep your workflow running without interruption.
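The video doesn't show the fallback wiring itself; a minimal sketch of one plausible setup using LangChain's `with_fallbacks`, assuming the `langchain-openai` and `langchain-community` packages are installed and a local Ollama server has `llama2` pulled:

```python
from langchain_openai import ChatOpenAI
from langchain_community.llms import Ollama

primary = ChatOpenAI(model="gpt-3.5-turbo")  # main model
backup = Ollama(model="llama2")              # local open-source fallback

# with_fallbacks() returns a runnable that retries on the backup
# whenever the primary raises, e.g. during an API outage.
model = primary.with_fallbacks([backup])
print(model.invoke("Tell me a joke"))
```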
What is Ollama and what is it for when running an LLM locally?
-Ollama is a common wrapper that lets you run different LLMs, such as Llama 2 or Mixtral, locally on a computer. It simplifies installing and running these models by providing a common interface for managing them.
What is the main difference between the chat models and the completion models mentioned in the script?
-Chat models are designed for conversation and return special structures for AI and human messages, while completion models return direct answers as plain text with no extra structure, focusing on completing text from a given prompt.
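A minimal illustration of that return-type difference, assuming the same two packages and a running Ollama server:

```python
from langchain_openai import ChatOpenAI
from langchain_community.llms import Ollama

chat_model = ChatOpenAI(model="gpt-3.5-turbo")
completion_model = Ollama(model="llama2")

# The chat model wraps its reply in a message object (text in .content) ...
print(type(chat_model.invoke("Tell me a joke")))        # AIMessage
# ... while the completion model returns the raw string directly.
print(type(completion_model.invoke("Tell me a joke")))  # str
```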
What role do embeddings play in retrieval-augmented generation (RAG) systems?
-Embeddings turn documents into vector representations that can be compared with users' questions. This makes it possible to identify the parts of the document most relevant to a question, improving the accuracy of the model's answers.
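Illustrative only, since in the video the vector store handles this comparison behind the scenes; a hand-rolled cosine-similarity check between a document snippet and a question, assuming `langchain-community` and a running Ollama server:

```python
from langchain_community.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="llama2")
doc_vector = embeddings.embed_query("The program costs $450 for lifetime access.")
question_vector = embeddings.embed_query("How much does the program cost?")

def cosine(a, b):
    # Cosine similarity: closer to 1 means more semantically similar.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

print(cosine(doc_vector, question_vector))
```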
How is LangChain used to build a simple RAG system?
-LangChain is used to chain together different components, such as document loaders, prompt templates, LLMs, and output parsers, into a workflow that lets the model answer questions based on the content of a loaded document, such as a PDF.
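A sketch of that pattern; the prompt text follows the video, while the package layout (`langchain-core` plus `langchain-openai`) is assumed:

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = PromptTemplate.from_template(
    "Answer the question based on the context below. If you can't "
    "answer the question, reply \"I don't know\".\n\n"
    "Context: {context}\n\nQuestion: {question}"
)
model = ChatOpenAI(model="gpt-3.5-turbo")

# The | operator pipes each component's output into the next one.
chain = prompt | model | StrOutputParser()
print(chain.invoke({"context": "My name is Santiago.",
                    "question": "What's my name?"}))
```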
What advantages does the chain-based approach offer when building RAG systems?
-The chain approach provides modularity and component reuse, making it easier to build complex systems by connecting pieces of functionality flexibly and efficiently, and allowing adjustments and optimizations without altering the system as a whole.
Why are 'streaming' and 'batching' relevant when using LLMs?
-Streaming delivers answers in real time as the model generates them, improving interactivity, while batching processes multiple questions in parallel, increasing efficiency and reducing total response time.
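A rough sketch of both calls, assuming `chain` is the assembled RAG chain from the video (one that takes a question and looks up its own context):

```python
# Streaming prints chunks as the model produces them ...
for chunk in chain.stream({"question": "What is the purpose of the course?"}):
    print(chunk, end="", flush=True)

# ... while batch() runs several invocations in parallel behind the scenes.
answers = chain.batch([
    {"question": "How many hours of live sessions?"},
    {"question": "How many coding assignments are there?"},
])
print(answers)
```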
What are the challenges of using open-source LLMs compared to OpenAI's GPT models?
-Open-source models may not be as capable or accurate as OpenAI's GPT models, which can lead to less precise or relevant answers. In addition, deploying and maintaining these models locally can require extra resources and technical expertise.
Outlines
🚀 Introduction to local LLMs and RAG systems
The presenter introduces the importance of running open-source language models (LLMs) locally, highlighting three key benefits: cost-efficiency compared with models like GPT, privacy for companies that prefer not to use external APIs, and usefulness in offline applications such as edge devices and robotics. He also mentions using open models as a backup in case a service like OpenAI's goes down. The goal is to teach how to build a retrieval-augmented generation (RAG) system using local LLMs, starting from scratch and emphasizing the reasoning behind the code rather than the code itself.
🛠 Initial setup and configuration
The video then walks through preparing the environment to run an LLM locally, starting with downloading Ollama, a project that makes it possible to run open-source models on the user's computer. It covers installing Ollama, how it provides access to a wide range of LLMs such as Llama 2 and Mixtral, and how to download and manage those models. Using the command line, it shows how to start a model and run basic queries, laying the groundwork for more complex applications.
📝 Creating the development environment
The narrator continues with the development environment, using Jupyter notebooks inside Visual Studio Code to write the necessary Python code. He focuses on creating a virtual environment to manage the project's dependencies in isolation, and shows how to install packages and manage environment variables such as the OpenAI API key, as sketched below. This step sets the stage for interacting programmatically with LLMs, both local and cloud-hosted, and lays the foundation for building the RAG system.
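A minimal sketch of that setup; it assumes a `.env` file next to the notebook containing a line like `OPENAI_API_KEY=...` and that `python-dotenv` is installed:

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads the .env file into the process environment
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
```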
🤖 Accessing and running LLMs
The video explains how to query LLMs from code, first using OpenAI's GPT-3.5 model and then local models such as Llama 2. It introduces the concepts of chat and completion models and how to adapt the code to work with both types. It also discusses the role of parsers and how they can be used to format model responses, making it easier to integrate different LLMs into a single workflow.
🔍 Integrating LLMs with PDFs
The video shows how to build a system that answers questions using the content of a PDF as context. In a practical example, the presenter saves a web page as a PDF to use as the information source. It introduces tools to load and process the PDF in memory (see the loader sketch below) and shows how to use prompt templates to ask the LLMs questions grounded in the PDF's content. This step is crucial for the RAG system, enabling dynamic interaction with information structured in documents.
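Roughly what the loading step looks like, assuming `pypdf` is installed; the filename is illustrative and may differ from the one used in the video:

```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("ml-school.pdf")
pages = loader.load_and_split()  # one Document per PDF page
print(len(pages), pages[0].page_content[:100])
```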
📊 Using vector stores to improve answers
This segment covers optimizing the RAG system with vector stores, which efficiently store and retrieve document pages based on their relevance to a given question. It explains the concept of embeddings and how they make it possible to compare the similarity between a question's text and the document's content. The presenter shows how to set up an in-memory vector store and use it to select the most relevant fragments of the PDF before querying the LLM, significantly improving answer accuracy; a sketch follows.
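A sketch of the in-memory vector store step, reusing the `pages` list from the loader sketch above and assuming `docarray` is installed:

```python
from langchain_community.vectorstores import DocArrayInMemorySearch
from langchain_community.embeddings import OllamaEmbeddings

vectorstore = DocArrayInMemorySearch.from_documents(
    pages,  # the PDF pages loaded earlier
    embedding=OllamaEmbeddings(model="llama2"),
)
retriever = vectorstore.as_retriever()
print(retriever.invoke("machine learning"))  # most relevant pages first
```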
🔧 Configuring and testing the complete RAG system
The last stretch of the video details how to assemble all the previously introduced components to build and test the complete RAG system (an end-to-end sketch follows). It shows how to integrate the vector store with the prompt pipeline and how to pass questions through the system to get answers based on the PDF's content. The presenter runs live tests, comparing the answers different LLMs give to the same questions to demonstrate the system's functionality and flexibility. He also touches on the importance of tuning prompts and the possibility of using the system in practical applications.
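An end-to-end sketch of the assembled chain under the same assumptions as the earlier snippets (`prompt` and `retriever` reused from above):

```python
from operator import itemgetter
from langchain_community.llms import Ollama
from langchain_core.output_parsers import StrOutputParser

model = Ollama(model="llama2")

chain = (
    {
        # Pull the question out of the input dict, feed it to the
        # retriever, and use the returned pages as the context.
        "context": itemgetter("question") | retriever,
        "question": itemgetter("question"),
    }
    | prompt
    | model
    | StrOutputParser()
)
print(chain.invoke({"question": "How much does the program cost?"}))
```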
Keywords
💡Local Language Model (LLM)
💡Retrieval-Augmented Generation (RAG)
💡Open-source models
💡Ollama
💡Parameters
💡Command-line interface
💡Embeddings
💡Vector store
💡Prompt templates
💡LangChain
Highlights
How to run a local large language model (LLM) on your computer and build a complete retrieval-augmented generation system.
Why open-source models matter: they are very effective at solving most problems, at a much lower cost.
The privacy advantage of open-source models, which lets companies keep all their data in house without connecting to external APIs.
Even when OpenAI is the primary model, open-source models can serve as a backup to keep workflows running.
Introduces the Ollama project, a common wrapper that allows open-source models to run on a local computer.
The simple steps to download and install Ollama, and how to download and run different models through it.
Explains that large language models are essentially enormous mathematical formulas made up of billions of parameters.
Shows how to use the ollama command-line tool to pull and run the Llama model, and how to interact with it via the ollama run command.
Discusses building a simple LangChain-based system that pulls data from a PDF file and answers questions.
Shows how to create a new directory and Jupyter notebook in Visual Studio Code, and how to set up a virtual environment to install the required libraries.
Explains how to use an environment file to store the OpenAI key, and how to access and use that key through LangChain.
Shows how to create a prompt template with LangChain and pass the content of the PDF document to the model as context.
Discusses using DocArray and Pydantic to create and work with document embeddings, and how to use those embeddings to retrieve the most relevant documents.
Shows how to build a LangChain chain, including a prompt, model, parser, and retriever, and how to use that chain to answer questions about the content of a specific document.
Shows how to use LangChain's streaming and batching features to make interacting with large language models more efficient and faster.
Transcripts
Hey so uh today I'm going to show you
how to run a local llm on your computer
and how to build an entire rack system
retrieval augmented generation system uh
using those locally running llms and
these are going to be open source models
that we're going to use. here is the reason why this is crucial, right, so even though the GPT models are so far the best at solving most of the problems we want them for, knowing how to use a local llm is very
important first because these open
source models are getting really really
good number two because they're cheaper
and you don't need all of the Power from
GPT to solve certain use cases so if you
know how to use a local uh model now you
can do the same task at the same level
of quality for much uh cheaper number
three for privacy reasons so many many
companies do not want to use the GPT
models they don't want to connect to an
external API they want to have
everything in house and that's what an
open source model will give you also if
you are planning to deploy one of these
models into a scenario where you don't
have connectivity let's say robotics
right or an edge device uh there is no
chance for you to just connect to an API
you will have to use an open source
model so those are some of the reasons
there is one that's also my favorite one
that is even if you're using open AI as
your primary model you can use one of
these open source models as a backup so
if the OpenAI API goes down, the other day they had downtime for a day, I think the model was just returning nonsense, you can immediately flip to an open-source model and keep your workflow
running with no disruption right so many
many reasons uh the the goal of today's
video is to show you how to do that on
your computer starting from nothing so
I'm going to start with an open browser
I'm going to do that I'm going to make
it run I'm going to build that simple
very simple RAG system, uh we're going to read or answer questions from a PDF
so I'm going to get a website download
it as a PDF and answer questions from
that and the most important thing is
that here the thing that matters the
least is the code that I'm going to
write that is not what's important about
this anyone can write the code the thing
that matters the most is the reasoning
behind that code why do we need to do
this or that right that is what I want
to convey on this video and I hope
that's what you get uh out of it right
the understanding of the stuff that
we're building here and the reasoning
behind it all right so uh with no uh
finishing that introduction let me get
here to my browser and I opened we're
going to start here very simple I opened
uh this website is Ollama, so Ollama is the project that is going to allow us to run an open-source model on our computer
okay so here's the thing here's the
thing that I want you to know like when
you think about a model I want you to
think about a like a gigantic
mathematical formula because that's what
it boils down to it's just a bunch of
values weights and biases we call them
parameters and they're put together in a
gigantic mathematical formula that is
huge. like when we talk about Llama 2, the 7B model, that 7B means 7 billion parameters, so it refers to the number of weights and biases that we need to store and execute in order to make any prediction. Llama 2 70B, that's 70 billion parameters. so when we download one of these models, what we are downloading is all of those values, all of those parameters, the 70 billion parameters, we're storing that in our hard drive, and we're storing some sort of instructions on how to put together those parameters and run them as a big mathematical formula. that's basically what we are storing.
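A quick back-of-the-envelope sketch of what storing those parameters amounts to; the byte-per-parameter figures are illustrative assumptions, not numbers stated in the video:

```python
params = 7e9  # Llama 2 7B

# At 16-bit precision each parameter takes 2 bytes ...
print(params * 2 / 1e9)    # ~14 GB
# ... while a 4-bit quantized build needs only half a byte each,
# in the same ballpark as the ~3 GB download shown later.
print(params * 0.5 / 1e9)  # ~3.5 GB
```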
so Ollama is going to serve as the common wrapper around all of these different models, so we can download Llama 2 but we can also download Mistral, and we can run them through Ollama, through this common interface. so to install Ollama is very simple, go to this website, ollama.com, they have versions for Mac and Linux and they just released a preview for Windows, so even if you're on Windows you can run Ollama on your computer. so I already
downloaded it uh it's just it's very
thin it's just it's very small uh when
you download it you can run it and you
can see here on my status bar uh you can
see there is a little llama right there that tells me that Ollama is running on my computer. by the way, the first time you run Ollama it's going to ask you to install the command line tools, I think that's what it does, or to download the Llama 2 model, I don't remember what is the first instruction they give you
but it's very simple to navigate uh and
it's very quick so there is here here in
the website uh you're going to see a
link it's called models so click on that
and here is the list of all of the models that you can run using Ollama, okay, so you have the Gemma models that Google just released, I think it was a week ago, this was updated two days ago, Jesus it is so fast, you get Llama 2, Mistral, Mixtral with an x, LLaVA, I mean it just goes on and on and on, if you care about code you have Code Llama here, so all of
these models you can download now how do
you do that I mean you can obviously you
can go here inside one of these models
and see the entire family you see the 7B
version the 70b version the chat model
you can you know you can explore this on
your own you're going to get information
about your your model how to use it Etc
uh this is what I'm going to do it's
going to be very very simple so I'm
going to go to my command line, and you're going to have ollama as a command after you install it, and here you can see a bunch of available commands. here you get the pull command, so basically you can say ollama pull llama2 and that will download the latest version of Llama 2 to your computer, okay,
I already did that so I'm not going to
do it here plus it takes I mean it's
downloading I think it's like 20
Gigabytes of data so it's going to take
a few minutes to do. uh you can do an ollama list and that will tell you what are the models that I have installed here on my computer, so I have the latest version of Llama 2, I also have Mixtral, the 8x7B version, and I have the latest version of Mistral, okay, and you can see here the ID, you can see the size, 26 GB for Mixtral, Llama 2 is only 3 GB, and when it was modified. I have here, just to show you,
when you download a model uh that is
going to get I don't know how big this
is on the on the recording but when you
download a model uh it basically
downloads these files within a folder
I'm on a Mac here so it will download within my base directory, there is a .ollama directory and it will put in there every file that it downloads, and you can see some of these, like this one here is the Mixtral one, see the 26 GB, that is basically all of the parameters stored there after the latest training run of Mixtral, so all of those values are going to be stored here. so just so you know, if you want to delete them, that's where to come, you can also obviously delete them using the rm command here. so now that I have this
I think I can do, I don't remember the command, but I'm going to do ollama llama2, no, it's not like that, so how do I serve this, oh there we go, maybe serve, let's see, show, run, ah let's do run. there we go, so now we're running Llama 2 here on my
computer okay and now I can say tell me
a
joke and there we go sure here is one we
why don't scientists trust um this is
just a bad joke I'm sorry sorry so
anyway here is the model running on my
computer which is awesome uh if I do
this this is going to give me the help
uh let's just say buy okay awesome so
now I have an open-source model running
on my computer from the command line
that's great but that's not what I want
what I want is to be able to access this
model programmatically so in order to do
that I'm going to create a directory
here and we're going to build a very simple RAG system from zero, from scratch, using LangChain to get data from a PDF file, okay, so that's what I want to
do all right so I'm going to create a
directory I'm going to call it uh I
don't know local model let's call it
local model okay and let's
go open Visual Studio code on that
directory there we go local
model, okay, awesome. so here's Visual Studio Code, I'm going to make it nice
and big and beautiful so you guys can
see and I'm going to create a notebook
I'm going to call it
notebook, and this is a Jupyter notebook. uh just so you know, in order for you to be able to use Jupyter within Visual Studio Code you need to install the Jupyter plugin, this plugin is created by Microsoft, and obviously the Python plugin because this is going to be Python, but yeah, I'm going to be using Jupyter from within my notebook here. uh
awesome, I'm going to open a terminal window and I'm going to create a virtual environment so I can install all of the libraries that we're going to need within that virtual environment, I don't want to be installing anything directly on my computer. so from within the terminal I'm going to do python3 to just call a module, and the module is venv, the virtual environment module that comes with Python, and I'm going to call it .venv, that's the name of the folder that it is going to create. so I'm going to do that, and that is going to execute, and now I have a new folder, it's a hidden folder, but it's going to be a new folder that's going to be called .venv. uh many people use different virtual environment tools, they use Poetry and they use conda or Miniconda, I'm an old school guy, I want to keep it very simple, that's why I'm using the virtual environment here. all right, so let
me just uh go inside that virtual
environment cool and now I can start
installing stuff okay so what do I need
here uh well I I don't know still what I
need but let's first start by just doing
something from this notebook just making
sure this notebook is actually working
it tells me hey you need to select a
kernel what is the kernel I'm going to
go to the python environment that I just
created it I'm going to select that
Visual Studio is going to tell me that
it's going to install everything that it
needs to execute that print line and
boom it runs so this is awesome this is
working okay so this is something that
we are going to need uh here I'm going
to need an environment file so this
environment file is just going to store
any environment variables that I'm going
to use during this presentation so for
this presentation
uh we're going to need the open AI key
we're going to need to pass that uh
obviously I don't have the value here
but this is going to be something like
blah and I'm going to store it there and
then I want to read that key from my
code from my notebook so I can get
access to the open AI key why do I need
access to the open a API because I want
to test everything that I'm doing
locally also with GPT just to make sure
how they compare so that's why I'm going
to access the the open AI key so I have
my key over here I'm going to paste it
uh maybe in order to do that I don't
want to just put it on the on the
screen so you guys don't just use my key
to build your applications for
free
okay I know you guys are not seeing what
I'm doing but basically just pasting my
open AI
key uh off screen so I know I could do
it here and then just change it but
anyway so after doing that this is the
stuff that I'm going to do so I'm
importing the os library and I'm importing this library, it's called dotenv, and that's going to read the environment variables that I just stored in the .env file, it's going to read that into memory when I call this load_dotenv, and now I can use my OpenAI key just like this, it's very simple. now in order for me to be able to use this I need to install, it's called python-dotenv, so I need to install that library, boom,
awesome. and then I'm defining here the model, it's the same variable just written shorter, I'm going to start by using GPT-3.5, I'm going to start by using this, just a variable, boom, that works,
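For reference, a minimal sketch of this environment setup; it assumes a .env file next to the notebook with a line like OPENAI_API_KEY=... and that python-dotenv is installed:

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads the .env file into the process environment
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
MODEL = "gpt-3.5-turbo"  # swapped later for a local model name
```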
okay, so now I'm ready to, just using LangChain, create a very simple model just to make sure that the API is working, okay. so I'm going to be copying the code here, let's see, so how do we access this, I'm going to import the ChatOpenAI model, and yeah, thanks to Copilot, I'm going to create the model, ChatOpenAI, I'm going to pass the OpenAI key, I'm going to pass the name of the model, which is going to be GPT-3.5, and now that I have it here I can just invoke it and I can pass hey, tell me a joke. okay, so I need to install langchain-openai of course, I want to do that. uh, what else do I need to install, more stuff, uh, this might have been installed but just in case I'm going to install langchain as well,
yep it wasn't okay awesome so now can I
run
this, boom, obviously, this is beautiful, okay. so here we go, I asked the OpenAI API tell me a joke and I get back why couldn't the... I get back a response from OpenAI, that is awesome, but this is just ChatGPT, or the GPT model, we don't want that, we want to do it from the locally running model. so how do we do that, it's very simple, instead of using specifically the OpenAI chat model we need to use an Ollama model, and I think that's in a different library, I'm not sure, I think it's in the langchain-community library. let's see, let's import it here, and we're going to get this class, which is Ollama, and what I want to do is instantiate an Ollama model if the model is not a GPT model. so I'm going to do something like
this: if model, sorry, if model starts with, not like this but the other way around, GPT, then do this, right, else, okay, let's do an Ollama model. uh by the way, I can take this out and do it regardless. okay, so this is awesome, so basically if I define that my model is a GPT then I want to instantiate the model using the ChatOpenAI class, if not I want to instantiate the model using the Ollama class, passing the name of the model. in this case I'm going to test this just using
GPT, let's see, so that should work like it did before, it does. now I'm going to change the name of the model, let's do llama2, so I'm going to change the name of the model, and there we go, you get an answer back from Llama. notice something interesting here, notice that when I call the Llama model I get back a string, that's what I get back from the Llama 2 model, but when I call the GPT model I get an AIMessage instance here that has the content inside. so the reason that's happening is because the GPT turbo model, this model here, is what we call a chat model, so it's a
model that's meant for a conversation
right they're going to be AI messages
which is what I get here they're also
going to be human messages which is like
the questions that I'm going to ask that
model for example they're going to be
classified as human messages. so when I interact with that class, LangChain will return a special structure, in this case an AIMessage instance containing the content inside. but the Llama 2 model is a completion model, it's not a chat model, I could be using a chat model as well but I'm not, I'm using a completion model, and that's why what you get back is a string. so how do we fix this, or, it's not a problem that we need to fix, but I don't like to see an AIMessage here, well, we can use LangChain to parse out that AIMessage, just remove that and turn it into a string, right. so LangChain supports the concept of parsers, and I have here what
we need to do and you're going to see
how simple it is so let me paste it here
so I'm going to import a string output
parser and a parser again it's just a
class that's going to take an input and
it's going to transform it in one way in
this case this one here is going to
transform the input into a string which
is what we want. and I'm going to create my first LangChain chain here, LangChain is language chain, I'm going to create a chain here, that's where the name comes from. so my chain is going to take the model, and I want to pipe the output of that model into the input of the next component, in this case it's a parser. so if I do this, basically what's going to happen is that LangChain will talk to the model, will send a request to the model, and then we'll get the output from the model and we'll pipe that output into the input of the parser, which then will return the string. so if I do this,
I'm going to reexecute this line I get
my AI message but I'm going to do it
again just now from the chain so I'm
invoking the chain not the model anymore
I'm invoking the chain so the parser
gets involved. when I do that, boom, I just get back the string. why, because of this parser, that's its job. I'm going to remove the parser, invoke it, now you get the AIMessage, I put back the parser, invoke it, now you get the string, see how that works, that's beautiful. and that is one of the main characteristics of LangChain, you can create increasingly more complex chains using different components. okay, but this is just the beginning, we can now run a model and that model can run locally, I have Ollama here, and we know how to do a chain. uh let me just build a simple RAG system, just very very simple, I want to answer questions from a PDF. so the first question is what is going to be
that PDF so I'm going to go to this
website here, and this is the class that I teach, I teach a live program, it's called the machine learning school program, well, it's actually called building machine learning systems that don't suck, and I'm going to save this, okay, so I'm going to
go here like if I were to print this I'm
going to save my whole website as a PDF
so let me save that as a PDF and I'm
going to save it the same folder I'm
going to put it in the same
folder uh let's see where do I put it
here this is just it's horrible the
dialogue that Apple decided to create to
save components I just hate it so much
all right I'm going to call it uh ml
school which is the the name of the
website okay so I'm going to call it ml
School boom let's go back here here we
go we have our PDF right there so what I
want to do now is use my model to answer
questions from that PDF the first thing
that we need to do is load that PDF in
memory so how do we do that we're going
to need a library that I have it here
somewhere. okay, the library is, pip install, it's called pypdf, or python PDF, however you want to call it, so we need to install that library, and then using LangChain we can actually load that in memory. so we're going to come here and look at this, LangChain supports document loaders, and by the way, there are a bunch of different document loaders that you can use to load information from anywhere. okay, so I'm going to use the PyPDF loader, that's why I needed the library in the first place, I'm going to type what the name of the PDF is going
to be here and I'm going to load it and
split it and that is very important and
then I'm printing out the p pages so you
see what the result was so I'm going to
execute this and see what just happened
so LangChain, using this loader class, loaded and split my PDF into different pages, you're going to see here it says page one of 14, page two of 14, all the way to page 14 of 14, so it split my entire PDF document into different pages and loaded each one of those pages in memory. okay, so I have 15 pages, or 14 pages, in memory right now, that's awesome,
that was a great great step I have that
in memory here the next thing is
creating a template so I need to create
a template I want I say a template it's
a prompt template to ask the model to
answer questions from specific context
so let's do that uh and let's do it the
right way so look at this by the way all
of these stuff all of the steps that I'm
covering here talking about retrieval
augmented generation systems I covered
in much more details in a different
video I'm going to put it somewhere here
uh or maybe the description below or if
not you can find the latest video in my
in my YouTube channel and you're going
to see more detail about these steps
here. so okay, so here is a prompt template that we are going to use to pass to the model. so here is a string, it says answer the question based on the context below, if you can't answer the question, reply I don't know. okay, you can make this more complex if you want, this is
good enough for for us then I'm going to
provide the context and then I'm going
to provide the question that I want to
answer I'm basically telling the model
the question please answer it using this
information don't go to your memory
don't go to the stuff that you've
learned before just answer out of this
section here okay so ideally I'm going
to be able to grab some of the pages on
that PDF put them here the content of
the pages and then answer a question or
ask a question about those pages so I'm
going to create this prompt template, notice these two curly braces here, context and question, those are variables, and LangChain will turn them into variables that I can pass and provide values for.
so here is my prompt, I'm using the PromptTemplate class to create a template from the string that I just passed right here, and now, just to test it out, I'm calling the format function, I'm saying hey, format my prompt,
passing context see how context just
became a variable here or an argument to
this function here is some context and
the question here is a question so I'm
going to execute this just to make sure this works, and you can see, actually let me do a print here to fix those newlines, much better, much better. like this: answer the question based on the context below, if you can't answer the question, reply I don't know, and then we say context, here's some context, and then we say question, here is a question. okay, so
that's awesome we have a prompt so how
do we pass this prompt into the model
well we can just keep building our chain
remember that our chain was like
this and we were piping that into a
parser so we could make this chain
better if we do something like this so
what happens if we do this we have a
chain we start with a prompt and that
that prompt is going to go into the
model and that that model is going to go
into a parer so we can create that the
question now is what do you think is
going when we say chain invoke what do
you think we need to pass well remember
there are two variables that we need to
provide therefore we're going to have to
invoke this chain passing a context and passing a question. so if I say the name I was given was Santiago, and then I'm gonna ask what's my name, oops, and if I do this, your name is Santiago, boom, that works. important
lesson here, notice how when we invoke the chain we need to understand what the input of the chain will be. now there is something that might help there, there is an input schema functionality that you can call on the chain. I mean, obviously you can just look at the first component of the chain, and if you understand what the first component is expecting, that is the invocation that is going to need to happen, but I find this input schema trick, or tool, very helpful because it gives me information about the chain without me having to overanalyze what that chain looks like. so in this case you see, okay,
so this is the chain what is the input
schema to that chain and it talks about
the prompt input so it's going to be the
prompt and it tells me well the
properties is expecting an object and
the properties of that object are a
context okay which is a string and a
question which is also a string so these
are the two variables context and
question see so that is why I know that
I need to invoke that chain with the
context and the question okay awesome so
what do we have right now we have a
chain that already has a
prompt a model and a parser we have the
documents in memory the by the documents
I mean the PDF document we already have
it in memory split by Pages now we need
to find a way to take that document and
pass it as a context but only pass
the relevant portions of that document
as a context so how do we do that well
I'm going to use a very simple Vector
store that is going to uh do several
things for us so number one it's going
to save it's going to serve as a
database for the content in a different
way we're not going to store the pages
of the content just straight into the
database we are going to be storing
embeddings of those documents so we're
going to get the whole PDF we're going
to generate embeddings for each one of
the pages of that PDF and the reason we
generate these embeddings is so we can
later compare those embeddings with the
question that the user is asking and
find the embeddings that are the most
similar or the pages of the document
that are the most similar to the
question the user asks. and I know I'm waving my hands here a little bit, in the video that I mentioned before, the other video, which is called building a RAG system from scratch, in that video I go into a lot of detail
about how embeddings work uh hopefully
by now you know that if not just check
that video out because it's going to
help you understand what is the reason
we create these embeddings the good news
is that all of these embeddings are
going to be created for us behind the
scenes and the doc the uh the vector
store in memory is going to help us
retrieve the pages that are the most
relevant to answer specific questions so
how do we do that well first I need to
install a couple of libraries here. uh the first one is going to be docarray, so I'm going to do pip install docarray, that is important. the second one is a specific version of pydantic. by the way, I'm installing all of these by hand, but in the description and in
this YouTube video you're going to find
a link to the repo with all of this
content so you don't have to do any of
the so you don't have to follow through
you can just go directly and grab the
content including all the libraries that
you need to install okay I think that's
it no I need something else I need to
install this or not uh let's
see no this we might not need this or we
do I don't know let's see let's see if
we need that or
not okay
so
let's create by the way let me just hide
here
the okay much better I have a little bit
more space. so here is what I'm going to create now, I'm going to use a DocArrayInMemorySearch. okay, so this DocArrayInMemorySearch, this is just going to create a vector store in memory. now if we were building a real application we wouldn't be using this vector store in memory, obviously, we would be using something that has permanent storage, like Pinecone or any other vector database out there, but this is good enough for our video here, for our purpose, this is just going to do the same thing but just in memory here on our computer. and the good thing about this DocArray in-memory vector store is that we can just load it and create it off of the documents that we generated. okay, so you can see here
that I'm passing the pages of the
document that we generated from the PDF
right remember that we here are the
pages not not here these are the pages
of the document you can see here pages
is just an array with every single page
so we can create a vector store directly
off of all of those pages what's going
to happen is that all of those pages are
going to uh go into the database the
database is going to generate embeddings
for all of those it's going to save all
of that in memory there is something
else that I need to pass and I need to
provide the embeddings class so what
class we're going to be using to
generate the embeddings here's the thing
every model uses a different model to
generate embeddings so depending on the
model that we create we need to uh
generate embeddings one way or the other
we have here right
now uh either a GPT model or an AMA
model therefore or we are going to have
to generate embeddings uh in a different
way so let's see embeddings let's create
a variable. uh Copilot is not being helpful right there, I think the OpenAI embeddings are here, okay, there we go, so we're going to need this class. so if we're using a GPT model, the embeddings instance is going to be, we're going to instantiate embeddings with the OpenAI embeddings model, and the Ollama one is going to be this one here, so there is an OllamaEmbeddings and that is the one that we want to use, oops, this is the one that we want to use if we're using Llama 2 or Mixtral or any other Ollama model. let me execute that again, all
right, let me go here, all of this is good, all of this is good, here is, boom, okay, so there is a problem here with the library. okay, so I have a problem here, let me restart this just to make sure that is not what's happening, I remember the first time I installed this that I had issues as well with the in-memory vector store, but no, now it's working fine. so now I have a vector store here,
which is great, and I think I can do something like, retrieve, not tell me a joke, but let me retrieve something related to machine learning, let me see if this works, oh, maybe like this, there we go. okay, so I get to my vector store and I turn it into a retriever, and a retriever is a component of LangChain that will allow you to retrieve information
from anywhere so basically here what I'm
doing just to put it in a different way
that it's it's it's
uh a little bit less convoluted I'm
going to create a retriever off of the
vector store and again you don't need a
vector store for a retriever you can
create a retriever that that's going to
be using Google searches you can create
a retriever that's going to get
information from anywhere so just in
this case it's just going to come from
the vector store and then I can invoke
my Retriever and then pass information
and what's going to happen is that that
retriever or that Vector store is going
to return the four top documents that
are the most relevant to the concept
machine learning okay so anything that's
relevant to that concept going to come
sorted in order of importance back and I
think I can say there might be a k here
let's do two not here somewhere
somewhere there is a actually uh maybe
top I don't think is is top K but there
is a parameter which I don't remember
right now I will have to look at the
documentation if you wanted to control
how many documents are going to come
back you can do that through the
retriever I'm not sure what the name is
right now it doesn't really matter we
can use four all right so now we have a
retriever so the idea here is going back
to our chain let me copy our
chain down here so getting back to my
chain I have a prompt I have a model and
I have a parser and remember that that
prompt is expecting a context and a
question okay that's what I need to pass
that prompt now the context is g to come
from this retriever
okay so this is what's going to get a
little bit tricky here the prompt is
expecting a map so imagine that I do
this if I take a map I'm going to create
a map here and I'm going to pass a
context uh let's see the name what was
the the name I was given was Santiago
and I pass a question what is my name uh
can I do this or not I think I I'm going
to need something like
this okay
can I do
invoke, doesn't work, let's figure this out. why, okay, oh, this is it, runnable, okay, can I do this, actually, nah, that's not it, yeah, I thought that this was going to get turned into a runnable
working so it says let me let me
recreate
this I mean I know how to fix the
problem I just don't want to fix it like
that. let me do this, import operator, no, sorry, from operator import itemgetter, and then let's get an itemgetter and let's do question, and let's pass that to a retriever, and then let's do this. still doesn't work, why, why doesn't it work, let me
see the documentation here really quick
and see why that is not
working okay so you know what let me
just pass this question here oh
well I I understand why that doesn't
work I need to pass a question
obviously now let's do
this now there's something let me just
check
here okay so I have the
retriever okay
I have a prompt I have a model I have my
parser so that is working
fine this looks much better here the
question let me gra let me grab the same
question that I'm
passing okay there we go this is what
that was that was the problem all right
so I'm going to explain here really
quick what was happening here uh because
it's not it's not obvious what was
happening here all right
so I have my prompt, my model, my parser. I need to pass the prompt, I need to pass a context and I need to pass a question to my prompt. okay, so what I want to do here is, the context is going to come from a retriever, so I'm going to put this here, organize it in a different way so it's more obvious. so the context is going to come from a retriever, but that retriever requires the question that I'm invoking this chain with, so we have this weirdness here, and there is another way of doing it, but we're using here something that's called the itemgetter function. and just so you guys understand what itemgetter is, if I do itemgetter and I pass ABC, now I can call that, actually Copilot is being helpful here, there we go. if I do this I create this itemgetter with ABC, that becomes a function that when I call it with a dictionary, for example, it's going to return one two three, so if I execute this you get the one two three here. so in this particular case, if I have itemgetter of question and I pass it this dictionary here, of course what I'm going to get back is just the question, right, what is machine learning, because this is the question, okay.
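For reference, the itemgetter behavior described here boils down to this snippet (the key name follows the video's example):

```python
from operator import itemgetter

# itemgetter("question") builds a callable that extracts that key.
get_question = itemgetter("question")
print(get_question({"question": "What is machine learning?"}))
# -> What is machine learning?
```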
this is the way you sort of like put
together this chain I'm going to expect
here the first module is going to be
what they call a runnable so it's
basically a component that can run so a
runnable here is going to generate a map
because that map is what's going to be
passing we're going to be passing that
map with context and question to The
Prompt we're going to generate a map and
the first value of the first variable of
the map which is context is going to
come from the retriever but this is a
unit here so I'm basically grabbing the
question from the invoke I'm grabbing
the question piping that question into
the Retriever and the output of that is
what's going to go into context and we
know what the output of that is going to
be because we already did it here you
can see here this is my
Retriever and I'm piping or or invoking
that with a question and I'm going to
get an array of documents so this
context here is nothing else that this
array of documents okay and then
question question which is the second
value of the map is just the value of
the question I'm passing through the
question from the invocation I'm passing
it through here to the second component
and that creates my entire chain so now
I need to test this chain and to test
this chain I actually have a bunch of
questions that I'm going to be using to
test this chain and here are my
questions what is the purpose of the
course how many hours of live sessions
how many coding assignments and just a
bunch of questions and now I'm going to
go one by one and I'm going to answer
those questions. so, let's actually do this: for question in questions, we can invoke the chain
saying hey this is my question I'm going
to do something else which is I'm going
to print here the name of the question
so I'm going to do hey this is the
actually let me do a printf question I'm
going to print the name of the question
I'm going to print the answer so let's
do let's do something like this
answer and then let's put all of this
here inside I'm going to need to change
this to single quotes and then let's
just print a new line and that is going
to call, let me give it a try. what is the purpose of the course, and this is the answer, okay, let's see, let's see, this is
the GPT model by the way yeah this is
fine how many hours the program offers
18 hours of live interactive sessions
that is correct how many coding
assignments there are 30 coding
assignments that is correct is there a
program certificate yes that is correct
what programming language will be used
python how much does the program cost
the program costs $450 for Lifetime
access so this is correct that's the GPT
model answering questions from my PDF
let's now change the model to something different, I'm going to go up here and I'm going to do llama2, going to execute, and everything here should stay the same,
if we did our
jobs everything should work without
making any other changes to what we
built let's do all of that by the way
loading the documents, and Llama 2 is running here on my computer, which, I don't have like a big Nvidia GPU, I have my M3 laptop, it's a pretty good laptop, but obviously with a bigger GPU it's going to run much faster. so let's see, okay, answer, look at this, this Llama 2 is just so verbose: the answer to your question is 18 hours of hands-on live training spread over three weeks, but it works, it's correct. how many coding assignments: I don't know the exact number of coding assignments in the program, according to the provided document there are 30 coding assignments. Jesus Christ, it knows where the information is, it just sucks at summarizing it. is there a
program certificate yes what programming
language uh python how much does the
program cost: based on the context provided, the program costs a thousand to join. that is just not true, it is very clear, I don't think there is even a $1,000 mention in the entire page, so it just hallucinated that
completely. so obviously, to get this working, by the way, I also have this other model, let me just try this model just for fun, and then I'm going to show you something else before we finish, I'm going to try Mixtral, just running the whole thing here. uh, one thing that I wanted to mention is that obviously these models are not as good as the GPT models, and here I'm not even using GPT-4, GPT-4 is so much better, but if you play with the prompts you can get these models to do a very good job summarizing stuff. obviously my prompt here is very, very bare bones, just for this example, but I mean, don't be discouraged because the model doesn't answer correctly some of these questions, playing with the prompt will go a long way. yeah, so let's see, it's still running, this is a big model, it takes a little bit of time to generate all of those embeddings because the Mixtral 8x7B is huge compared to Llama 2. let's see, did we start
answering yet not
yet still invoking this
chain it's coming it's coming through 20
seconds just to invoke this question
here there we
go. okay, so now it's going to try to answer all of those questions. again, this is just a big model, if we go back, let me see, Mixtral is this one here, it's the 26 gigabyte, compare that to the 3 gigabyte, this one here is Llama 2, so the 26 gigabyte one is Mixtral, so it takes quite a bit of time on my computer
to produce any results and speaking of
and while that works speaking of of of
being slow and whatnot uh I want to show
you something else which is it's pretty
cool and the first thing is how to
stream so you can use here I'm invoking
my chain and waiting for an answer to
come back to display the whole answer
but a trick just to to for your users
when you present this information
could be just streaming answers back so
you can see here I'm calling chain.stream, instead of calling invoke I'm calling stream with a question, and
whenever this finishes uh it has answer
one uh this hallucinated this four hours
of live sessions per week oh no no no
this is true with two live sessions each
lasting two hours these live sessions
take place every Monday and Thursday so
that is true
but it's not
returning, oh, there we go, look at this, so Mixtral is doing a multiplication here, so it's saying it's 12 hours, and it's ignoring, ah, there we go, however, the document mentions that there are 18 hours of hands-on live training. okay, it's just very, very verbose, trying to do some math. yeah, it says that the number is not specified in the documents but it can be inferred that there are at least 30 coding assignments. what do you mean, like, you first tell me that you cannot infer it, I mean that it is not mentioned, and then that you can infer it, no, you just read that it says 30 coding assignments in my document, these models are just... yeah. okay, what model will be used is Python, and it says that the document does not provide information on the cost of the program, by the way, the information is there, you saw GPT
doing it let's go back to GPT here
really quick so we can test the streaming. I'm going to show you streaming and something else really quick, and then we are just going to be done with
this okay
so this is how it works when it answers
one question after the other right and
you can see boom boom boom boom it
displays the answers all together but if
we do streaming look at what's going to
happen boom boom boom see let's let me
try to do it
again see how it just sort of like
Builds on the question and it's really
fast that's why you barely notice but it
builds on the answer uh just because
it's streaming out the characters as
they are produced by the model so that's
super cool the other thing that you can
do is just batching uh which is also
super cool so here I have a bunch of
questions and I'm answering those
questions one by one we can also just do
batching so batching basically I'm just
passing instead of passing just a single
question I'm passing an array of
questions and when I do that look what's
going to happen it's just going to take
a little bit more time but boom it's
just going to display all of the answers
at the same time and the good news is
that all of these calls are going to be
in parallel behind the scenes so we
don't have to wait for one answer in
order to ask the next question we can
answer we can ask many questions at the
same time time so the overall result is
going to be way faster so all of that is
thanks to L chain so again the code is
going to be down uh in the description
below just make sure you like this video
it's a ton of work that goes into this
videos make sure you like these videos
I'm going to be creating more videos but
it's your likes what makes me create
more videos If you guys don't like it
well just going to stop creating like
these videos uh what you learned today
just as the final thing thing that I
need to mention is how to use these
models locally right and you can do this
on your Linux server on your own
computer or whatever use these models
locally and combine them or create a
piece of code that will allow you to use
these models uh regardless of the that
the exact model you can use one or the
other and the entirety of your code does
not need to change to reflect that so
hopefully you enjoyed it um I have a
bunch of videos that are going to be
coming through I think the next one is
going to be a simpler one how instead of
just doing a PDF how you can connect to
the web directly and answer questions
from a website I think that's the one
that's going to come next we'll see uh
but anyway thank you and I will see you
in the next one
bye-bye