Python RAG Tutorial (with Local LLMs): AI For Your PDFs

pixegami
17 Apr 2024 · 21:33

Summary

TLDR: In this video, we build a Python RAG application that lets you ask questions about a set of PDFs, such as board game instruction manuals, using natural language. The application provides answers along with references to the source material. We cover running it locally with open source LLMs, updating the vector database with new entries, and evaluating the quality of the AI-generated answers. Earlier tutorials are recommended for beginners, and a GitHub repository provides the source code for a deeper understanding.

Takeaways

  • 📚 The tutorial explains how to build a Python RAG (Retrieval-Augmented Generation) application for asking questions about a set of PDFs, such as board game instruction manuals.
  • 🔍 The application can answer questions using the content of the PDFs, also providing a reference to the source material.
  • 💻 The tutorial covers how to run the application locally using open source LLMs (large language models).
  • 🆕 It also explains how to update the vector database with new entries without having to rebuild the entire database.
  • 🔧 The script covers testing and evaluating the quality of the AI-generated answers so that changes to the application can be validated quickly.
  • 🔄 RAG stands for Retrieval-Augmented Generation, a method of indexing data so it can be combined with an LLM to create an AI chat experience that uses that data.
  • 📈 To build the database, the PDF documents are split into smaller chunks, turned into vectors (embeddings), and stored in the vector database.
  • 📝 The demo shows how the application can answer questions about the rules of Monopoly, using data from the PDF instruction manuals.
  • 🔗 The embedding step is crucial so that queries match the relevant chunks of information in the database.
  • 🛠️ The tutorial uses Ollama to manage and run open source LLMs locally on a computer, although other options such as OpenAI or AWS Bedrock are also mentioned.
  • 📝 The application can be tested with unit tests consisting of questions and expected answers, using an LLM to judge whether the answers are equivalent.

Q & A

  • What is RAG and how does it work in this Python application?

    -RAG stands for Retrieval-Augmented Generation. It is a method of indexing a data source so it can be combined with an LLM (large language model), giving an AI chat experience that can leverage that data.

  • What types of documents are used in this RAG application example?

    -Board game instruction manuals, such as the ones for Monopoly or CodeNames, are used as the source documents for this RAG application.

  • How does the application answer a question about the PDFs?

    -The application splits the PDF data into small chunks, turns them into vectors (embeddings), and stores them in a vector database. When a question is asked, the database is searched for the most relevant entries, which are then used to generate the answer.

  • What are the advantages of using a local LLM to generate the answer?

    -A local LLM avoids the costs associated with online services and offers the flexibility to modify or add information without having to rebuild the entire database.

  • How can the application be updated with new entries in the database?

    -By assigning a unique ID to each chunk of text, the application can check whether an item already exists in the database and, as appropriate, update it or add it.

  • What are the key tools and libraries used in this tutorial?

    -This tutorial uses tools such as Langchain for loading documents, ChromaDB for the vector database, and Ollama for the local LLM.

  • How can new PDFs be added to the application without rebuilding the database?

    -By using a unique ID for each chunk of text, the application can identify new documents and add them to the existing database without having to rebuild it.

  • What is the difference between an LLM and an embedding function?

    -An LLM (large language model) is a deep learning model used to generate text, while an embedding function is used to turn data into vectors, which serve as keys in a vector database.

  • How can the application evaluate the quality of the AI-generated answers?

    -By using unit tests and asking another LLM to judge whether the answers are equivalent, the application can determine whether the answers are correct or not.

  • How are unit tests used to evaluate the application's answers?

    -Unit tests are written with questions and expected answers. The application is queried with these questions and its answers are compared to the expected ones, using an LLM to judge whether the answers are equivalent.

  • What are the potential challenges of using an LLM to evaluate answers in unit tests?

    -One challenge is that the LLM might be too generous in its evaluation, which could lead to accepting incorrect answers. It is therefore important to include negative test cases to make sure that wrong answers are correctly identified.

Outlines

00:00

😀 Building a Python RAG application

In this video, the goal is to develop a Python RAG application that lets you ask questions about a set of PDFs using natural language. The PDFs used are board game instruction manuals such as the ones for Monopoly or CodeNames. The application can answer questions and provide a reference to the source material. The tutorial covers advanced features such as running locally with open source LLMs, updating the vector database with new entries without having to rebuild the entire database, and how to test and evaluate the quality of the AI-generated answers. A quick recap of the RAG (Retrieval-Augmented Generation) concept is given, followed by a demo of the finished application.

05:04

📚 Processing the data and building the vector index

The script processes the PDF data by splitting it into small chunks, then turns these chunks into vectors (embeddings) and stores them in a vector database. It is essential to use the same embedding function both when creating the database and when querying it. Different embedding functions are discussed, including AWS Bedrock and Ollama, which makes it possible to manage and run open source models locally. Creating the database with ChromaDB is explained, showing how to add or update existing entries by giving each chunk of text a unique identifier.

10:06

🔍 Updating the database and managing the data

This part explains how to add new PDFs to the database without having to rebuild it entirely. Each chunk of text is identified by a unique ID based on the file path, page number, and chunk index. The application can detect new documents and add them to the database while avoiding duplicates. Updating an existing page is a harder problem that is not covered in this video, but viewers are encouraged to suggest solutions.

15:06

🤖 Integrating the AI and generating answers

The AI generates answers with a local language model (Ollama running the Mistral model). The script builds a prompt that includes the chunks of text most relevant to the question, followed by the question itself. The LLM is then invoked to generate an answer based on that context. The importance of embedding quality is emphasized, since it directly affects the relevance of the information the application retrieves.

20:07

📝 Tests for evaluating answer quality

To evaluate the quality of the application's answers, a set of unit tests is used. These tests include predefined questions with expected answers, and they use an LLM to determine whether the application's answers are equivalent to the expected ones. The tests are structured to allow an approximate evaluation of correctness, taking into account the subjectivity of natural language. Both positive and negative test cases are used to make sure the evaluation is reliable.

🚀 Conclusion and invitation to future participation

The video ends with an invitation for viewers to suggest ideas for future projects, such as deploying to the cloud. Links to the source code on GitHub are provided for anyone who wants to review or run the complete project. The author stresses the importance of understanding the key code snippets and encourages viewers to give feedback to improve and expand future content.


Keywords

💡RAG

RAG stands for 'Retrieval-Augmented Generation', a method of indexing data so it can be combined with an LLM (large language model). In the video, this makes it possible to create an AI chat experience that can leverage the provided data. For example, to answer a question about the rules of a game, the RAG application searches the indexed data and provides an answer based on that information.

💡LLM (Large Language Model)

An LLM is a 'Large Language Model', a type of sophisticated language model capable of generating or understanding natural language text. In the script, the author uses an LLM to generate answers to questions asked by users, drawing on the data indexed by the RAG system.

💡PDF

PDF stands for 'Portable Document Format', a file format used to represent documents independently of hardware or software. In the video, PDFs are used as the data source for the game instruction manuals, which are parsed and indexed by the RAG application.

💡Embedding

In the context of AI, an 'embedding' is a compact vector representation of data, such as text, that captures its essential characteristics. In the video, the chunks of text extracted from the PDFs are turned into embeddings, which are used as keys for storing and searching information in the vector database.

💡Vector database

A 'vector database' is a database designed to store and manage vectors, often used for similarity-based information retrieval. In the script, the embeddings of the text chunks are stored in a vector database so that relevant information can be retrieved efficiently.

💡ChromaDB

ChromaDB is a vector database mentioned in the script, used to store the document embeddings. It makes it possible to search and add entries efficiently, even when documents are modified or updated.

💡Ollama

Ollama is a platform for managing and running open source LLMs locally on a computer. In the script, the author mentions using Ollama to generate embeddings locally, although for his own application he uses online services for better embedding quality.

💡Unit testing

'Unit testing' is a software testing method in which units of code (such as functions or methods) are tested individually to make sure they work correctly. In the video, the author explains how to use unit tests to evaluate the quality of the answers generated by the RAG application.

💡Langchain

Langchain is a library used in the script to load and manage documents, as well as to create chunks and embeddings. It provides tools for integrating different types of documents and building AI features on top of them.

💡Mistral

Mistral is an open source language model mentioned in the script, used to generate answers to questions in the RAG application. It runs on a local Ollama server to provide the AI chat interface.

Highlights

Building a Python RAG application to ask questions about a set of PDFs using natural language.

Using board game instruction manuals such as the ones for Monopoly or CodeNames, in PDF form, as the data to index.

Introduction of advanced features such as running locally using open source LLMs.

Updating the vector database with new entries without a full rebuild of the database.

Evaluating the quality of the AI-generated answers to quickly validate changes made to the application.

Demo of the finished RAG application with questions about the game instruction manuals.

Using a local LLM to generate answers based on the data found in the PDFs.

Explanation of how the data transformation and answer generation work behind the scenes.

Using Langchain to load the PDF documents and handle different document types.

Demonstration of how to split documents into smaller chunks using Langchain's recursive text splitter.

Creating an embedding for each chunk of text for indexing and storage in the database.

Using AWS Bedrock to generate embeddings, with the option of switching to other embedding functions.

Introduction of Ollama as a platform for managing and running open source LLMs locally.

Creating the vector database with ChromaDB and handling the addition or update of existing items.

Setting up a system for testing and evaluating the quality of the application's answers using unit tests and an LLM.

Demonstration of running the application with queries and generating answers from the embedded data.

Using pytest to write unit tests and evaluate the quality of the RAG application's answers.

Method for evaluating answers using an LLM to determine whether the answers are equivalent.

Conclusion on adding new features to the RAG application and the possibility of learning together with the community.

Transcripts

play00:00

In this video, we're going to build a Python RAG

play00:02

application that lets us ask questions about

play00:05

a set of PDFs we have using natural language.

play00:08

The PDFs I'm going to use here are a bunch of board game instruction

play00:11

manuals for games like Monopoly or CodeNames.

play00:14

I can ask questions about my data, like "how do I

play00:16

build a hotel in Monopoly?" The app will give me

play00:19

an answer and a reference to the source material.

play00:22

Now, I have done a basic RAG tutorial before on this

play00:24

channel, but in this video we're going to take it up

play00:27

a notch by introducing some more advanced features

play00:30

that you guys asked about in the comments last time.

play00:33

We're going to cover how to get it running locally

play00:35

on your computer using open source LLMs.

play00:38

I'll also show you how to update the vector database with new entries.

play00:42

So if you want to modify or add information, you can do that

play00:45

without having to rebuild the entire database from scratch.

play00:49

Finally, we'll take a look at how we can test and evaluate

play00:52

the quality of our AI generated responses.

play00:55

This way you can quickly validate your app whenever you make

play00:58

a change to the data source, the code or the LLM model.

play01:02

All right, let's get started.

play01:03

If you haven't built an app like this before,

play01:06

then I highly recommend you to check out my

play01:09

previous video tutorial on this topic first.

play01:13

It will help you to get up to speed with all of the basic concepts.

play01:16

Otherwise, here's a quick recap. RAG stands for Retrieval

play01:19

Augmented Generation, and it's a way to index a

play01:22

data source so that we can combine it with an LLM.

play01:26

This gives us an AI chat experience that can leverage that data.

play01:30

Here's a quick demo of the completed app.

play01:32

I have my Python script here and I'm going to

play01:34

ask a question about my data source, which

play01:36

is going to be board game instruction manual.

play01:39

So I can ask, "how do I build a hotel in Monopoly?"

play01:44

And the result is that it gives me a response based on the

play01:48

data that it found in the PDF sources that I provided it.

play01:52

So the response is going to use that and actually phrase

play01:55

it into a proper natural language response.

play01:58

It's not just going to copy and paste the raw data source.

play02:01

And here it's telling me that if I want to build

play02:03

a hotel, I need to have four houses in a single

play02:05

color and then I can buy the hotel from the bank.

play02:08

And in this version of the app, I'm also using

play02:10

a local LLM model to generate this response.

play02:13

So here I have my Ollama server running in a separate terminal.

play02:17

If you don't know what that is yet, that's okay. We'll cover it later.

play02:20

But here's the actual LLM reading the question

play02:22

and then turning this into a response.

play02:25

Here's a quick recap on how that all works behind the scenes.

play02:29

First, we have our original data source, the PDFs.

play02:32

This data is going to be split into small chunks

play02:35

and then transformed into an embedding

play02:37

and stored inside of the vector database.

play02:40

Then when we want to ask a question, we'll also turn our query into an embedding.

play02:45

This will let us fetch the most relevant entries from the database.

play02:48

We can then use those entries together in a prompt

play02:51

and that's how we get our final response.

play02:54

For this tutorial, we're going to mainly focus on the

play02:56

features I mentioned at the beginning of the video.

play02:59

But for everything else, we're going to be speeding through it a little bit.

play03:02

So if you feel like it's all going a little bit

play03:04

too fast, you can either check out my previous

play03:06

RAG tutorial video first to learn the basics.

play03:09

Or you could also follow along by looking through the code itself on GitHub.

play03:14

Links will be in the description.

play03:16

Here are the main dependencies I'll be using in this project.

play03:18

So go ahead and install or update them first before you start.

play03:21

First, we'll need some data to feed our RAG application with.

play03:25

Gather some documents that you'd like to use as your source material.

play03:28

In my previous video, a lot of you asked me how to do this with PDFs.

play03:32

So I'm going to be using PDFs here.

play03:34

I'm going to use board game instruction manuals.

play03:36

I've got one for Monopoly and I've also got one for A Ticket to Ride.

play03:40

And I just found these for free online.

play03:42

So you can use whatever you want, but this is what I'm going to use here.

play03:45

Just download the PDFs you want to use online and then put them inside a folder.

play03:49

In this case, I've put it inside this data folder here in my project.

play03:53

This is the code I can then use to load the documents from inside that folder.

play03:57

It's using a PDF document loader that comes with the Langchain library.
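
As a rough sketch, that loading step might look like the following (the loader class, import path, and the "data" folder name are assumptions and may differ between Langchain versions):

    from langchain_community.document_loaders import PyPDFDirectoryLoader

    DATA_PATH = "data"  # folder holding the PDF manuals (assumed name)

    def load_documents():
        # Each page of each PDF becomes one Document with page_content
        # plus metadata such as the source file path and the page number.
        loader = PyPDFDirectoryLoader(DATA_PATH)
        return loader.load()

    documents = load_documents()
    print(documents[0])  # inspect one page: content and metadata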

play04:01

And for future reference, if you want to load other types of

play04:04

documents, you can head over to the Langchain documentation.

play04:07

Look up document loaders and then just pick from any

play04:10

of the various available document loaders here.

play04:13

There's things for CSV files, a directory, HTML, Markdown and Microsoft Office.

play04:19

And if that's still not enough, you can click

play04:21

on the document loader integrations and there's

play04:23

a whole list of third-party document loaders

play04:25

available for you to choose from as well.

play04:28

And if you want to see what one of these documents

play04:30

looks like after you've loaded it,

play04:32

you could just go ahead and print it out.

play04:34

You should see an object like this.

play04:35

So each document is basically an object containing

play04:38

the text content of each page in the PDF.

play04:41

It also has some metadata attached, which tells

play04:43

you the page number and the source of the text.

play04:46

Our next problem is that each document or each page

play04:49

of the PDF is probably too big to use on its own.

play04:52

We'll need to split it into smaller chunks and we can use Langchains

play04:55

built-in recursive text splitter to do exactly that.

play04:59

After you run that on your documents, you'll find that each chunk is a lot smaller.
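
A minimal sketch of that splitting step, reusing the documents loaded above (the chunk sizes are illustrative assumptions, not values taken from the video):

    # Depending on your Langchain version this may instead be imported
    # from langchain.text_splitter.
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    def split_documents(documents):
        # Break each page into smaller overlapping chunks for embedding.
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=800,     # characters per chunk (illustrative value)
            chunk_overlap=80,   # overlap between neighbouring chunks (illustrative)
            length_function=len,
        )
        return splitter.split_documents(documents)

    chunks = split_documents(documents)
    print(f"Split {len(documents)} pages into {len(chunks)} chunks.")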

play05:04

So this is going to be handy when we index and store the data.

play05:07

Next, we'll need to create an embedding for each chunk.

play05:10

This will become something like a key for a database.

play05:13

I actually recommend creating a function that returns

play05:15

an embedding function because we're actually going to

play05:18

need this embedding function in two separate places.

play05:21

The first is going to be when we create the database itself.

play05:24

And the second is when we actually want to query the database.

play05:28

And it's very important that we use the exact same

play05:30

embedding function in both of these places.

play05:33

Otherwise, it's not going to work.

play05:35

Langchain also comes with a lot of different embedding functions you can use.

play05:39

In this case, I'm using AWS Bedrock because I tend

play05:41

to build a lot of stuff using AWS already.

play05:44

And the results are pretty good, from what I can tell.

play05:46

But you can switch to using a different embedding function as well.

play05:49

You can choose from any of the embedding integrations

play05:51

listed here on the Langchain website.

play05:54

For example, if you want to run it completely locally on your

play05:57

own computer, you can use an Ollama embedding instead.
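
Here is a sketch of a shared embedding-function helper along those lines; the specific Langchain wrappers and the embedding model names are assumptions, not details confirmed in the video:

    from langchain_community.embeddings import BedrockEmbeddings, OllamaEmbeddings

    def get_embedding_function(local: bool = False):
        # The exact same function must be used when building the database
        # and when querying it, or the vectors won't be comparable.
        if local:
            # Requires a running Ollama server with this embedding model pulled.
            return OllamaEmbeddings(model="nomic-embed-text")  # example model name
        # Uses your default AWS credentials/region; defaults to a Titan embedding model.
        return BedrockEmbeddings()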

play06:01

Of course, for this to work, you also need to install Ollama

play06:03

and run the Ollama server on your computer first.

play06:06

If you haven't used Ollama before, you can think

play06:08

of it as a platform that manages and runs

play06:11

open source LLMs locally on your computer.

play06:14

Just download it from the official website, Ollama.com,

play06:17

com, and then install any of the available

play06:20

open source models like Llama2 or Mistral.

play06:23

You can then run this command to serve the model as a REST API on your local host.

play06:28

Now, you'll be able to use an LLM just by calling this local API.
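
For reference, a small sketch of calling that local REST API directly from Python; the endpoint and payload follow Ollama's documented /api/generate format, and the model name is an assumption:

    import requests

    def ask_ollama(prompt: str, model: str = "mistral") -> str:
        # Ollama listens on localhost:11434 by default; /api/generate returns
        # a JSON object whose "response" field holds the generated text.
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["response"]

    print(ask_ollama("Say hello in one short sentence."))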

play06:32

Of course, the Langchain module for Ollama embeddings will handle

play06:35

all of this for you as long as the server is running.

play06:38

However, just as a heads up, for my own testing

play06:40

using one of the 4GB models on Ollama, the

play06:43

embedding results just weren't very good.

play06:46

For RAG apps, having good embeddings is essential,

play06:48

otherwise your queries won't match up with the chunks

play06:51

of information that are actually relevant.

play06:54

So for myself on this project, I'm still going to use a

play06:57

service like OpenAI or AWS Bedrock for the embeddings.

play07:01

But if your computer can handle it, you can try

play07:03

using a larger, more powerful model on Ollama

play07:05

as well, and please let me know how that goes.

play07:08

By the way, some of you might be wondering at this point,

play07:10

how did I measure the quality of the embeddings?

play07:13

Well, we'll get to that later when we look at testing.

play07:15

Now let's walk through the process of creating the database.

play07:19

Once we have the documents split into smaller chunks, we can use

play07:22

the embedding function to build a vector database with it.

play07:25

So just as a quick recap, a vector is something like

play07:28

a list of numbers, and our embeddings are actually

play07:31

a vector because they're just a list of numbers.

play07:34

So a vector database lets us store information

play07:37

using vectors as something like a key.

play07:40

And in this video, we're going to be using ChromaDB as our vector database.

play07:44

In my first video, we actually had code that looked

play07:47

a lot like this, and it's useful if we wanted

play07:50

to create a brand new database from scratch.
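
That "from scratch" version might be sketched like this; the persistence path and the use of Langchain's Chroma wrapper are assumptions, and get_embedding_function() is the shared helper sketched earlier:

    from langchain_community.vectorstores import Chroma

    CHROMA_PATH = "chroma"  # directory where the database is persisted (assumed)

    def build_database_from_scratch(chunks):
        # Embeds every chunk and writes a brand new Chroma database to disk.
        return Chroma.from_documents(
            chunks,
            get_embedding_function(),
            persist_directory=CHROMA_PATH,
        )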

play07:53

But what if we wanted to add or update items in an existing database?

play07:58

ChromaDB will let us do this too, but first we'll

play08:01

need to tag every item with a string id.

play08:04

Let's go back to our chunk of text and figure out how we can do this.

play08:08

So as you can see, each chunk already has its source file path and a page number.

play08:13

So what if we put it together to do something like this?

play08:16

We'll use the source path, the page number, and then the chunk number of that page.

play08:21

Because remember, a single page could have several chunks.

play08:24

That way, every chunk will have a unique but deterministic id.

play08:28

We can then use this to see if this particular chunk exists in

play08:31

the database already, and if it's not, then we can add it.

play08:35

Implementing this is pretty easy as well.

play08:37

We can loop through all the chunks and look at its metadata.

play08:40

We'll concatenate the source and the page number to make an id.

play08:44

But because a single page is split up into multiple chunks,

play08:47

we actually have many chunks sharing the same page id.

play08:50

Solving this is pretty easy though.

play08:52

We can just keep count of the chunk index for a page,

play08:55

and then reset it to zero whenever we see a new page.

play08:59

So putting all that together, we now have a

play09:01

chunk id that looks something like these.

play09:03

Each chunk is now guaranteed a unique and deterministic id.

play09:07

Let's add it back into the metadata of the chunk as well so we can use it later.
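
Putting those steps together, an ID-assignment helper could look roughly like this; the "source:page:index" format follows the description above, and the function name is just an assumption:

    def calculate_chunk_ids(chunks):
        # Gives every chunk a deterministic ID such as "data/monopoly.pdf:6:2"
        # (source file path : page number : chunk index within that page).
        last_page_id = None
        chunk_index = 0

        for chunk in chunks:
            source = chunk.metadata.get("source")
            page = chunk.metadata.get("page")
            current_page_id = f"{source}:{page}"

            # Increment the index on the same page, reset it on a new page.
            if current_page_id == last_page_id:
                chunk_index += 1
            else:
                chunk_index = 0
            last_page_id = current_page_id

            chunk.metadata["id"] = f"{current_page_id}:{chunk_index}"

        return chunks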

play09:11

Now, if we add new PDFs or add new pages to an existing

play09:14

PDF, our system will have a way to check

play09:16

whether it's already in the database or not.

play09:19

So let's hop over to the code editor and see this in action.

play09:23

Currently, in my data folder, I've got a Monopoly PDF and a Ticket to Ride PDF.

play09:28

So now I'm going to add a new PDF to this folder.

play09:31

It's going to be the one for CodeNames.

play09:33

This is the one I'm adding.

play09:34

So now when I populate the database, I want my program to detect

play09:38

that this one is new, but the other two already exist.

play09:42

So I only want this one to be added.

play09:45

So here, right away, it's quickly detected

play09:48

that there's 41 documents already inside the

play09:52

database, but we have 27 new documents

play09:55

that we need to add just because I moved that

play09:58

new pdf into the data directory as well.

play10:02

So that was a new one.

play10:03

And this time, even if we run the same command

play10:05

again to populate the database, it can see that

play10:08

all the documents, all the pdfs inside that

play10:10

data folder have already been added from the previous

play10:13

step and there was nothing new to add.

play10:16

So this is exactly the behavior that we want.

play10:18

Although this implementation will let us add

play10:20

new data without having to recreate the entire

play10:23

database itself, it's actually not enough

play10:26

for us if we wanted to edit an existing page.

play10:29

For example, if I modify the pdf content in this chunk

play10:32

here, the chunk ID will still be exactly the same.

play10:36

So how do we know when we need to actually update this page?

play10:39

This problem is out of scope for today, but

play10:41

there's actually many ways to solve this.

play10:43

If you think you know the solution, then please share it in the comments.

play10:46

Now let's close the loop on this and actually take a look

play10:48

at the code that you need for updating your database.

play10:51

Now that we've given every chunk a unique ID, let's add them to the database.

play10:55

If you're using chroma, you can first load up your database like

play10:58

this, using the same embedding function we used earlier.

play11:01

Let's go through all the items in the database and get all of the IDs.

play11:05

If you're running this for the very first time, then this should be an empty set.

play11:09

After that, we can filter through all of the chunks we're about to add.

play11:12

If we don't see an ID inside the set, that means

play11:14

it's a new chunk and we should add it.

play11:17

From there, it's all pretty easy.

play11:19

It's just a few lines to add the documents to the database.

play11:22

Just don't forget to also add the IDs explicitly as well.

play11:25

If you don't specify a matching list of IDs for

play11:27

the items that you're adding, then chroma will

play11:30

generate new UUIDs for us automatically.

play11:33

It's convenient, but it also means that we won't be able

play11:35

to check for the existing items like we did earlier.

play11:38

So if that's the case, when we try to add new

play11:40

items, we're just going to end up with a

play11:41

lot of duplicated items inside the database.
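
A condensed sketch of that add-only-new-chunks logic; the structure is an assumption based on the description, and it reuses CHROMA_PATH, get_embedding_function(), and calculate_chunk_ids() from the earlier sketches:

    from langchain_community.vectorstores import Chroma

    def add_to_chroma(chunks):
        # Load the existing database with the same embedding function as before.
        db = Chroma(
            persist_directory=CHROMA_PATH,
            embedding_function=get_embedding_function(),
        )
        chunks = calculate_chunk_ids(chunks)

        # IDs already stored in the database (an empty set on the very first run).
        existing_ids = set(db.get(include=[])["ids"])

        # Keep only the chunks whose ID is not in the database yet.
        new_chunks = [c for c in chunks if c.metadata["id"] not in existing_ids]
        print(f"Existing documents: {len(existing_ids)}, new documents: {len(new_chunks)}")

        if new_chunks:
            # Pass the IDs explicitly; otherwise Chroma generates random UUIDs
            # and duplicate detection stops working on the next run.
            db.add_documents(new_chunks, ids=[c.metadata["id"] for c in new_chunks])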

play11:44

Now let's put all this together and make this not just

play11:47

functional, but also able to run locally as well.

play11:50

If you were using Ollama's local embeddings from before, you'll

play11:53

be able to do everything 100% locally, end to end.

play11:57

Or you might end up with more of a hybrid approach like me.

play12:00

I use an online embedding model because it's better than what I can do locally.

play12:04

But I found that as long as the embeddings are good,

play12:07

I can actually get pretty impressive results using

play12:10

a local LLM to do the actual chat interface.

play12:13

So that's what we're going to do here.

play12:15

We can start by creating a new Python script or

play12:18

function that will take our query as input.

play12:21

We'll also have to load the embedding function and the database.

play12:24

We'll need to prepare a prompt for our LLM.

play12:27

Here's the template I'm going to use.

play12:29

There's two variables we'll need to replace here.

play12:31

First is the context, which is going to be all the chunks

play12:34

from our database that best matches the query.

play12:37

And then second, it's the actual question that we want to ask.

play12:40

So we'll put that whole thing together and then we get

play12:42

the final prompt that we want to send to our LLM.

play12:45

To retrieve the relevant context, we'll need to search

play12:48

the database, which will give us a list of

play12:50

the top K most relevant chunks to our question.

play12:53

Then we can use that together with the original

play12:55

question text to generate the prompt.
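
A sketch of that prompt assembly; the template wording is paraphrased from the description rather than copied from the video, and the helper name is an assumption:

    from langchain.prompts import ChatPromptTemplate

    PROMPT_TEMPLATE = """
    Answer the question based only on the following context:

    {context}

    ---

    Answer the question based on the above context: {question}
    """

    def build_prompt(query_text, db, k=5):
        # Fetch the k most relevant chunks and join them into one context block.
        results = db.similarity_search_with_score(query_text, k=k)
        context = "\n\n---\n\n".join(doc.page_content for doc, _score in results)
        prompt = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
        return prompt.format(context=context, question=query_text), results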

play12:58

If you decide to print out the entire prompt at

play13:00

this stage, you should see something like this.

play13:03

So you've got your entire prompt template here, but you

play13:06

could see that our context section already has some of

play13:09

the chunks from the instruction manual formatted in.

play13:13

And I put my k=5, so there's actually five different chunks.

play13:18

And this is all part of one big prompt.

play13:21

This is the information that my system thought

play13:23

was the best matching to answer our query.

play13:26

And then I kind of reiterate the question that I want right

play13:29

at the end after I've given all of this context.

play13:32

So here the question is, how many clues can I give in code names?

play13:35

And the response is, in code names you can only give one

play13:38

clue per turn, and the clue should be a single word.

play13:41

And then I also have the sources of this answer cited here,

play13:44

so that's basically where all these chunks were found.

play13:48

After you have the prompt, the rest is super easy.

play13:51

All you have to do is just invoke an LLM with the prompt.

play13:54

Here I'll use the Mistral model on my local Ollama server.

play13:57

It only needs four gigabytes to run, but it's actually quite capable.

play14:01

And if you want, you can also get the original source of the text like this.
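
A sketch of that final step, invoking the local model through Langchain's Ollama wrapper and collecting the sources; it reuses CHROMA_PATH, get_embedding_function(), and build_prompt() from the sketches above, and treats the metadata field used for sources as an assumption:

    from langchain_community.llms import Ollama
    from langchain_community.vectorstores import Chroma

    def query_rag(query_text: str):
        db = Chroma(
            persist_directory=CHROMA_PATH,
            embedding_function=get_embedding_function(),
        )
        prompt, results = build_prompt(query_text, db)

        # Generate the answer with the local Mistral model served by Ollama.
        model = Ollama(model="mistral")
        response_text = model.invoke(prompt)

        # The chunk IDs stored in the metadata double as source references.
        sources = [doc.metadata.get("id") for doc, _score in results]
        print(f"Response: {response_text}\nSources: {sources}")
        return response_text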

play14:04

Now let's go back to our terminal and see this in action.

play14:07

So I'm going to use this program and I'm going to query it.

play14:10

How do I get out of jail in Monopoly?

play14:13

And now the program stopped running, so let's go and see what it did.

play14:16

Here you can see that we find all the relevant chunks.

play14:20

So this one is the most relevant, and it's actually spot on.

play14:24

It actually gives us step-by-step instructions on how to get out of jail.

play14:27

So I think really this is the only one we need.

play14:29

But anyways, we put our limit to five, so we also get a bunch

play14:32

of other chunks that may be relevant to the question.

play14:35

And then as part of the prompt, we reiterate the question

play14:38

again so that our LLM knows what to answer.

play14:41

And using all of that information, this is the response our LLM came up with.

play14:46

So it came up with four different ways we can get out of jail in Monopoly.

play14:49

And then right at the end, we also have the sources of all of this information.

play14:53

So that's what it's like when we run the entire application.

play14:56

And even though I used AWS Bedrock for the embeddings,

play14:58

because I couldn't get local embeddings

play15:01

that were good enough, this part to generate

play15:03

the question still uses a local Ollama server.

play15:06

So if I go to my other terminal here, see where

play15:08

my Ollama server is running, you could

play15:10

see it logging the work that we're doing.

play15:12

We now have a RAG application that works quite well end-to-end.

play15:16

We can get it to answer our questions by using the embedded

play15:18

source material, but the quality of the answers we

play15:21

get would depend on quite a lot of different factors.

play15:24

For example, it could depend on the source material

play15:26

itself, or the way we split the text.

play15:28

And it will also 100% depend on the LLM model we

play15:31

use for the embedding and the final response.

play15:34

So the problem we have now is, how do we evaluate the quality of responses?

play15:39

This seems to be a subjective matter.

play15:41

Let's see if we can approach this with unit testing.

play15:44

If you've never worked with unit tests in Python

play15:46

before, then you can also check out my other

play15:48

video on how to get started with pytest.

play15:50

The main idea here is to write some sample questions and also

play15:53

provide the expected answer for each of those questions.

play15:57

So given a question like, "How much total money does

play16:00

a player start with in Monopoly?", the answer I'd

play16:03

expect my RAG application to respond with is 1500.

play16:06

You want it to be something that you can already

play16:08

validate or already know the answer for.

play16:11

We can then run the test by passing the question

play16:14

into our actual app, and then comparing

play16:17

and asserting that the answer matches.

play16:20

But the challenge with this is that we can't do

play16:22

a strict equality comparison, because there could

play16:25

be many ways to express the right answer.

play16:28

So what we can do instead is actually use an LLM to judge the answer for us.

play16:33

This won't always guarantee perfect results, but it does get us pretty close.

play16:38

We can start by having a prompt template like this, that asks

play16:41

the LLM to judge whether these responses are equivalent.

play16:45

Then, as part of our test, we'll query the

play16:47

RAG app with our question, and then we'll

play16:49

create a prompt based on the question, the

play16:52

expected response, and the actual response.

play16:55

We can then invoke our LLM again to give us its opinion.

play16:59

We can clean up the response we get from that, and

play17:01

finally check whether the answer is true or false.

play17:04

And this is something we'll actually be able to assert on as part of our unit test.

play17:08

So putting all that together, I can wrap this into

play17:11

a nice helper function that returns true or false.
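
A sketch of such a helper; the evaluation prompt wording and the helper name are assumptions, and query_rag is the function sketched earlier:

    from langchain_community.llms import Ollama

    EVAL_PROMPT = """
    Expected Response: {expected_response}
    Actual Response: {actual_response}
    ---
    (Answer with 'true' or 'false') Does the actual response match the expected response?
    """

    def query_and_validate(question: str, expected_response: str) -> bool:
        # Ask the RAG app, then let an LLM judge whether the answers are equivalent.
        actual_response = query_rag(question)
        prompt = EVAL_PROMPT.format(
            expected_response=expected_response, actual_response=actual_response
        )

        verdict = Ollama(model="mistral").invoke(prompt).strip().lower()
        if "true" in verdict:
            return True
        if "false" in verdict:
            return False
        raise ValueError(f"Could not interpret the evaluation result: {verdict!r}")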

play17:14

Then, I can just write a bunch of unit tests using that helper

play17:17

function, and I can write as many test cases as I want.

play17:20

This will give me a quick way to see how well my application

play17:23

is performing, especially after I make updates to

play17:26

the code, the source documents, or the LLM model itself.

play17:30

Now let's hop over back to our editor to do a quick demo.

play17:32

So I've got my test file here, and here is the helper

play17:35

function that you saw earlier, and here is us trying

play17:38

to interpret that result into either a true

play17:40

or a false result, and here is the prompt template.

play17:44

So these are going to be my two test cases.

play17:46

I'm going to test the monopoly rules, and I'm

play17:47

also going to test the ticket to ride rules.

play17:49

So two test cases. Let's see how it does.
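
Those two cases might be written as pytest functions like this, using the query_and_validate helper sketched above; the question wording is paraphrased from the examples mentioned in the video:

    def test_monopoly_rules():
        # Positive case: the expected answer is known to be correct.
        assert query_and_validate(
            question="How much total money does a player start with in Monopoly?",
            expected_response="1500",
        )

    def test_ticket_to_ride_rules():
        assert query_and_validate(
            question="How many bonus points does the longest continuous train get in Ticket to Ride?",
            expected_response="10 points",
        )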

play17:52

Okay, and in this case, both of my test cases passed.

play17:55

Let's expand this window and actually take a bit of a closer look.

play17:58

So here, my expected response is 10 points,

play18:00

and the actual response is "The longest continual

play18:03

train gets a bonus of 10 points."

play18:05

So these are not exactly the same string,

play18:08

but they're still saying the same thing.

play18:11

And this is true. So this was successful.

play18:15

And then if I go up to my monopoly one, the expected response

play18:19

is 1,500, and the actual response is also 1,500.

play18:23

And as you can see again, the format is slightly

play18:25

different, so we need the LLM to tell us whether

play18:27

or not these actually mean the same thing.

play18:30

So this one passed as well. In this case, both of our tests passed.

play18:34

Now, we have to be careful with this because we

play18:36

don't know whether it passed because the evaluation

play18:39

was good and the answer was correct, or

play18:41

if our LLM turns out to be too generous, we might

play18:44

actually end up passing the wrong answers.

play18:47

So it's also good to do a negative test case to kind of check that.

play18:51

So what we could do is we can turn this expected

play18:53

response into something we know that's wrong

play18:56

and then check that it actually fails.

play18:58

We want it to fail in that case.

play19:00

So I'm going to put 9999.

play19:02

Okay, and I'm now running that test again, expecting this case to fail.

play19:07

And here it actually does fail, which is good. That's exactly what we wanted.

play19:11

So we have our fake expected response of 9999,

play19:14

and then the actual response is still the same

play19:17

from when we asked it before, which is 1,500.

play19:20

And our LLM evaluation correctly determines that this is the wrong response.

play19:26

So our test will fail in this case, and our entire test suite will fail.

play19:29

However, if we want a failing test, if we want this

play19:32

negative case to be used as part of our suite in the

play19:35

correct way, what we could actually do is go back

play19:37

to our test case here and then invert the assertion.

play19:41

So instead of asserting that this is true, we can

play19:44

assert that this is actually going to fail.

play19:47

And that also tells us that this answer should be wrong, and

play19:50

something is wrong if it's not wrong, if that makes sense.
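
As a sketch, that negative case with the inverted assertion could look like this, with the deliberately wrong value following the 9999 example above:

    def test_monopoly_rules_negative():
        # Negative case: the expected answer is deliberately wrong, so the LLM
        # judge should say the responses do not match, and we assert on that.
        assert not query_and_validate(
            question="How much total money does a player start with in Monopoly?",
            expected_response="9999",
        )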

play19:54

So let's go ahead and run this again.

play19:56

So this time the LLM still believes that the

play19:59

response doesn't match, and it's false.

play20:03

But because we've inverted the assert case, the

play20:05

entire test suite still manages to pass.

play20:07

So I recommend that if you're going to write tests for

play20:09

LLM applications like this, it's good to have both

play20:12

positive cases and negative cases being tested.

play20:15

And by the way, if you do have a lot of different

play20:17

test cases you want to use, you maybe don't

play20:20

need to assert that 100% of them succeed.

play20:23

You could maybe set a threshold for what is good enough.

play20:26

For example, 80% or 90%.

play20:29

So now you've leveled up your project by learning

play20:31

how to use different LLMs, including a

play20:33

local one, and you've also learned how to add

play20:36

new items to your database, and how to test

play20:38

the quality of your application as a whole.

play20:41

These were all topics that were brought up in the

play20:43

comment section of my previous RAG tutorial.

play20:46

And so after watching this, if there's more

play20:47

things you'd like to learn how to do, like

play20:49

deploy this to the cloud for example, then

play20:51

let me know in the comments of this video and

play20:53

we can build it together in the next one.

play20:55

I know we went through the project quite quickly.

play20:57

My focus here was to show you the coding snippets that

play21:00

mattered the most and helping you to understand them.

play21:03

So I've actually had to simplify a bunch of the code and the ideas along the way.

play21:07

But if you want to take a closer look and see

play21:09

how all the pieces fit together into a project,

play21:12

or you just want to download a code

play21:14

that you can run right away, then check out

play21:16

the GitHub link in the video description.

play21:19

There you'll have access to the entire project that

play21:21

I used for this video, and something that I was

play21:24

running end-to-end as you saw in the demo here.

play21:27

Anyways, I hope this was useful, and I'll see you in the next one.


Related Tags
RAG · Python · App · PDFs · Natural Language · Game Manuals · Monopoly · CodeNames · Vector Database · Unit Tests · LLM