Python RAG Tutorial (with Local LLMs): AI For Your PDFs

pixegami
17 Apr 2024 · 21:33

Summary

TL;DR: This tutorial guides viewers through building a Python RAG application for querying information from a set of PDFs using natural language. It covers advanced features like running the app locally with open-source LLMs, updating the vector database incrementally, and evaluating AI responses. The host demonstrates how to index data sources, use embeddings, and integrate local or online models to generate natural-language responses, concluding with unit-testing strategies to ensure quality.

Takeaways

  • 📚 The video demonstrates building a Python RAG (Retrieval-Augmented Generation) application for querying information from a set of PDFs, specifically board game instruction manuals.
  • 🔍 It introduces advanced features for the RAG application, including running it locally with open-source LLMs (Large Language Models) and updating the vector database without rebuilding from scratch.
  • 🛠️ The tutorial covers the process of setting up the application, from gathering documents to using a PDF document loader and splitting the content into smaller chunks for indexing.
  • 📈 The importance of creating embeddings for each chunk of text is highlighted, as these serve as keys in the vector database and are crucial for the RAG system to function effectively.
  • 🔧 The video explains how to use ChromaDB as the vector database and how to tag each chunk with a unique ID to manage updates and additions to the database.
  • 🔄 It shows how to detect new documents and update the database by checking for unique IDs, allowing for incremental updates instead of full rebuilds.
  • 🤖 The application uses an LLM to generate responses to queries, with the video providing a demonstration of how the system formulates answers using context from the PDFs.
  • 🔬 The script discusses the evaluation of the AI-generated responses through unit testing, using an LLM to judge the equivalence of expected and actual responses.
  • 🔗 The video provides a GitHub link for those interested in accessing the full project code and running the application themselves.
  • 💡 The tutorial encourages viewers to suggest further topics for future videos, such as deploying the application to the cloud, fostering a community of learners.
  • 🚀 The video concludes by emphasizing the learning outcomes, such as using different LLMs, updating databases, and testing application quality, and invites viewers to engage with the content.

Q & A

  • What is the primary purpose of the application built in the video?

    -The application is designed to allow users to ask natural language questions about a set of PDFs, specifically board game instruction manuals, and receive answers along with references to the source material.

  • What does RAG stand for and what is its role in the application?

    -RAG stands for Retrieval-Augmented Generation. It is a method used to index a data source so that it can be combined with a Large Language Model (LLM) to provide an AI chat experience leveraging the indexed data.

  • How does the application handle the process of updating the vector database with new entries?

    -The application updates the vector database by first giving each chunk of text a unique and deterministic ID based on the source file path, page number, and chunk number. It then checks if the chunk exists in the database; if not, it adds the new chunk.

  • What is the significance of embeddings in the context of this application?

    -Embeddings are a key component in the application, serving as a numerical representation of the text chunks and queries. They are used to fetch the most relevant entries from the vector database when a question is asked.

  • What is the role of the Ollama server in the application?

    -The Ollama server is used to run open-source LLMs locally on the user's computer. It provides the capability to generate responses using a local model, which can be more efficient and cost-effective than relying solely on online models.

  • How does the application handle the case of adding new PDFs or pages to an existing PDF?

    -The application detects new documents or pages by comparing the unique IDs of the existing chunks in the database with the new chunks derived from the added PDFs or pages. Only the new chunks that do not exist in the database are added.

  • What is the significance of using a unique but deterministic ID for each chunk?

    -Using a unique but deterministic ID for each chunk ensures that the application can accurately identify whether a chunk already exists in the database, allowing for efficient updates and avoiding duplication.

  • What is ChromaDB and how does it fit into the application?

    -ChromaDB is a vector database used in the application to store the embeddings of the text chunks. It allows for efficient retrieval of the most relevant chunks when a query is made.

  • How does the application evaluate the quality of AI-generated responses?

    -The application uses unit testing with a helper function that creates a prompt for an LLM to judge whether the expected response and the actual response are equivalent in meaning, despite potential differences in wording.

  • What is the importance of testing the application with both positive and negative test cases?

    -Testing with both positive and negative cases helps ensure the robustness of the application. Positive cases confirm that the application works correctly with expected inputs, while negative cases verify that it can correctly identify and handle incorrect or unexpected inputs.

  • How can users access the full project code and run the application themselves?

    -Users can access the full project code by visiting the GitHub link provided in the video description. This allows them to download the code and run the application end-to-end as demonstrated in the video.

Outlines

00:00

🛠️ Building a Python RAG Application

This paragraph introduces a project to create a Python application using the Retrieval-Augmented Generation (RAG) model. The app is designed to answer questions about a set of PDF documents, specifically board game instruction manuals for games like Monopoly and CodeNames. The video promises to cover advanced features, including running the app locally with open-source Large Language Models (LLMs), updating the vector database with new entries without rebuilding from scratch, and testing the AI's responses. The paragraph also provides a quick demo of the app in action, explaining the basic concept of RAG and how it combines an LLM with indexed data to provide natural language responses.

05:04

📚 Document Preparation and Embedding Creation

The second paragraph delves into the process of preparing documents for the RAG application. It discusses gathering PDFs as source material and using the Langchain library to load documents and split them into smaller, manageable chunks. The importance of creating an embedding function for these chunks is emphasized, as it serves as a key for the database. The paragraph also mentions different embedding options, such as AWS Bedrock and Ollama, and how to integrate them into the application. The process of building a vector database with ChromaDB and updating it with new or modified documents is outlined, including the use of unique IDs for each chunk to avoid duplication.
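
As a rough illustration of this stage, here is a minimal sketch using LangChain. Import paths vary between LangChain versions, and the data folder name, chunk sizes, and the nomic-embed-text model are illustrative assumptions rather than values confirmed in the video.

```python
# Sketch: load PDFs, split them into chunks, and build an embedding function.
# Import paths assume a recent langchain / langchain-community install.
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import BedrockEmbeddings, OllamaEmbeddings

DATA_PATH = "data"  # folder containing the board game PDFs (assumed layout)

def load_documents():
    # Each page of each PDF becomes one Document with source/page metadata.
    return PyPDFDirectoryLoader(DATA_PATH).load()

def split_documents(documents):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,   # example values; tune for your documents
        chunk_overlap=80,
    )
    return splitter.split_documents(documents)

def get_embedding_function(local: bool = False):
    # The same function must be used when building AND when querying the database.
    if local:
        # Example local model; requires a running Ollama server with it pulled.
        return OllamaEmbeddings(model="nomic-embed-text")
    return BedrockEmbeddings()  # uses your configured AWS credentials/region
```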

10:06

🔄 Updating the Database with New Content

This paragraph focuses on the functionality of updating the vector database with new PDFs or changes to existing documents. It explains how to detect new documents and avoid re-adding existing ones, ensuring efficient database management. The paragraph also touches on the challenge of updating modified content within a document, hinting at solutions but stating it's beyond the current scope. The code snippets provided demonstrate how to add new documents to the database using unique IDs and how to ensure that only new or updated content is added, maintaining database integrity.
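
The sketch below shows the deterministic-ID idea described here, using the LangChain Chroma wrapper. Exact method signatures differ between versions, and the CHROMA_PATH directory name is an assumption; the ID format (source:page:chunk index) follows the video.

```python
# Sketch: tag chunks with deterministic IDs ("source:page:chunk_index") and
# add only the chunks that Chroma hasn't stored yet.
from langchain_community.vectorstores import Chroma

CHROMA_PATH = "chroma"  # assumed persistence directory

def assign_chunk_ids(chunks):
    last_page_id, index = None, 0
    for chunk in chunks:
        source = chunk.metadata.get("source")
        page = chunk.metadata.get("page")
        page_id = f"{source}:{page}"
        index = index + 1 if page_id == last_page_id else 0  # reset per new page
        chunk.metadata["id"] = f"{page_id}:{index}"
        last_page_id = page_id
    return chunks

def add_to_chroma(chunks, embedding_function):
    db = Chroma(persist_directory=CHROMA_PATH, embedding_function=embedding_function)
    existing_ids = set(db.get(include=[])["ids"])  # IDs are always returned
    new_chunks = [c for c in assign_chunk_ids(chunks)
                  if c.metadata["id"] not in existing_ids]
    if new_chunks:
        # Passing explicit IDs is what makes the existence check possible;
        # otherwise Chroma generates fresh UUIDs and duplicates pile up.
        db.add_documents(new_chunks, ids=[c.metadata["id"] for c in new_chunks])
    print(f"Existing chunks: {len(existing_ids)}, new chunks added: {len(new_chunks)}")
```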

15:06

🤖 Integrating the LLM for Response Generation

The fourth paragraph describes the integration of a local Large Language Model (LLM) for generating responses to user queries. It details the process of creating a Python script that takes a query, uses an embedding function to search the database for relevant chunks, and constructs a prompt for the LLM. The paragraph explains how to retrieve the most relevant context from the database and combine it with the user's question to form a complete prompt. It also discusses using the LLM to generate a response and how to handle different approaches to local versus online embeddings, including using an Ollama server for local embeddings or an online service like AWS Bedrock for better quality.
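
A condensed sketch of that query path follows. The k=5 retrieval and the Mistral model match what the video describes, while the prompt wording, function names, and import paths are assumptions that depend on your LangChain version.

```python
# Sketch: embed the question, pull the top-k chunks from Chroma, fill a prompt
# template, and ask a local Ollama model for the answer.
from langchain_community.vectorstores import Chroma
from langchain_community.llms import Ollama

PROMPT_TEMPLATE = """Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""

def query_rag(query_text, embedding_function, chroma_path="chroma"):
    db = Chroma(persist_directory=chroma_path, embedding_function=embedding_function)
    results = db.similarity_search_with_score(query_text, k=5)
    context = "\n\n---\n\n".join(doc.page_content for doc, _score in results)
    prompt = PROMPT_TEMPLATE.format(context=context, question=query_text)
    model = Ollama(model="mistral")  # requires `ollama serve` with mistral pulled
    response = model.invoke(prompt)
    sources = [doc.metadata.get("id") for doc, _score in results]
    return f"{response}\nSources: {sources}"
```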

20:07

📝 Evaluating AI Response Quality with Unit Testing

The final paragraph addresses the evaluation of the AI-generated responses' quality through unit testing. It introduces the concept of writing test cases with expected answers and using an LLM to judge the equivalence of the expected and actual responses. The paragraph outlines creating a prompt template for the LLM to evaluate response correctness and suggests using an LLM's judgment to determine if the test passes or fails. It also mentions the importance of including both positive and negative test cases and setting a threshold for acceptable test success rates. The paragraph concludes with a demonstration of running test cases and adjusting assertions to reflect the correctness of the responses.
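
Below is a sketch of that judge helper plus one positive test case, intended for pytest. It assumes the query_rag() function from the query sketch above; the evaluation prompt wording is paraphrased, not quoted from the video.

```python
# Sketch: use an LLM to judge whether the RAG answer matches the expected answer.
from langchain_community.llms import Ollama
# from query_data import query_rag  # hypothetical module containing the query sketch

EVAL_PROMPT = """Expected Response: {expected_response}
Actual Response: {actual_response}
---
(Answer with 'true' or 'false') Does the actual response match the expected response?
"""

def query_and_validate(question: str, expected_response: str) -> bool:
    actual_response = query_rag(question)  # the RAG app under test
    prompt = EVAL_PROMPT.format(
        expected_response=expected_response, actual_response=actual_response
    )
    verdict = Ollama(model="mistral").invoke(prompt).strip().lower()
    if "true" in verdict:
        return True
    if "false" in verdict:
        return False
    raise ValueError(f"Could not interpret evaluation result: {verdict!r}")

def test_monopoly_starting_money():
    assert query_and_validate(
        question="How much total money does a player start with in Monopoly?",
        expected_response="$1500",
    )
```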

Keywords

💡RAG

RAG stands for Retrieval-Augmented Generation. It is a method used to index a data source, which can then be combined with a Large Language Model (LLM) to provide an AI chat experience that can leverage that data. In the video, RAG is used to build an application that can answer questions about PDFs, specifically board game instruction manuals, by referencing the content within those documents.

💡LLM (Large Language Model)

LLM refers to Large Language Models, which are AI models trained to understand and generate human-like text based on the input provided to them. In the context of the video, an LLM is used to generate responses to questions about the content of PDF documents after retrieving relevant information from a database.

💡PDF

PDF stands for Portable Document Format, a file format used to present documents in a manner independent of application software, hardware, and operating systems. In the video, PDFs are used as the data source for the RAG application, containing instructions for board games which the application will index and query.

💡Embedding

In the context of the video, an embedding is a numerical representation of a text chunk that serves as a key in a vector database. It allows the system to convert text into a format that can be compared and matched with other text chunks, facilitating the retrieval of relevant information when answering questions.
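
As a concrete illustration (not code from the video), the short sketch below embeds two pieces of text and compares them with cosine similarity. The OllamaEmbeddings class and the nomic-embed-text model are assumptions and require a running Ollama server; any embedding model would work the same way.

```python
# Sketch: an embedding is just a vector of floats, and relevance is measured
# by how similar two vectors are.
import math
from langchain_community.embeddings import OllamaEmbeddings

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

embeddings = OllamaEmbeddings(model="nomic-embed-text")  # example local model
query_vec = embeddings.embed_query("How do I build a hotel in Monopoly?")
chunk_vec = embeddings.embed_query("Hotels may be bought once four houses are on a colour group.")
print(len(query_vec), cosine_similarity(query_vec, chunk_vec))
```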

💡Vector Database

A vector database is a type of database that stores and retrieves data using vector representations of the information. In the video, embeddings of text chunks are stored in a vector database called ChromaDB, which is used to efficiently retrieve the most relevant text chunks in response to a query.

💡ChromaDB

ChromaDB is the vector database used in the video to store embeddings of text chunks. It allows for the efficient storage and retrieval of information, making it possible to update the database with new entries without rebuilding the entire database from scratch.
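
For orientation, here is a minimal sketch using the chromadb client directly rather than through LangChain. The API shown is from chromadb 0.4+ and may differ in other versions; the collection name, ID, and sample text are made up for illustration.

```python
# Sketch: a persistent Chroma collection, one document added with an explicit
# ID, and a similarity query against it.
import chromadb

client = chromadb.PersistentClient(path="chroma")            # on-disk database
collection = client.get_or_create_collection("board_games")

collection.add(
    ids=["data/monopoly.pdf:0:0"],                            # deterministic chunk ID
    documents=["Each player starts with $1500 in Monopoly."],
    metadatas=[{"source": "data/monopoly.pdf", "page": 0}],
)

results = collection.query(query_texts=["starting money"], n_results=1)
print(results["documents"])
```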

💡Ollama

Ollama is a platform mentioned in the video that manages and runs open source LLMs locally on a user's computer. It is used as an alternative to online embedding models for those who prefer or require a local solution.
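
Under the hood, Ollama exposes a small REST API on localhost, which is what the LangChain Ollama classes wrap. The sketch below calls it directly with requests, assuming `ollama serve` is running on the default port 11434 and the mistral model has been pulled.

```python
# Sketch: call the local Ollama generate endpoint directly.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral", "prompt": "Say hello in one sentence.", "stream": False},
    timeout=120,
)
print(resp.json()["response"])
```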

💡Unit Testing

Unit testing is a method of testing individual components or units of a software to determine if they function as intended. In the video, unit testing is used to evaluate the quality of the AI-generated responses by comparing the actual responses from the RAG application to expected answers and using an LLM to judge the equivalence of those responses.
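
A sketch of the negative-case idea discussed later in the video: deliberately supply a wrong expected answer and invert the assertion, so the test only passes if the LLM judge rejects it. It reuses the hypothetical query_and_validate() helper from the evaluation sketch above.

```python
# Sketch: negative test case with an inverted assertion.
def test_monopoly_starting_money_negative():
    # 9999 is intentionally wrong; the test passes only if the judge says "false".
    assert not query_and_validate(
        question="How much total money does a player start with in Monopoly?",
        expected_response="$9999",
    )
```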

💡Langchain

Langchain is a library referenced in the video that provides tools for working with documents and text processing, including document loaders and text splitters. It is used to facilitate the process of loading documents, splitting them into chunks, and preparing them for indexing in the RAG application.

💡Natural Language Processing (NLP)

NLP is a field of computer science and artificial intelligence that is concerned with the interaction between computers and human language. In the video, NLP techniques are implicitly used in the RAG application to process and understand natural language queries and generate human-like responses.

💡Local LLM Model

A local LLM model refers to an instance of a Large Language Model that runs on the user's own computer rather than relying on a remote server. In the video, the presenter uses a local LLM model via the Ollama server to generate responses, which allows for a more private and potentially faster interaction with the AI.

Highlights

Building a Python RAG application to answer questions about a set of PDFs using natural language.

Using board game instruction manuals as the data source for the RAG application.

Introduction of advanced features for the RAG tutorial, including local running and vector database updates.

Demonstration of how to get the RAG application running locally using open source LLMs.

Explanation of how to update the vector database with new entries without rebuilding from scratch.

Overview of testing and evaluating the quality of AI-generated responses for the app.

Recap and explanation of the RAG concept: Retrieval-Augmented Generation.

Demonstration of the completed app's ability to answer questions about board game instructions.

Use of a local LLM model to generate responses in the app.

Behind-the-scenes explanation of how the app processes the data and queries.

Focus on main features and speeding through other parts for experienced viewers.

Instructions on installing or updating main dependencies for the RAG project.

Guide on gathering and preparing PDF documents as the source material.

Use of Langchain library for loading documents and its document loader options.

Splitting documents into smaller chunks for indexing and storage.

Creating an embedding function for database indexing and querying.

Recommendation to use the same embedding function for database creation and querying.

Discussion on using AWS Bedrock for embeddings and the option to use local models like Ollama.

Process of creating a vector database with the chunks and their unique IDs.

Explanation of how to add new PDFs to the database without recreating it from scratch.

Challenge of updating existing pages in the database and potential solutions.

Unit testing approach to evaluate the quality of responses from the RAG application.

Use of an LLM to judge the equivalence of responses in the testing process.

Writing unit tests with sample questions and expected answers for the RAG application.

Demonstration of the testing process and the use of assertions to validate responses.

Inclusion of both positive and negative test cases for comprehensive evaluation.

Final thoughts on the project, invitation for further topics, and reference to GitHub for code.

Transcripts

play00:00

In this video, we're going to build a Python RAG

play00:02

application that lets us ask questions about

play00:05

a set of PDFs we have using natural language.

play00:08

The PDFs I'm going to use here are a bunch of board game instruction

play00:11

manuals for games like Monopoly or CodeNames.

play00:14

I can ask questions about my data, like "how do I

play00:16

build a hotel in Monopoly?" The app will give me

play00:19

an answer and a reference to the source material.

play00:22

Now, I have done a basic RAG tutorial before on this

play00:24

channel, but in this video we're going to take it up

play00:27

a notch by introducing some more advanced features

play00:30

that you guys asked about in the comments last time.

play00:33

We're going to cover how to get it running locally

play00:35

on your computer using open source LLMs.

play00:38

I'll also show you how to update the vector database with new entries.

play00:42

So if you want to modify or add information, you can do that

play00:45

without having to rebuild the entire database from scratch.

play00:49

Finally, we'll take a look at how we can test and evaluate

play00:52

the quality of our AI generated responses.

play00:55

This way you can quickly validate your app whenever you make

play00:58

a change to the data source, the code or the LLM model.

play01:02

All right, let's get started.

play01:03

If you haven't built an app like this before,

play01:06

then I highly recommend you to check out my

play01:09

previous video tutorial on this topic first.

play01:13

It will help you to get up to speed with all of the basic concepts.

play01:16

Otherwise, here's a quick recap. RAG stands for Retrieval

play01:19

Augmented Generation, and it's a way to index a

play01:22

data source so that we can combine it with an LLM.

play01:26

This gives us an AI chat experience that can leverage that data.

play01:30

Here's a quick demo of the completed app.

play01:32

I have my Python script here and I'm going to

play01:34

ask a question about my data source, which

play01:36

is going to be board game instruction manual.

play01:39

So I can ask, "how do I build a hotel in Monopoly?"

play01:44

And the result is that it gives me a response based on the

play01:48

data that it found in the PDF sources that I provided it.

play01:52

So the response is going to use that and actually phrase

play01:55

it into a proper natural language response.

play01:58

It's not just going to copy and paste the raw data source.

play02:01

And here it's telling me that if I want to build

play02:03

a hotel, I need to have four houses in a single

play02:05

color and then I can buy the hotel from the bank.

play02:08

And in this version of the app, I'm also using

play02:10

a local LLM model to generate this response.

play02:13

So here I have my Ollama server running in a separate terminal.

play02:17

If you don't know what that is yet, that's okay. We'll cover it later.

play02:20

But here's the actual LLM reading the question

play02:22

and then turning this into a response.

play02:25

Here's a quick recap on how that all works behind the scenes.

play02:29

First, we have our original data source, the PDFs.

play02:32

This data is going to be split into small chunks

play02:35

and then transformed into an embedding

play02:37

and stored inside of the vector database.

play02:40

Then when we want to ask a question, we'll also turn our query into an embedding.

play02:45

This will let us fetch the most relevant entries from the database.

play02:48

We can then use those entries together in a prompt

play02:51

and that's how we get our final response.

play02:54

For this tutorial, we're going to mainly focus on the

play02:56

features I mentioned at the beginning of the video.

play02:59

But for everything else, we're going to be speeding through it a little bit.

play03:02

So if you feel like it's all going a little bit

play03:04

too fast, you can either check out my previous

play03:06

RAG tutorial video first to learn the basics.

play03:09

Or you could also follow along by looking through the code itself on GitHub.

play03:14

Links will be in the description.

play03:16

Here are the main dependencies I'll be using in this project.

play03:18

So go ahead and install or update them first before you start.

play03:21

First, we'll need some data to feed our RAG application with.

play03:25

Gather some documents that you'd like to use as your source material.

play03:28

In my previous video, a lot of you asked me how to do this with PDFs.

play03:32

So I'm going to be using PDFs here.

play03:34

I'm going to use board game instruction manuals.

play03:36

I've got one for Monopoly and I've also got one for Ticket to Ride.

play03:40

And I just found these for free online.

play03:42

So you can use whatever you want, but this is what I'm going to use here.

play03:45

Just download the PDFs you want to use online and then put them inside a folder.

play03:49

In this case, I've put it inside this data folder here in my project.

play03:53

This is the code I can then use to load the documents from inside that folder.

play03:57

It's using a PDF document loader that comes with the Langchain library.

play04:01

And for future reference, if you want to load other types of

play04:04

documents, you can head over to the Langchain documentation.

play04:07

Look up document loaders and then just pick from any

play04:10

of the various available document loaders here.

play04:13

There's things for CSV files, a directory, HTML, Markdown and Microsoft Office.

play04:19

And if that's still not enough, you can click

play04:21

on the document loader integrations and there's

play04:23

a whole list of third-party document loaders

play04:25

available for you to choose from as well.

play04:28

And if you want to see what one of these documents

play04:30

looks like after you've loaded it,

play04:32

you could just go ahead and print it out.

play04:34

You should see an object like this.

play04:35

So each document is basically an object containing

play04:38

the text content of each page in the PDF.

play04:41

It also has some metadata attached, which tells

play04:43

you the page number and the source of the text.

play04:46

Our next problem is that each document or each page

play04:49

of the PDF is probably too big to use on its own.

play04:52

We'll need to split it into smaller chunks and we can use Langchains

play04:55

built-in recursive text splitter to do exactly that.

play04:59

After you run that on your documents, you'll find that each chunk is a lot smaller.

play05:04

So this is going to be handy when we index and store the data.

play05:07

Next, we'll need to create an embedding for each chunk.

play05:10

This will become something like a key for a database.

play05:13

I actually recommend creating a function that returns

play05:15

an embedding function because we're actually going to

play05:18

need this embedding function in two separate places.

play05:21

The first is going to be when we create the database itself.

play05:24

And the second is when we actually want to query the database.

play05:28

And it's very important that we use the exact same

play05:30

embedding function in both of these places.

play05:33

Otherwise, it's not going to work.

play05:35

Langchain also comes with a lot of different embedding functions you can use.

play05:39

In this case, I'm using AWS Bedrock because I tend

play05:41

to build a lot of stuff using AWS already.

play05:44

And the results are pretty good, from what I can tell.

play05:46

But you can switch to using a different embedding function as well.

play05:49

You can choose from any of the embedding integrations

play05:51

listed here on the Langchain website.

play05:54

For example, if you want to run it completely locally on your

play05:57

own computer, you can use an Ollama embedding instead.

play06:01

Of course, for this to work, you also need to install Ollama

play06:03

and run the Ollama server on your computer first.

play06:06

If you haven't used Ollama before, you can think

play06:08

of it as a platform that manages and runs

play06:11

open source LLMs locally on your computer.

play06:14

Just download it from the official website, Ollama.com,

play06:17

and then install any of the available

play06:20

open source models like Llama2 or Mistral.

play06:23

You can then run this command to serve the model as a REST API on your local host.

play06:28

Now, you'll be able to use an LLM just by calling this local API.

play06:32

Of course, the Langchain module for Ollama embeddings will handle

play06:35

all of this for you as long as the server is running.

play06:38

However, just as a heads up, for my own testing

play06:40

using one of the 4GB models on Ollama, the

play06:43

embedding results just weren't very good.

play06:46

For RAG apps, having good embeddings is essential,

play06:48

otherwise your queries won't match up with the chunks

play06:51

of information that are actually relevant.

play06:54

So for myself on this project, I'm still going to use a

play06:57

service like OpenAI or AWS Bedrock for the embeddings.

play07:01

But if your computer can handle it, you can try

play07:03

using a larger, more powerful model on Ollama

play07:05

as well, and please let me know how that goes.

play07:08

By the way, some of you might be wondering at this point,

play07:10

how did I measure the quality of the embeddings?

play07:13

Well, we'll get to that later when we look at testing.

play07:15

Now let's walk through the process of creating the database.

play07:19

Once we have the documents split into smaller chunks, we can use

play07:22

the embedding function to build a vector database with it.

play07:25

So just as a quick recap, a vector is something like

play07:28

a list of numbers, and our embeddings are actually

play07:31

a vector because they're just a list of numbers.

play07:34

So a vector database lets us store information

play07:37

using vectors as something like a key.

play07:40

And in this video, we're going to be using ChromaDB as our vector database.

play07:44

In my first video, we actually had code that looked

play07:47

a lot like this, and it's useful if we wanted

play07:50

to create a brand new database from scratch.

play07:53

But what if we wanted to add or update items in an existing database?

play07:58

ChromaDB will let us do this too, but first we'll

play08:01

need to tag every item with a string id.

play08:04

Let's go back to our chunk of text and figure out how we can do this.

play08:08

So as you can see, each chunk already has its source file path and a page number.

play08:13

So what if we put it together to do something like this?

play08:16

We'll use the source path, the page number, and then the chunk number of that page.

play08:21

Because remember, a single page could have several chunks.

play08:24

That way, every chunk will have a unique but deterministic id.

play08:28

We can then use this to see if this particular chunk exists in

play08:31

the database already, and if it's not, then we can add it.

play08:35

Implementing this is pretty easy as well.

play08:37

We can loop through all the chunks and look at its metadata.

play08:40

We'll concatenate the source and the page number to make an id.

play08:44

But because a single page is split up into multiple chunks,

play08:47

we actually have many chunks sharing the same page id.

play08:50

Solving this is pretty easy though.

play08:52

We can just keep count of the chunk index for a page,

play08:55

and then reset it to zero whenever we see a new page.

play08:59

So putting all that together, we now have a

play09:01

chunk id that looks something like these.

play09:03

Each chunk is now guaranteed a unique and deterministic id.

play09:07

Let's add it back into the metadata of the chunk as well so we can use it later.

play09:11

Now, if we add new PDFs or add new pages to an existing

play09:14

PDF, our system will have a way to check

play09:16

whether it's already in the database or not.

play09:19

So let's hop over to the code editor and see this in action.

play09:23

Currently, in my data folder, I've got a Monopoly PDF and a Ticket to Ride PDF.

play09:28

So now I'm going to add a new PDF to this folder.

play09:31

It's going to be the one for CodeNames.

play09:33

This is the one I'm adding.

play09:34

So now when I populate the database, I want my program to detect

play09:38

that this one is new, but the other two already exist.

play09:42

So I only want this one to be added.

play09:45

So here, right away, it's quickly detected

play09:48

that there's 41 documents already inside the

play09:52

database, but we have 27 new documents

play09:55

that we need to add just because I moved that

play09:58

new pdf into the data directory as well.

play10:02

So that was a new one.

play10:03

And this time, even if we run the same command

play10:05

again to populate the database, it can see that

play10:08

all the documents, all the pdfs inside that

play10:10

data folder have already been added from the previous

play10:13

step and there was nothing new to add.

play10:16

So this is exactly the behavior that we want.

play10:18

Although this implementation will let us add

play10:20

new data without having to recreate the entire

play10:23

database itself, it's actually not enough

play10:26

for us if we wanted to edit an existing page.

play10:29

For example, if I modify the pdf content in this chunk

play10:32

here, the chunk ID will still be exactly the same.

play10:36

So how do we know when we need to actually update this page?

play10:39

This problem is out of scope for today, but

play10:41

there's actually many ways to solve this.

play10:43

If you think you know the solution, then please share it in the comments.

play10:46

Now let's close the loop on this and actually take a look

play10:48

at the code that you need for updating your database.

play10:51

Now that we've given every chunk a unique ID, let's add them to the database.

play10:55

If you're using chroma, you can first load up your database like

play10:58

this, using the same embedding function we used earlier.

play11:01

Let's go through all the items in the database and get all of the IDs.

play11:05

If you're running this for the very first time, then this should be an empty set.

play11:09

After that, we can filter through all of the chunks we're about to add.

play11:12

If we don't see an ID inside the set, that means

play11:14

it's a new chunk and we should add it.

play11:17

From there, it's all pretty easy.

play11:19

It's just a few lines to add the documents to the database.

play11:22

Just don't forget to also add the IDs explicitly as well.

play11:25

If you don't specify a matching list of IDs for

play11:27

the items that you're adding, then chroma will

play11:30

generate new UUIDs for us automatically.

play11:33

It's convenient, but it also means that we won't be able

play11:35

to check for the existing items like we did earlier.

play11:38

So if that's the case, when we try to add new

play11:40

items, we're just going to end up with a

play11:41

lot of duplicated items inside the database.

play11:44

Now let's put all this together and make this not just

play11:47

functional, but also able to run locally as well.

play11:50

If you were using Ollama's local embeddings from before, you'll

play11:53

be able to do everything 100% locally, end to end.

play11:57

Or you might end up with more of a hybrid approach like me.

play12:00

I use an online embedding model because it's better than what I can do locally.

play12:04

But I found that as long as the embeddings are good,

play12:07

I can actually get pretty impressive results using

play12:10

a local LLM to do the actual chat interface.

play12:13

So that's what we're going to do here.

play12:15

We can start by creating a new Python script or

play12:18

function that will take our query as input.

play12:21

We'll also have to load the embedding function and the database.

play12:24

We'll need to prepare a prompt for our LLM.

play12:27

Here's the template I'm going to use.

play12:29

There's two variables we'll need to replace here.

play12:31

First is the context, which is going to be all the chunks

play12:34

from our database that best matches the query.

play12:37

And then second, it's the actual question that we want to ask.

play12:40

So we'll put that whole thing together and then we get

play12:42

the final prompt that we want to send to our LLM.

play12:45

To retrieve the relevant context, we'll need to search

play12:48

the database, which will give us a list of

play12:50

the top K most relevant chunks to our question.

play12:53

Then we can use that together with the original

play12:55

question text to generate the prompt.

play12:58

If you decide to print out the entire prompt at

play13:00

this stage, you should see something like this.

play13:03

So you've got your entire prompt template here, but you

play13:06

could see that our context section already has some of

play13:09

the chunks from the instruction manual formatted in.

play13:13

And I put my k=5, so there's actually five different chunks.

play13:18

And this is all part of one big prompt.

play13:21

This is the information that my system thought

play13:23

was the best matching to answer our query.

play13:26

And then I kind of reiterate the question that I want right

play13:29

at the end after I've given all of this context.

play13:32

So here the question is, how many clues can I give in code names?

play13:35

And the response is, in code names you can only give one

play13:38

clue per turn, and the clue should be a single word.

play13:41

And then I also have the sources of this answer cited here,

play13:44

so that's basically where all these chunks were found.

play13:48

After you have the prompt, the rest is super easy.

play13:51

All you have to do is just invoke an LLM with the prompt.

play13:54

Here I'll use the Mistral model on my local Ollama server.

play13:57

It only needs four gigabytes to run, but it's actually quite capable.

play14:01

And if you want, you can also get the original source of the text like this.

play14:04

Now let's go back to our terminal and see this in action.

play14:07

So I'm going to use this program and I'm going to query it.

play14:10

How do I get out of jail in Monopoly?

play14:13

And now the program stopped running, so let's go and see what it did.

play14:16

Here you can see that we find all the relevant chunks.

play14:20

So this one is the most relevant, and it's actually spot on.

play14:24

It actually gives us step-by-step instructions on how to get out of jail.

play14:27

So I think really this is the only one we need.

play14:29

But anyways, we put our limit to five, so we also get a bunch

play14:32

of other chunks that may be relevant to the question.

play14:35

And then as part of the prompt, we reiterate the question

play14:38

again so that our LLM knows what to answer.

play14:41

And using all of that information, this is the response our LLM came up with.

play14:46

So it came up with four different ways we can get out of jail in Monopoly.

play14:49

And then right at the end, we also have the sources of all of this information.

play14:53

So that's what it's like when we run the entire application.

play14:56

And even though I used AWS Bedrock for the embeddings,

play14:58

because I couldn't get local embeddings

play15:01

that were good enough, this part to generate

play15:03

the question still uses a local Ollama server.

play15:06

So if I go to my other terminal here, see where

play15:08

my Ollama server is running, you could

play15:10

see it logging the work that we're doing.

play15:12

We now have a RAG application that works quite well end-to-end.

play15:16

We can get it to answer our questions by using the embedded

play15:18

source material, but the quality of the answers we

play15:21

get would depend on quite a lot of different factors.

play15:24

For example, it could depend on the source material

play15:26

itself, or the way we split the text.

play15:28

And it will also 100% depend on the LLM model we

play15:31

use for the embedding and the final response.

play15:34

So the problem we have now is, how do we evaluate the quality of responses?

play15:39

This seems to be a subjective matter.

play15:41

Let's see if we can approach this with unit testing.

play15:44

If you've never worked with unit tests in Python

play15:46

before, then you can also check out my other

play15:48

video on how to get started with pytest.

play15:50

The main idea here is to write some sample questions and also

play15:53

provide the expected answer for each of those questions.

play15:57

So given a question like, "How much total money does

play16:00

a player start with in Monopoly?", the answer I'd

play16:03

expect my RAG application to respond with is 1500.

play16:06

You want it to be something that you can already

play16:08

validate or already know the answer for.

play16:11

We can then run the test by passing the question

play16:14

into our actual app, and then comparing

play16:17

and asserting that the answer matches.

play16:20

But the challenge with this is that we can't do

play16:22

a strict equality comparison, because there could

play16:25

be many ways to express the right answer.

play16:28

So what we can do instead is actually use an LLM to judge the answer for us.

play16:33

This won't always guarantee perfect results, but it does get us pretty close.

play16:38

We can start by having a prompt template like this, that asks

play16:41

the LLM to judge whether these responses are equivalent.

play16:45

Then, as part of our test, we'll query the

play16:47

RAG app with our question, and then we'll

play16:49

create a prompt based on the question, the

play16:52

expected response, and the actual response.

play16:55

We can then invoke our LLM again to give us its opinion.

play16:59

We can clean up the response we get from that, and

play17:01

finally check whether the answer is true or false.

play17:04

And this is something we'll actually be able to assert on as part of our unit test.

play17:08

So putting all that together, I can wrap this into

play17:11

a nice helper function that returns true or false.

play17:14

Then, I can just write a bunch of unit tests using that helper

play17:17

function, and I can write as many test cases as I want.

play17:20

This will give me a quick way to see how well my application

play17:23

is performing, especially after I make updates to

play17:26

the code, the source documents, or the LLM model itself.

play17:30

Now let's hop over back to our editor to do a quick demo.

play17:32

So I've got my test file here, and here is the helper

play17:35

function that you saw earlier, and here is us trying

play17:38

to interpret that result into either a true

play17:40

or a false result, and here is the prompt template.

play17:44

So these are going to be my two test cases.

play17:46

I'm going to test the monopoly rules, and I'm

play17:47

also going to test the ticket to ride rules.

play17:49

So two test cases. Let's see how it does.

play17:52

Okay, and in this case, both of my test cases passed.

play17:55

Let's expand this window and actually take a bit of a closer look.

play17:58

So here, my expected response is 10 points,

play18:00

and the actual response is "The longest continual

play18:03

train gets a bonus of 10 points."

play18:05

So these are not exactly the same string,

play18:08

but they're still saying the same thing.

play18:11

And this is true. So this was successful.

play18:15

And then if I go up to my monopoly one, the expected response

play18:19

is 1,500, and the actual response is also 1,500.

play18:23

And as you can see again, the format is slightly

play18:25

different, so we need the LLM to tell us whether

play18:27

or not these actually mean the same thing.

play18:30

So this one passed as well. In this case, both of our tests passed.

play18:34

Now, we have to be careful with this because we

play18:36

don't know whether it passed because the evaluation

play18:39

was good and the answer was correct, or

play18:41

if our LLM turns out to be too generous, we might

play18:44

actually end up passing the wrong answers.

play18:47

So it's also good to do a negative test case to kind of check that.

play18:51

So what we could do is we can turn this expected

play18:53

response into something we know that's wrong

play18:56

and then check that it actually fails.

play18:58

We want it to fail in that case.

play19:00

So I'm going to put 9999.

play19:02

Okay, and I'm now running that test again, expecting this case to fail.

play19:07

And here it actually does fail, which is good. That's exactly what we wanted.

play19:11

So we have our fake expected response of 9999,

play19:14

and then the actual response is still the same

play19:17

from when we asked it before, which is 1,500.

play19:20

And our LLM evaluation correctly determines that this is the wrong response.

play19:26

So our test will fail in this case, and our entire test suite will fail.

play19:29

However, if we want a failing test, if we want this

play19:32

negative case to be used as part of our suite in the

play19:35

correct way, what we could actually do is go back

play19:37

to our test case here and then invert the assertion.

play19:41

So instead of asserting that this is true, we can

play19:44

assert that this is actually going to fail.

play19:47

And that also tells us that this answer should be wrong, and

play19:50

something is wrong if it's not wrong, if that makes sense.

play19:54

So let's go ahead and run this again.

play19:56

So this time the LLM still believes that the

play19:59

response doesn't match, and it's false.

play20:03

But because we've inverted the assert case, the

play20:05

entire test suite still manages to pass.

play20:07

So I recommend that if you're going to write tests for

play20:09

LLM applications like this, it's good to have both

play20:12

positive cases and negative cases being tested.

play20:15

And by the way, if you do have a lot of different

play20:17

test cases you want to use, you maybe don't

play20:20

need to assert that 100% of them succeed.

play20:23

You could maybe set a threshold for what is good enough.

play20:26

For example, 80% or 90%.

play20:29

So now you've leveled up your project by learning

play20:31

how to use different LLMs, including a

play20:33

local one, and you've also learned how to add

play20:36

new items to your database, and how to test

play20:38

the quality of your application as a whole.

play20:41

These were all topics that were brought up in the

play20:43

comment section of my previous RAG tutorial.

play20:46

And so after watching this, if there's more

play20:47

things you'd like to learn how to do, like

play20:49

deploy this to the cloud for example, then

play20:51

let me know in the comments of this video and

play20:53

we can build it together in the next one.

play20:55

I know we went through the project quite quickly.

play20:57

My focus here was to show you the coding snippets that

play21:00

mattered the most and helping you to understand them.

play21:03

So I've actually had to simplify a bunch of the code and the ideas along the way.

play21:07

But if you want to take a closer look and see

play21:09

how all the pieces fit together into a project,

play21:12

or you just want to download a code

play21:14

that you can run right away, then check out

play21:16

the GitHub link in the video description.

play21:19

There you'll have access to the entire project that

play21:21

I used for this video, and something that I was

play21:24

running end-to-end as you saw in the demo here.

play21:27

Anyways, I hope this was useful, and I'll see you in the next one.

Related tags
Python RAG · AI Application · PDF Indexing · LLM Integration · Local Server · Cloud Model · Embedding Keys · Database Update · Quality Testing · Chat Interface · Langchain Library