How to evaluate an LLM-powered RAG application automatically.

Underfitted
26 Mar 2024 · 50:41

Summary

TLDR: The video outlines a method for testing and evaluating a RAG (Retrieval-Augmented Generation) application, specifically one that utilizes a large language model (LLM) like GPT-3.5 or GPT-4. It emphasizes the importance of robust testing to ensure the reliability of the system's outputs. The speaker introduces a process involving the creation of a knowledge base from a set of documents, the generation of test cases using GPT-4, and the use of an open-source library called Giskard for evaluation. The script details the technical steps, including setting up a vector store database, using LangChain for the RAG system, and automating the testing process with Giskard and pytest. The goal is to provide a systematic way to assess and improve the application's performance.

Takeaways

  • πŸ” The importance of testing a rack application is emphasized, highlighting the need for a systematic approach to evaluate and ensure the quality of results from a language model.
  • πŸ“ The speaker introduces a method for creating test cases and evaluating different models, such as GPT-4 and open-source alternatives, in a structured and automated manner.
  • πŸ’» Open-source tools are recommended for implementing robust testing in rack applications, with the speaker sharing their code and providing links for viewers to access and use these tools.
  • 🌐 The process of scraping a website to gather information for a rack system is demonstrated, using tools like Lang chain and Beautiful Soup for Python.
  • πŸ“š A detailed example is given using the speaker's own website, which contains a wealth of information about a machine learning systems course, to illustrate how a rack system can extract and utilize data.
  • πŸ”— The concept of embeddings and vector stores is explained, showing how they can be used to semantically identify and retrieve relevant documents for answering user queries.
  • 🧠 The use of GPT-4 for automatically generating test cases is highlighted, showcasing the ability to create relevant questions and context for evaluating a rack system's performance.
  • πŸ”§ The speaker outlines the construction of a simple rack system using Lang chain, explaining each component's role in retrieving and formatting answers to user questions.
  • πŸ“Š Gizard is introduced as a tool for evaluating the rack system, providing a report that includes a map representation of the knowledge base, component analysis, and overall correctness score.
  • πŸ› οΈ Recommendations for improving the system are provided based on the evaluation, with insights into which areas need refinement and how to focus efforts for enhancement.
  • πŸ“ˆ The automation of testing is discussed, with the speaker demonstrating how to create and run a test suite, and suggesting the integration of this process with Py Test for continuous evaluation.

Q & A

  • How can one evaluate a RAG application effectively?

    -An effective evaluation of a RAG application involves creating automated test cases, using open-source tools for robust testing, and systematically comparing different models to ensure the results are accurate and reliable.

  • What is the main challenge in testing a text generation system like a RAG application?

    -The main challenge in testing a text generation system is the subjectivity of the output, as it's difficult to compare the generated text against a fixed ground truth, unlike in classification tasks where the correct label is clear.

  • How does the speaker propose to automate the generation of test cases for a RAG system?

    -The speaker proposes using an OpenAI API key to connect to GPT-4 through the Giskard library, which automatically generates test cases from the knowledge base by prompting the model with purpose-built prompts and configuration.
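    A rough sketch of that call with Giskard's RAG toolkit (the function and argument names follow Giskard's documented API, but treat the details as assumptions; `knowledge_base` is the object built from the scraped documents, shown under the knowledge-base question below):

```python
from giskard.rag import generate_testset

# Giskard prompts GPT-4 (authenticated through the OPENAI_API_KEY
# environment variable) to write questions, reference answers, and
# reference context for the knowledge base.
testset = generate_testset(
    knowledge_base,
    num_questions=60,  # the video generates 60 test cases
    # The description is illustrative; it steers the style of the questions.
    agent_description="A chatbot answering questions about the Machine Learning School website",
)
```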

  • What is a vector store database and why is it used in the context of a RAG system?

    -A vector store database is a system that stores semantic identifiers, or embeddings, for documents. It is used in a RAG system to efficiently find relevant documents based on their content, which helps in answering user queries accurately.
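    A minimal sketch of that idea with the in-memory vector store and OpenAI embeddings used in the video (`documents` is the list of scraped chunks):

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import DocArrayInMemorySearch

# Every document is embedded once when the store is built; queries are
# embedded at search time and matched against the stored vectors.
vectorstore = DocArrayInMemorySearch.from_documents(
    documents, embedding=OpenAIEmbeddings()
)

# Returns the stored documents whose embeddings sit closest to the query's.
relevant = vectorstore.similarity_search("How much does the program cost?")
```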

  • What is the role of the 'retriever' in the RAG system?

    -The 'retriever' in the RAG system is responsible for finding the most relevant documents from the vector store database based on the user's question. It uses the question to identify and return documents that are semantically similar.
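    In LangChain terms, a sketch (assuming the `vectorstore` built above):

```python
# Expose the vector store through LangChain's retriever interface.
retriever = vectorstore.as_retriever()

# By default this returns the top 4 semantically similar documents.
docs = retriever.get_relevant_documents("What is the Machine Learning School?")
```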

  • How does the speaker plan to improve the conversational aspect of the RAG system?

    -The speaker plans to improve the conversational aspect of the RAG system by implementing a history feature, which supports keeping context during the conversation, allowing for more accurate and relevant responses to follow-up questions.

  • What is the purpose of the 'knowledge base' in the RAG system?

    -The 'knowledge base' in the RAG system serves as the collection of all the documents that the system has access to. It is used to generate test cases and to provide the context needed for the model to answer questions accurately.
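    A sketch of how the knowledge base is assembled, assuming Giskard's `KnowledgeBase` accepts a pandas DataFrame with a `text` column, as described in the video:

```python
import pandas as pd
from giskard.rag import KnowledgeBase

# One row per chunk scraped from the website; Giskard reads the "text" column.
df = pd.DataFrame([doc.page_content for doc in documents], columns=["text"])
knowledge_base = KnowledgeBase(df)
```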

  • How does the speaker integrate the testing process with 'pytest'?

    -The speaker integrates the testing process with pytest by using the ipytest library, which allows running pytest tests directly from a notebook. This enables the automation of the testing process and ensures that the system is thoroughly evaluated before deployment.
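    A minimal sketch of that setup; the test body and the question are illustrative, and `chain` is the RAG chain built in the video:

```python
# First notebook cell: wire pytest into the notebook.
import ipytest

ipytest.autoconfig()

# A later cell starting with the %%ipytest magic collects and runs tests like:
def test_chain_answers_from_the_knowledge_base():
    answer = chain.invoke({"question": "Who is the instructor of the program?"})
    assert "Santiago" in answer
```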

  • What is the significance of the 'GPT-3.5' model in the evaluation process?

    -GPT-3.5 is the main text generation model in the RAG system: it answers the questions. The test cases, and the judgment of the system's answers, come from GPT-4, so a cheaper model can be evaluated against a stronger one consistently and reliably.

  • What is the overall correctness score of the speaker's RAG system?

    -The overall correctness score of the speaker's RAG system is 73.33%, which is derived from running the automatically generated test cases through the system and letting GPT-4 judge each answer against its reference.

Outlines

00:00

🤖 Introduction to Testing a RAG Application

The paragraph discusses the challenge of testing a RAG application, specifically the lack of knowledge about how to structure a system so that the results from an LLM can be evaluated and trusted. The speaker aims to address this issue by presenting the code of a simple RAG system and a method to evaluate and test the system continuously. The goal is to establish an automated way to compare different models, such as GPT-4 and open-source alternatives, in a systematic, non-manual approach.

05:02

πŸ” Scraping a Website for Question Answering

This section details the process of using the Lang chain library to scrape a website for information that will be used to answer user questions. The speaker explains the use of a text splitter and a web-based loader to gather content, emphasizing the importance of splitting content into manageable chunks due to context size limitations in models like GPT-3.5. The process of creating a vector store database and generating embeddings for semantic identification of documents is also discussed.
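A sketch of the loading step described here, assuming LangChain's module layout at the time of the video:

```python
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Chunks of at most 1,000 characters, with a 20-character overlap between
# consecutive chunks, as configured in the video.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)

# WebBaseLoader uses Beautiful Soup behind the scenes to fetch and parse the page.
loader = WebBaseLoader("https://www.ml.school/")
documents = loader.load_and_split(text_splitter)  # ten chunks for this site
```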

10:03

🧠 Generating Test Cases for the RAG System

The paragraph describes the complexity of testing a RAG system for text generation tasks due to the subjective nature of the output. The speaker introduces the concept of automatically generating test cases using the GPT-4 model and the Giskard library. The process involves creating a knowledge base from the scraped documents and using it to generate a set of test cases with corresponding questions, reference answers, and context documents. This automated approach saves significant time and effort compared to manual test case creation.
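Because generation burns GPT-4 calls, it helps to persist the result; a sketch assuming Giskard's `QATestset` save/load helpers:

```python
from giskard.rag import QATestset

testset.save("test-set.jsonl")              # JSON Lines, one test case per line
testset = QATestset.load("test-set.jsonl")  # reload later without regenerating
print(testset.to_pandas().head())           # question, reference answer, context
```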

15:04

πŸ› οΈ Building the Rack System and Prompt Template

The speaker outlines the process of building a simple rack system that utilizes the scraped and embedded documents to answer user questions. A prompt template is created to structure the input for the GPT-3.5 model, including variables for context and question. The speaker also discusses the creation of a chain in the Lang chain library, which involves components like a map, prompt, and model invocation. The focus is on preparing the system for validation through testing with the generated test cases.
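A sketch of the template described here, using the wording shown in the video:

```python
from langchain.prompts import PromptTemplate

template = """
Answer the question based on the context below. If you can't
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""

prompt = PromptTemplate.from_template(template)
print(prompt.format(context="Here is some context", question="Here is a question"))
```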

20:06

🔧 Integrating and Testing the RAG System

In this part, the speaker explains how to integrate the components of the RAG system, including the vector store retriever and the GPT-3.5 model, to answer questions. The process involves passing the question and context to the model through the chain's components. The speaker also discusses the use of a parser to clean the model's output and the role of the itemgetter function. The paragraph concludes with a test of the chain to ensure it works correctly and returns clean, formatted strings.
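Putting the pieces together with LangChain's expression language, a sketch (`retriever` and `prompt` come from the earlier steps; `OPENAI_API_KEY` is assumed to be read from the environment as in the notebook):

```python
from operator import itemgetter

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser

model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-3.5-turbo")

chain = (
    # itemgetter pulls the question out of the input dict; piping it into
    # the retriever turns the question into a list of relevant documents.
    {
        "context": itemgetter("question") | retriever,
        "question": itemgetter("question"),
    }
    | prompt
    | model
    | StrOutputParser()  # unwrap the AIMessage into a plain string
)

chain.invoke({"question": "What is the Machine Learning School?"})
```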

25:06

📊 Evaluating the RAG System's Performance

The speaker presents the evaluation process of the RAG system using the Giskard library. The evaluation involves running the test cases through the system and comparing the generated answers with the reference answers. The results are analyzed in terms of correctness, component performance, and recommendations for improvement. The speaker highlights that Giskard uses GPT-4, the same model that generated the test cases, to judge the system's answers, emphasizing the importance of this step in refining the system.
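A sketch of the evaluation call, following the `evaluate` signature the video describes (a test set, the knowledge base, and an answer function that Giskard calls once per question):

```python
from giskard.rag import evaluate

def answer_fn(question, history=None):
    # history would carry prior conversation turns; it is unused here.
    return chain.invoke({"question": question})

report = evaluate(answer_fn, testset=testset, knowledge_base=knowledge_base)
display(report)  # renders the report inline in a notebook
```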

30:08

🔄 Automating Tests and Integrating with pytest

The final paragraph discusses the automation of the testing process and integration with pytest, a popular Python testing library. The speaker demonstrates how to create a test suite with Giskard and run the LangChain chain through it. The automation allows for repeated testing without manual intervention, which is crucial for continuous improvement and deployment readiness. The speaker also shows how to integrate these tests with pytest, enabling tests to run directly from a notebook and ensuring the model passes them before deployment.
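A sketch of that integration; the `to_test_suite` and `giskard.Model` pieces follow Giskard's documented pattern for wrapping a text-generation model, but treat the exact names and the `passed` check as assumptions:

```python
import giskard

def batch_prediction_fn(df):
    # Giskard hands us a DataFrame of questions; answer each with the chain.
    return [chain.invoke({"question": q}) for q in df["question"]]

giskard_model = giskard.Model(
    model=batch_prediction_fn,
    model_type="text_generation",
    name="Machine Learning School RAG chain",
    description="Answers questions about the Machine Learning School website",
    feature_names=["question"],
)

test_suite = testset.to_test_suite("Machine Learning School test suite")

def test_rag_chain():
    # Fails the pytest run if the suite's correctness checks don't pass.
    assert test_suite.run(model=giskard_model).passed
```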

Keywords

💡RAG Application

A 'RAG application' (Retrieval-Augmented Generation) is a system that retrieves relevant documents from a knowledge base and passes them to a language model as context for answering questions. In the video, it is the system built around trusting the results produced by an LLM, and the speaker aims to address how to effectively test and evaluate it to ensure it produces reliable results.

💡LLM (Large Language Model)

An 'LLM' or Large Language Model is an artificial intelligence system that processes and generates human-like text based on the input it receives. These models are often used in natural language processing tasks. In the video, the speaker discusses the challenge of testing a system built around trusting the outputs of an LLM, such as GPT-3.5 or GPT-4.

💡Test Cases

Test cases are specific scenarios or conditions prepared to test a program or system to ensure it works correctly. They typically include a range of inputs and expected outcomes. In the video, the speaker is interested in creating automated test cases to evaluate how accurately the RAG application answers questions.

💡GPT-4

GPT-4 is OpenAI's most capable model in the GPT (Generative Pre-trained Transformer) series at the time of the video. In the video, Giskard uses GPT-4 both to generate the test cases and to judge the answers produced by the RAG application.

💡Open-source Model

An open-source model refers to a model whose weights and source code are made publicly available, allowing anyone to view, use, modify, and distribute it. In the context of the video, the speaker discusses the option of using an open-source model instead of a proprietary model like GPT-3.5 or GPT-4 for their RAG application.

💡Continuous Testing

Continuous testing is a software testing approach where tests are repeatedly run as part of the development process to ensure that new changes do not break existing functionality. In the video, the speaker emphasizes the importance of continuously testing the RAG application to maintain its reliability and accuracy.

💡Giskard

Giskard is the open-source library the video uses for evaluating and testing the RAG application. It generates test cases from the knowledge base and evaluates the system's responses against the expected outcomes.

💡Vector Store

A vector store is a type of database that stores vector representations (embeddings) of documents or data points. These embeddings are high-dimensional vectors that capture semantic meaning, allowing for efficient searching and comparison. In the video, the speaker uses a vector store to find relevant documents for answering questions in their RAG application.

💡Knowledge Base

A knowledge base is a collection of information or data from which answers to questions can be derived. In the context of the video, the speaker creates a knowledge base from the documents on their website, which both powers the RAG application and seeds the automatic test-case generation.

💡Embeddings

Embeddings are numerical representations of words, phrases, or documents in a vector space, where each dimension captures a semantic feature. They are used in natural language processing to capture the meaning of text. In the video, embeddings are generated for the documents in the speaker's knowledge base, allowing the RAG application to match questions with relevant documents.

Highlights

The speaker discusses the challenges of testing a large language model (LLM) based system, emphasizing the need for robust testing methodologies.

The code of a simple RAG system is presented to demonstrate practical testing approaches for LLM applications.

The importance of creating test cases that can continuously evaluate the system is stressed, to ensure the reliability of the LLM's outputs.

The speaker introduces the concept of using an automated approach to compare different models, such as GPT-4 and open-source models, in a systematic manner.

The use of open-source tools and libraries, like Giskard and LangChain, is advocated for implementing robust testing of RAG applications.

The process of scraping a website to gather information for the LLM to use is explained, along with the necessity of splitting content for effective context management.

The significance of using a vector store database to generate embeddings for semantic identification of documents is highlighted.

The concept of automatically generating test cases using GPT-4 is introduced, showcasing a practical method for evaluating text generation systems.

The speaker presents a method for building a knowledge base from the scraped documents, which is used to test the system's ability to answer questions.

An overview of how to structure a prompt for the LLM to answer questions using the knowledge base is provided.

The process of creating a chain in LangChain to integrate the prompt, retriever, and model for answering questions is detailed.

The evaluation of the system is performed using Giskard, which measures the accuracy of the LLM's responses compared to reference answers.

Component analysis is used to identify strengths and weaknesses in the system, providing targeted areas for improvement.

The concept of creating a test suite for automated testing is introduced, allowing for repeated evaluation of the system with different test cases.

The integration of the testing process with the pytest library is discussed, enabling direct testing from a notebook environment.

The speaker emphasizes the importance of automating the testing process to ensure system reliability before deployment.

The video concludes with a call to action for viewers to engage with the content and provide feedback on the presented testing methodologies.

Transcripts

00:00

How can you test a RAG application? This is a question that, unfortunately, not a lot of people are trying to answer right now. They built this huge system that's supposed to trust the results of an LLM, and they have no clue, no idea, how they should structure that system so they can actually test it, actually evaluate it, to ensure the results are good results. So that's the question I want to answer today. I'm going to show you the code of a simple RAG system, and I'm going to show you one way you can think about, and implement, evaluating that system; one way you can incorporate or create test cases that you can use to test your system continuously. Even better, I'm going to show you a way you can use to evaluate different models doing the same work. So imagine you built this RAG application: I want you to have an automated way to test whether GPT-4 is better than an open-source model, and to do that systematically, in a way that does not involve you trying different things, because so far what I've seen is that most people just do the entire integration, have a couple of pet examples, try those examples, and that's it; that's the extent of testing this model. So hopefully, by the end of this video, you have a better approach to this. By the end of this video you're going to have the tools, all of them open source, that you can use to actually implement robust testing for your RAG application. Now, before I keep going: if you like this type of content, just give me a like below; that tells the algorithm that I should keep doing this type of video. It's free, so if you enjoy it, just like the video.
02:19

Let me show you what I have here. All of the code that I'm showing you is going to be linked down below, so you can follow along; you can just install it on your computer and use it. This is a notebook; I'm going to do everything in a notebook, and it's a very simple one. The first thing you see in the first cell is just loading the environment variables into the notebook so I have access to them. I'm creating this OpenAI API key, and I'm reading it from an environment variable I created before, off camera, so I have it set here. That is obviously your OpenAI API key; it comes from the .env file that I created, which I'm not going to show you because my key is in there, obviously. But you are going to need to set that environment variable. I'm going to be using Giskard here, which is an open-source library that's going to help me evaluate my RAG application, and Giskard uses that OpenAI API key environment variable to do its job, so make sure you set it. For my RAG application in particular, I'm going to be using GPT-3.5 because it's cheaper. You can change this to an open-source model if you want, or you can just use GPT-4; it doesn't really matter. That is what this variable is for: later on, when I create my model, I'm going to be using this variable to use GPT-3.5.
03:54

All right, so let's start with what really matters. My RAG application is going to answer questions from a website; or actually, it's going to answer any questions, and it's going to answer those questions using the information from a website. I teach this class called Building Machine Learning Systems That Don't Suck, and I have a website with a ton of information on it. There are testimonials; there is information about the program and its different characteristics, like how many hours it takes to finish and how many assignments there are; there is a bunch of information here: who this program is for, the stuff you will learn. There is a syllabus; you can go here and see, again, just a ton of information: how much the program costs, here's the syllabus of the program. It's just a ton of information about the program. So what I want to do is build a RAG system by scraping this website: I'm going to gather all of the information on the website, store that information, and then answer any questions from the user using this content. That's the setup for this app.
05:13

So, to scrape the website (by the way, I'm going to build my RAG application using LangChain; you don't have to use LangChain, you can use LlamaIndex if you want, that's fine; I'm going to use LangChain, that's the one I prefer), in this cell right here you can see how easy it is to do with LangChain. I can scrape the website; here is the URL of my website, ml.school, and here is what's happening: I'm importing a couple of libraries and creating a text splitter. This splitter is just a class that tells LangChain how I want to split the content I'm scraping off the website, because it's a ton of content. Let's say I'm going to scrape, I don't know, 10 pages of content. This splitter is a RecursiveCharacterTextSplitter, and it's telling LangChain that I want chunks no longer than a thousand characters, with an overlap of 20 characters between them. So what's going to happen is that the splitter goes through all the content and grabs the first 1,000 characters; those become one chunk. Then it goes to the second 1,000 characters with a 20-character overlap: it takes the last 20 characters from the first document, starts there, grabs another thousand characters, and keeps doing that on and on. Now, why do I need to split all of the content? Because my RAG system requires sending context to the model. I'm going to be telling the model: hey, answer this user question using the following context. And I want to include some context, but I do not want to send the entire website as the context, because I'd probably be violating the context size; there is a limited number of characters I can send. By splitting my website into smaller chunks, I now have a way to send only a few of these chunks at a time to answer any question. That's pretty important whenever you're using a model that has a constraint on how much context you can send. I recorded a video, it's on my channel, that goes into a lot of detail about how the context size works, how all of these models treat the context size, this splitting, the RecursiveCharacterTextSplitter, all of that good stuff. It's going to be linked somewhere here; if not, you can find it on my channel if you want more information.
08:11

Okay, so I'm defining my splitter, and now I'm going to use a WebBaseLoader. A WebBaseLoader is just a class that, behind the scenes, uses Beautiful Soup to go to that URL and scrape all of the content from it. It's very simple: as you can see, I'm just setting up the loader right here, giving it the URL, and then I'm going to call the function load_and_split and pass the text splitter I just created, so the loader knows exactly how I want to split that content. Then I'm just printing out all of the documents that I get out of that page, out of my website. And as you can see, here are all of the documents. I'm not going to go through all the details, but you can see "Building Machine Learning Systems That Don't Suck"; that's how the first document starts. And if I go all the way to the end (I don't know if it's going to print it out here, we'll see), it ends on "we'll use this time to"; that's the final sentence here. Let's look at the second document now: it starts with "use this time to discuss the first principles behind building". See how there is an overlap there? That's what's happening here; that's my text splitter asking for a 20-character overlap. So this is awesome; this is working. Now I have a list of, I don't know how many, let me try len(documents) here. Let's see: I have 10 different documents from my website. I got 10 different documents, which is awesome.
09:57

What we need to do right now is load all of those documents into a database, and that database is going to help us find the individual documents that are the most relevant to answer any question. I already talked about this in that other video, but this database is going to be a vector store, and for this particular example I'm just going to be using a vector store that lives in memory. In my other video I also used Pinecone, which is an actual vector store, but here this is fine: I'm going to store all of these documents just in memory. Now, there is something very important about a vector store database: when I store the documents, I'm also going to be generating embeddings for each one of them. So what is an embedding? I'm not going to go too deep into this, but an embedding is basically an identifier for the document, a semantic identifier; it's like coordinates in space, and depending on what the document talks about, we're going to generate different coordinates for it. So imagine the document talks about cars or automobiles: the location is going to be over here. But if we talk about boats, maybe the embedding is going to point over there. Anything related to cars goes this way; anything related to boats goes that way. The reason this is important is that later on, if we want to answer a question about cars, about an Audi or a Tesla, we can find documents in this section here, in the location where all of the automobile documents are stored. That's what embeddings are going to give us. So in order to load this data into a vector store, we need to specify a class that's going to take care of generating those embeddings, those locations in space. In this case I'm using the OpenAIEmbeddings class, as you can see here. So when I create this DocArrayInMemorySearch (again, this is just a vector store that keeps everything in memory, to keep it simple), I say: hey, create this database from the list of documents I have (this is my list of documents, right here), and use this OpenAIEmbeddings class to generate the embeddings you need to store those documents. That's what's happening on this line. After I run it, I'm going to have my database with all of my documents inside, and for each one of those documents I'm going to have embeddings generated. So that's great: that means that if I have a query, I can find all of the documents that are similar to that query. If I'm asking about BMWs, how fast can they go, the documents that come back from the database are all related to cars and BMWs and Audis and that type of stuff. After doing this, I have my vector store.
13:14

I'm going to create a knowledge base, and this is key; here we start entering the area of how to test my system. So I have all of these documents, and I'm going to be building a RAG application that answers questions using them. How do I test that? The steps we're going to go through here will make sense in a second: we're going to automatically generate a bunch of test cases. You can do that manually, but that's a lot of work. Here's the thing: if you're trying to test a classification system, something that classifies, let's say, patients into sick or healthy, a classification system is pretty simple to test, because you know the ground truth. You know whether a patient is sick or not, and you look at the response from the system: if the response matches the ground truth, the system got it right; if not, it's wrong. That's it; it's just comparing the output with the correct label. The problem with a RAG system, when you're using it for text generation, is that it's really subjective; it's really hard to compare. Like, if I say: here you have a document, and I want my RAG system to summarize that document; for example, I want a summarization of this page here. How do you know if that summary truly reflects what the page says? It's much less clear how you can test these systems. So that's the challenge we have here.
15:09

So to start, we need to generate a bunch of test cases. How do we test these? Well, let's just generate a number of test cases, and before we do that, I'm going to create what we call a knowledge base. The knowledge base is just going to contain all of the documents we have; that's the knowledge we have, all of the documents I just stored in the database. In order to create that knowledge base, I need a data frame: a pandas DataFrame, just a table structure where I'm going to organize all of those documents. You can see here I'm just creating that DataFrame off of the documents we loaded into the vector store; nothing fancy. I'm putting them in a column called "text", because that's the input I'm going to need to create my knowledge base. I'm printing out the 10 documents that I have; they go from zero to 9, and you can see the content is just right there in that column. So this is awesome; here is where we start. I'm going to be using Giskard, which is a library that's going to help me evaluate my RAG system. Giskard has a class called KnowledgeBase that's going to wrap all of these documents, and the reason Giskard needs this knowledge base is that Giskard is going to help me generate automatic test cases; we're going to see them in just a second. So I'm going to wrap my DataFrame in this KnowledgeBase class. After doing this I have my knowledge base, and I'm going to use it to do everything else from here on out.
16:49

All right, so let's generate test cases; this is key. By the way, if you wanted to create your own test cases, you can do that. What is a test case? Well, a test case is going to be a question, a sample question; it's going to have what the answer should look like; and it's going to contain the document where the system should find that answer. So if I ask: what is the price of the class? The sample answer should say: well, the class costs $450, and it should come together with the document, let's say document seven, where the price is specified. We could do that manually, but that would be a ton of work; generating sample test cases is a ton of work, as you may imagine. So what Giskard is going to do for us, behind the scenes, is use that OpenAI environment variable I told you at the beginning was important to set, connect to GPT-4, and use GPT-4, obviously with the specific prompts they use, to automatically generate test cases for my knowledge base. Here is how that looks in code: I'm using the generate_testset function from Giskard. I'm going to pass the knowledge base (hey, all of the content we have right now, the content we're using to power our RAG system: that's the knowledge base). I'm going to specify how many test cases I want, 60 in this case; that's the number of questions I want to generate automatically. Just go at it: if you want 100, set 100; 120, doesn't matter; you can generate many, many test cases. The longer your knowledge base, the more content you have, the more test cases you can generate, obviously. And then I'm going to specify a description for the agent, which is going to help in the generation of test cases. After I run this, it's going to take a minute to finish. Remember, this is connecting to GPT-4 using your API key; you have to understand that it's going to be using your API key to connect to GPT-4 to generate all of these test cases.
19:22

After doing this, I'm printing them out here just so you see them, and we're going to open the file; I'm also saving this to a file. But just so you see them here in the notebook, I'm printing out three of these questions, three of these test cases. As you can see, I'm printing the question number with what the question is. The first question was: "What does the Machine Learning Systems course offer?" That was the first automatically generated question. Then I'm printing out the reference answer, what GPT-4 thinks a good answer would be: the Machine Learning Systems course offers 18 hours of live interactive sessions, it is practical and hands-on, and so on. And I'm also printing out the reference context; in other words, which document or documents answer this question. For this particular one it's document zero, so the first document should answer this question. Now look at the second question: "Who is the instructor of the machine learning program?" That is a test case GPT-4 came up with; Giskard asked GPT-4 to come up with these questions, and that is one of the test cases I can use to test my system. Reference answer: the instructor of the program is Santiago; that's me. Reference context, where is this answer coming from: it says there are two documents that can be used to answer this question, document five and document nine. And then it goes on and on. Here in this cell I'm just saving the test set to a JSON Lines file, so let's open that test set, and you can see all of the questions that were generated automatically by Giskard. This is great: these are my test cases; now I can use them to test my system. This is a very valuable step that we don't have to go through manually, which would take a ton of time. And think about it; let me find a question... look at this: "Hello, I'm considering enrolling in the Machine Learning School program..." This is simulating a user asking my system a question, which is great; that's exactly the type of test case I need. So you get the question, you get the reference answer (what the correct answer should be), and you get the context; there we go, you get the context. Again, that's just the list of documents containing the answer, the list of documents the system should use to answer that question. This is awesome: I have 60 test cases.
22:17

Now, the next step is to run those test cases, to actually validate my system. But I need to build the system first, because I don't have a system right now. Let's just prepare the prompt. This is going to be my simple chain, my simple RAG system, and it's going to work like this: I'm going to grab a question from the user, hopefully find the context in my database, in my vector store, put them together, and ask the model to answer that question; and if the model cannot answer the question, it can say "I don't know". That's what my prompt is. This is a very simple prompt for building a RAG system; by the way, if you really want to build a RAG system for something serious, there are much better prompts that help the model answer better. This is a very, very simple one. I'm just creating a prompt template; this is a class from LangChain that allows me to parameterize a prompt. You can see I have two variables here: the context variable and the question variable. Whenever I execute this prompt, or use it as part of a bigger RAG system, I'm going to have to pass values for those two variables, the context and the question. So I'm creating this PromptTemplate from the text I just put in here, and I'm printing out what the template looks like after I format it with the two variables. You can see I'm passing the variable context ("Here is some context") and the variable question ("Here is a question"), and this is what I get: "Answer the question based on the context below. If you can't answer the question, reply 'I don't know'. Context: Here is some context. Question: Here is a question." Okay, so that works; that's fine.
24:09

That's cool; let's now create the RAG chain. Of course, I'm not spending a ton of time here: there are a ton of ideas and steps you have to go through in order to come up with this RAG system, and I'm not going to go through all of them right now, because I'm assuming you only care about evaluating these RAG systems. Again, the video I linked before on my channel goes through all of those ideas to get you here. So let me try to explain what's happening in this RAG chain. First of all, I'm going to be creating the model; like I told you before, I'm using the GPT-3.5 model, and that's the model that's going to be answering the questions from my knowledge base. So I'm just initializing my model here with the ChatOpenAI class from LangChain, passing the API key and the name of the model, which is gpt-3.5-turbo. I could be using GPT-4 here as well if I wanted to; that would actually be a very interesting test, because what's going to happen at the end of this is that I'm going to have GPT-3.5 answering questions, and those answers will be evaluated by GPT-4, because GPT-4 was the one generating the test cases in the first place. That's just the way it happens here.
25:37

All right, so I'm going to create my chain. A chain is what the name says: a string of components where the output of one component becomes the input of the next component in the chain. That's how you build things in LangChain, and that's one of the reasons I like it a lot. I'm going to start with the first component in my chain, which is this map, or dictionary, that you see. Notice that there are two keys in this dictionary: the first key is context, and the second key is question. The reason I have this map here is that the second component is the prompt we created; let me scroll up to that prompt. Remember that the prompt requires two variables, so the input to that prompt is two variables, context and question; or rather, it's a map with those two variables inside. Because of that, the first component of this chain is a map, which in turn gets fed into the prompt, the second component. Now let's see where these values are coming from. The first value is the context. Where is the context coming from? Well, obviously it has to come from my vector store. My vector store contains all of the documents; they're stored right there, and some of those documents are going to be the context I need to send to the model to answer a particular question. How do we know which documents? We need to pass a question to the vector store and tell it: give me any documents that are similar to this question. Remember how embeddings work: if I tell the vector store "give me what the price of the course is", the vector store should look through all of those embeddings in space and return any embeddings that are around the location that talks about prices and costs. If such a location exists, any documents that are very similar to that center point get returned back to me, and hopefully those documents will answer the question I asked, which is: how much does it cost?
play28:07

cost the way I I I do that or or or I

play28:11

sort of like accomplish that here in

play28:13

code is by taking the vector store that

play28:16

we created and generating a retriever

play28:19

from that Vector store that retriever uh

play28:22

let's do this let's do this so so maybe

play28:24

maybe this is going to make it a little

play28:26

bit clearer okay so I'm going to create

play28:28

a retriever and I'm going to say hey

play28:30

just the vector store just uh give me a

play28:34

retriever okay and let's see what we can

play28:36

do with that retriever okay so if if you

play28:39

do you probably know this but if you use

play28:41

the function there this is going to

play28:44

return all of the functionality of that

play28:47

retriever okay so look at this what do

play28:49

we get here these are all of the

play28:51

functions that we can call from that

play28:53

retriever here that's a bunch of stuff

play28:55

so let's see the gets um get prompts get

play28:59

relevant documents okay so that sounds

play29:01

like a that sounds cool uh there is

play29:04

invoke as well okay so let's do the get

play29:07

relevant documents let's try this out

play29:10

Let's do let's commment this out here

play29:13

and let's do

play29:16

Retriever get relevant

play29:18

documents uh look at this so what is the

play29:21

machine learning school okay top K1 I

play29:25

don't I'm not going to pass that let's

play29:26

see what happens when I do this

play29:28

did that even work let's go up oh this

play29:32

is awesome okay so when I called get

play29:35

relevant documents on a

play29:38

retriever and I pass a string what's

play29:41

going to happen is exactly what you're

play29:42

imagining right now the retriever will

play29:45

return the top four documents in this

play29:48

case the top four documents that are

play29:51

related to that question the top four of

play29:54

them are going to come back and that is

play29:56

exactly what we need to ACC accomplish

play29:58

here as part of the Lang chain chain in

play30:01

this case we are using the as retriever

play30:04

here but we could be using just a

play30:06

retriever it doesn't matter just the

play30:07

retriever variable that we use here and

play30:10

we are passing the question and this

play30:13

item getter I'm going to let uh you

play30:15

figure that out but the item getter is

play30:18

just a function from the operator

play30:20

package and the item getter is just

play30:22

basically going to grab the question out

play30:25

of the function that you apply this two

play30:27

so in other words or in English uh

play30:30

what's going to happen is that I'm going

play30:31

to when I invoke that chain I'm going to

play30:34

be invoking that chain you can see it

play30:37

here I'm going to be invoking that chain

play30:40

with a variable called question right or

play30:43

with an attribute it's GNA I'm going to

play30:45

pass a dictionary with an attribute

play30:47

inside that's called question this item

play30:50

getter is going to grab the value of

play30:52

that question and it's going to pass the

play30:54

value of that question to the vector

play30:57

store retriever that we created right

play31:00

here okay just to make it clear let's

play31:03

just

play31:04

do retriever if I can spell retriever

play31:08

here okay so it's going to pass that

play31:10

question to the Retriever and we already

play31:12

know that what's what this is going to

play31:14

do is return the relevant documents that

play31:18

is what's going to happen so now the

play31:21

context the context here will have a m

play31:26

or a list of relev documents so this

play31:29

same list that you see here that is the

play31:32

list that we're are going to be passing

play31:34

to that context variable okay the second

play31:36

one is pretty straightforward I'm saying

play31:39

I also need to an attribute called

play31:42

question let's just put the same value

play31:46

that we invoked this chain with okay so

play31:48

the same value of question here is going

play31:50

to just go here and that is my first

play31:53

component of the chain unfortunately the

play31:55

most complicated one to understand

play31:57

because everything else is going to be

play31:59

pretty simple the next component of the

play32:01

chain is a prompt which is just the

play32:03

prompt that we defined before we're

play32:05

going to be injecting the context and

play32:07

the question to that prompt the output

play32:10

of that prompt which is a well-formatted

play32:12

prompt is going to go into our model so

play32:14

now we're going to be invoking our model

play32:16

our GPT 3.5 it always takes me a second

play32:20

to say GPT 3.5 always takes me a second

play32:23

to think about that so we're going to

play32:25

take that prompt invoke the model with

play32:28

prompt the model is going to return an

play32:30

answer back now in this particular case

play32:33

I'm not going to get into too many

play32:34

details but in this particular case that

play32:37

model actually we can we can uh we can

play32:40

look at here in action I'm going to add

play32:42

another line I'm going to say model.

play32:44

invoke let's just invoke the model uh

play32:47

with tell

play32:48

me tell me a joke okay I'm going to

play32:51

invoke that model with oh actually I

play32:54

cannot do that

play32:56

here I'm going to do it

play32:59

here oh of course not because I have not

play33:02

executed this okay so that's that's

play33:04

awesome let let me Jo ex let me just

play33:07

execute this and let me say model tell

play33:09

me a joke and in this case notice that

play33:13

yeah I'm getting a joke back from GPT

play33:16

3.5 this is my joke it's a bad joke

play33:19

obviously notice that this is not like

play33:22

clean string text it comes like wrapped

play33:25

into an AI message and the reason is

play33:28

because this is a shat model so it's

play33:29

supposed to have system message and

play33:31

human message and in this case this is

play33:33

an AI message so it's a message coming

play33:35

from AI I don't want that I want clean

play33:38

strings clean strings so I'm going to be

play33:41

passing a parser which is just a string

play33:44

output parser which is going to make

play33:46

this go away so the output of a chain is

play33:51

actually going to be a string all right

play33:52

so let me remove this that explains why

play33:56

you see prompt then Model D string

play33:58

output parser just to clean that class

play34:00

out and get clean beautiful

play34:03

strings and then here is just the test

play34:06

of the tests where I invoke my chain

play34:08

just to make sure it works I'm saying

play34:11

okay invoke my

play34:12

chain and pass a question I need to pass

play34:16

a question what is the machine learning

play34:18

school and look at the answer it's

play34:19

beautiful the answer is just a string

play34:22

just to make sure I did not lie I need

play34:25

you to trust me when I invoke these

play34:27

chain without the parser look what's

play34:29

going to happen see AI message horrible

play34:34

we don't want that so let me reexecute

play34:38

this beautiful just String Clean that is

play34:41

what we need all right we have our

play34:44

All right. We have our knowledge base, we created test cases, and we have a chain, a RAG system. Now we need to test that RAG system: how good is it? That is what's going to happen right now.

To do this evaluation we're still going to use Giskard, because Giskard takes care of running every single test case through my chain, taking the answer, and evaluating whether it is a good answer or not. That is the tricky part. Remember, this is not a classification model: you cannot just compare strings and say this string is exactly like that string. You have to use a model to look at two answers and judge that they are hitting the same points, that both of them answer the same question. That is what Giskard does behind the scenes.

So how do I use this? They have a function called evaluate. Very simple. That function requires a test set, which we already created; the knowledge base, the original data the answers should come from; and a function that calls the model. In this case I'm calling it the answer function, but you can call it whatever you want. This function is very simple: it receives a question and an optional history, in case you want to enable history for your chat application. I'm not enabling it here, just to keep things simple. The goal of that function is just to answer the question, and all I'm doing is invoking the chain with that question. Within the evaluate function, Giskard repeatedly calls my function, passing the different questions it needs to evaluate. So that's pretty cool: you call this evaluate function, and it gives me back a report.
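A minimal sketch of that call, assuming `testset` and `knowledge_base` are the objects created earlier; the names follow Giskard's RAG toolkit, but treat the details as an approximation rather than the exact notebook code.

```python
from giskard.rag import evaluate

def answer_fn(question: str, history=None) -> str:
    # History is ignored: this chain keeps no conversational state.
    return chain.invoke({"question": question})

# Giskard calls answer_fn once per test-case question, then grades
# each answer against the reference answer generated earlier.
report = evaluate(
    answer_fn,
    testset=testset,
    knowledge_base=knowledge_base,
)
```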

When you run this, it takes a moment, and remember it is using GPT-4 behind the scenes. Without looking at the source code, I imagine what's happening is this: it goes through all of the test cases, grabs the first one, sends that question to my chain, my RAG system, grabs the answer, and then uses GPT-4 to compare the answer from my chain with the reference answer we generated before. If the two are similar, if they look correct, I get a point; if they don't, I get no points. At the end we can determine how accurate my system is: how many questions did it get right? That is what it should be doing behind the scenes.

So what is in that report? If you're working in a notebook, you can just display the report and see what it looks like; you can also open it as a web page. I'm displaying the report here, but I'm going to go to the web page because it looks a little bit better.
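Both options are one-liners; `to_html` matches Giskard's documented report API, though the file name here is my own choice.

```python
# In a notebook, evaluating `report` as the last expression in a cell
# renders it inline.
report

# Or save a standalone HTML page and open it in the browser.
report.to_html("rag_evaluation_report.html")
```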

I opened the report after running it, and here's what you get. First, there is a UMAP representation of my knowledge base; the more questions you have and the larger the knowledge base is, the more interesting this view gets. It shows you where the incorrect answers are located. If my knowledge base were bigger, this would tell me which areas of it are not well covered or where I'm having problems; remember, I only have 10 documents here, which is why there are so few points. You also get a component analysis, which we'll talk about in a second: it gives a score to every single component of a RAG system. There are some recommendations and a correctness-by-topic breakdown; I have only one topic on my website, but this can get really complex with a larger knowledge base. And the overall correctness score is 73.33%. That's how good my system is right now.

Okay, so let's talk about the component analysis. Let me go back so I can show you. There is an individual score for each component of a RAG system. The first one is the generator; if you hover your mouse over it, you can see what that component is about. In this case it's the large language model we used in the chain to generate the answers. The way Giskard evaluates the system is that, depending on what each test case looks like, it tries to score each of these components separately. Now, my simple chain has no rewriter and no routing. A rewriter would be a component in your chain that rewrites the question: when the user asks something that doesn't look quite right, it rephrases the question so it becomes easier to answer, more relevant. I don't have one, so unsurprisingly I'm not doing great on the questions that should be rewritten. The retriever is just fetching the most relevant documents, so I should work a little more on the embeddings, on how the similarity gets computed, and on how I get the relevant documents. This breakdown is great because it tells you exactly what you should be focusing on.

Let's go down a little bit. Here are my recommendations. You can save that report to HTML; that's the HTML document I showed you. You can also compute the correctness broken down by question type.
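A sketch of those helpers; the method names follow Giskard's RAG toolkit documentation, so verify them against your installed version.

```python
# Break the overall score down by the type of generated question.
report.correctness_by_question_type()

# With a larger knowledge base, Giskard can also detect topics
# and score each one separately.
report.correctness_by_topic()
```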

Here's what you get: complex questions, 90% correct; conversational questions, 50% correct. This makes sense, because I did not include history in my chat. Remember that the ChatOpenAI model supports a conversation, it supports keeping context; I did not use that, and I'm sure that by using it I could improve the conversational aspect of my RAG system. Distracting elements: only 50%, so questions generated with distracting elements did not score well, and I'll have to do better there. Double questions, simple questions, situational questions: 100%. This is gold, because it tells me how my system is doing and where I should focus to fix it. By the way, there are no topics here, but Giskard has the ability to automatically generate topics based on your documents. With a bigger knowledge base it can recognize different topics and give you scores on each of them, so you know that, say, anything related to price the LLM handles great, while some other topic it does not.

I can also get the failures. If you want to know exactly which questions the system failed, I can get the list of failures, save them, do whatever I want with them. Let me show you; it's simple. You cannot read the whole question here, but it's "What does the Machine Learning Systems course..." and so on, then the reference answer and the conversation history. Look at that conversation history: some of the questions come with a conversation attached. I'm not supporting that right now, so I'm not surprised the system is not doing great on those.
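Pulling the failing cases out might look like this; I'm assuming the accessor is called `get_failures` and returns a dataframe, as the Giskard docs suggest.

```python
# Failing test cases only: question, reference answer, agent answer,
# and the conversation history for conversational questions.
failures = report.get_failures()
failures.head()
```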

All of that is awesome. If you stop the video right now, you already have a ton of value just from doing this. But there is more. This is great for running an evaluation of your system one time: do it once, see how it's doing, great. I want to actually automate it. I want to do this every time I push a change, or every time I'm ready to make a deployment: just run a test suite with all of my test cases. And beyond the test cases that were auto-generated, you can add your own as well; you can fix them, you can do whatever you want with them. The key is that I want to automate my tests. So how do we do that? Let's take a look. Here I'm just loading the test set from the JSONL file into memory; very simple. Then I create a test suite: it takes the test set and generates a suite. That string is the name of the test suite, which is referenced later, so whenever you run multiple test suites you know exactly which one is which. One line creates a test suite for me.
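That step, sketched with Giskard's RAG classes; the file name and suite name are placeholders for whatever you used when saving the generated test set.

```python
from giskard.rag import QATestset

# Reload the auto-generated test cases from disk.
testset = QATestset.load("test-set.jsonl")

# One line: turn the test set into a named, runnable test suite.
test_suite = testset.to_test_suite("Machine Learning School test suite")
```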

Then I can run that test suite. How? To run it, I'm going to wrap my chain in a Giskard model. This is a class that provides all of the information Giskard needs to run my tests. The Giskard model requires a prediction function, the model that is going to be solving the test suite (we'll see it in a second); the type of model, in this case text generation (you can also use Giskard for classification and that type of task); a name and a description, and always specify these two parameters, because they help the evaluation model make decisions; and the feature name that I care about, which in this case is the question. That's the feature we care about: the question that needs to be answered.

Now look at this prediction function; I call it the batch prediction function. It's very similar to the answer function we created before for the evaluation report, but in this case I answer questions in batches. When running the test suite, Giskard does not go question by question, which would take a long time; it works in batches. What's cool about LangChain is that I can invoke a chain with a batch of inputs, and that's exactly what's happening here: this is my chain, and now I'm passing a batch, an array of questions. It's the same thing as invoke, but batched, so we can send multiple questions to the model at the same time instead of waiting for one answer before sending the next question, which makes this really fast. Very similar to before: I receive a data frame, I go through all of the question values in it, and I pass them along as an array of maps with one attribute called question.
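Here is a sketch of both pieces; `giskard.Model` with these parameters matches the library's documented API, while the name and description strings are placeholders.

```python
import giskard
import pandas as pd

def batch_prediction_fn(df: pd.DataFrame) -> list[str]:
    # Giskard hands over a dataframe of inputs; answer every question
    # in a single batched call instead of one invocation per row.
    return chain.batch([{"question": q} for q in df["question"]])

giskard_model = giskard.Model(
    model=batch_prediction_fn,
    model_type="text_generation",
    name="Machine Learning School RAG agent",
    description="Answers questions about the Machine Learning School program",
    feature_names=["question"],
)
```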

Once I have this model, I take the test suite and call run, passing the Giskard model, and my test suite runs. And look at this: it says it succeeded with 62%. That's the metric I get back when I run the suite, and that is awesome, because now I can automate the whole process. I can also pull the metric out of the results and use it to automate something like: before deploying the model, make sure the test suite passed; if it didn't, don't deploy. That would be the way to automate this. One more thing: notice I'm displaying the results of the test. You can see the test passed, the metric was 61.667%, which counts as a pass, and there's the name of the test suite. All of that good stuff.
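Running the suite and gating a deployment on the outcome could look like this; I'm assuming the result object exposes a `passed` flag, which is what the Giskard docs suggest.

```python
# Run every test case in the suite against the wrapped model.
results = test_suite.run(model=giskard_model)

# A simple deployment gate: only ship when the suite passes.
if not results.passed:
    raise SystemExit("Test suite failed; skipping deployment.")
```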

The final thing I have to show you is how to integrate this with pytest. Why pytest? Because if you're not using pytest, you're not doing it correctly. Pytest, in my opinion, is the best unit testing library there is for Python, so of course I jumped all over this when I saw you could integrate the two. In my example I'm using something I don't see many people use, because people who work in notebooks are usually not thinking about testing their code (they should, but they're not): the ipytest library, which lets me run pytest tests directly from my notebook, and it's great. So I install ipytest (not pytest, ipytest), and then I can use a cell magic, %%ipytest, as you can see here. That cell becomes runnable: executing it runs all of the test cases inside, just as if I were running them with pytest, which is awesome.
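The setup is a couple of lines in a notebook cell; `autoconfig` is ipytest's documented entry point.

```python
# Install and configure ipytest inside the notebook.
%pip install ipytest

import ipytest
ipytest.autoconfig()
```

After this, any cell that starts with the `%%ipytest` cell magic is collected and executed by pytest when the cell runs.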

So look at this code; it's very simple, and I have only one test. There is a fixture that returns the dataset: it loads the test set from the hard drive, my 60 test cases, turns that test set into a dataset, and returns it. Then there is a model fixture that returns the Giskard model I created before; I could also have created it inside the fixture, but I decided to reference the one outside. Finally there is a single test that receives both fixtures, the dataset and the model, and uses a function called test_llm_correctness. There are a bunch of functions inside Giskard that you can use to test different aspects of a system; in this particular case I just care about whether the LLM is correct. I pass the model, I pass the dataset, and, very importantly, I pass a threshold. The threshold indicates how high my results need to be for the test to be declared successful. In this particular case the test passes, because my threshold is under 62%, and 62% is the metric I'm getting. I'm not going to run it on screen, because it takes a while to answer all 60 questions, but trust me: when I run this, it succeeds. If I set that threshold to 70 or 80, these tests are going to fail.
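Put together, the test cell might look like the sketch below. The import path for `test_llm_correctness` and the conversion of the test set into a `giskard.Dataset` are my best reading of the API, so verify both against your Giskard version.

```python
%%ipytest

import pytest
import giskard
from giskard.rag import QATestset
from giskard.testing.tests.llm import test_llm_correctness

@pytest.fixture
def dataset():
    # Load the 60 generated test cases and turn them into a
    # giskard.Dataset; the pandas round-trip is an assumption.
    testset = QATestset.load("test-set.jsonl")
    return giskard.Dataset(testset.to_pandas())

@pytest.fixture
def model():
    # Reuse the wrapped Giskard model created earlier in the notebook.
    return giskard_model

def test_chain_correctness(dataset, model):
    # Passes only while the correctness metric clears the threshold;
    # raising it to 0.7 or 0.8 would make this test fail.
    test_llm_correctness(model=model, dataset=dataset, threshold=0.5).assert_()
```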

With this you can see how to integrate it into your system if you're using pytest. Now you can write unit tests for your LLM application; you can evaluate it automatically, not just by calling John Doe or Mary Black and asking them to try a few questions and see if it works, which is what I've been seeing, and that is bad. With this, you can actually do it automatically. Hopefully this makes sense and helps you. If you got all the way to the end, please like this video; it helps me understand whether this type of content is useful for you. I will see you in the next one. Bye-bye.
