How to evaluate an LLM-powered RAG application automatically.
Summary
TL;DR: The video outlines a method for testing and evaluating a RAG (Retrieval-Augmented Generation) application, specifically one that utilizes a large language model (LLM) like GPT-3.5 or GPT-4. It emphasizes the importance of robust testing to ensure the reliability of the system's outputs. The speaker introduces a process involving the creation of a knowledge base from a set of documents, the generation of test cases using GPT-4, and the use of an open-source library called Giskard for evaluation. The video details the technical steps, including setting up a vector store database, using LangChain for the RAG system, and automating the testing process with Giskard and pytest. The goal is to provide a systematic way to assess and improve the application's performance.
Takeaways
- The importance of testing a RAG application is emphasized, highlighting the need for a systematic approach to evaluate and ensure the quality of results from a language model.
- The speaker introduces a method for creating test cases and evaluating different models, such as GPT-4 and open-source alternatives, in a structured and automated manner.
- Open-source tools are recommended for implementing robust testing in RAG applications, with the speaker sharing their code and providing links for viewers to access and use these tools.
- The process of scraping a website to gather information for a RAG system is demonstrated, using tools like LangChain and Beautiful Soup for Python.
- A detailed example is given using the speaker's own website, which contains a wealth of information about a machine learning systems course, to illustrate how a RAG system can extract and utilize data.
- The concept of embeddings and vector stores is explained, showing how they can be used to semantically identify and retrieve relevant documents for answering user queries.
- The use of GPT-4 for automatically generating test cases is highlighted, showcasing the ability to create relevant questions and context for evaluating a RAG system's performance.
- The speaker outlines the construction of a simple RAG system using LangChain, explaining each component's role in retrieving and formatting answers to user questions.
- Giskard is introduced as a tool for evaluating the RAG system, producing a report that includes a map representation of the knowledge base, a component analysis, and an overall correctness score.
- Recommendations for improving the system are provided based on the evaluation, with insights into which areas need refinement and how to focus efforts for enhancement.
- The automation of testing is discussed, with the speaker demonstrating how to create and run a test suite, and suggesting integrating this process with pytest for continuous evaluation.
Q & A
How can one evaluate a RAG application effectively?
-An effective evaluation of a RAG application involves creating automated test cases, using open-source tools for robust testing, and systematically comparing different models to ensure the results are accurate and reliable.
What is the main challenge in testing a text generation system like a RAG application?
-The main challenge in testing a text generation system is the subjectivity of the output: it is difficult to compare the generated text against a fixed ground truth, unlike in classification tasks where the correct label is clear.
How does the speaker propose to automate the generation of test cases for a RAG system?
-The speaker proposes using an OpenAI API key to connect to GPT-4, which automatically generates test cases for the knowledge base by prompting the model with specific inputs and configurations.
What is a vector store database and why is it used in the context of a RAG system?
-A vector store database stores semantic identifiers, or embeddings, for documents. It is used in a RAG system to efficiently find relevant documents based on their content, which helps in answering user queries accurately.
What is the role of the 'retriever' in the RAG system?
-The retriever is responsible for finding the most relevant documents in the vector store database for the user's question. It uses the question to identify and return documents that are semantically similar.
How does the speaker plan to improve the conversational aspect of the RAG system?
-The speaker plans to improve the conversational aspect of the RAG system by implementing a history feature, which keeps context during the conversation, allowing for more accurate and relevant responses to follow-up questions.
What is the purpose of the 'knowledge base' in the RAG system?
-The knowledge base is a collection of all the documents the system has access to. It is used to generate test cases and to provide the context the model needs to answer questions accurately.
How does the speaker integrate the testing process with 'pytest'?
-The speaker integrates the testing process with pytest by using the 'ipytest' library, which allows running pytest tests directly from a notebook. This enables the automation of the testing process and ensures that the system is thoroughly evaluated before deployment.
What is the significance of the 'GPT-3.5' model in the evaluation process?
-GPT-3.5 is the main text generation model in the RAG system, chosen because it is cheaper. The test cases are generated by GPT-4, which also judges the answers the system produces, so a stronger model evaluates a weaker one.
What is the overall correctness score of the speaker's RAG system?
-The overall correctness score of the speaker's RAG system is 73.33%, derived from the evaluation process that runs the automatically generated test cases through the system.
Outlines
Introduction to Testing a RAG Application
The paragraph discusses the challenge of testing a RAG application, specifically the lack of knowledge on how to structure a system so that its LLM-generated results can be evaluated and trusted. The speaker aims to address this by presenting the code of a simple RAG system and a method to evaluate and test it continuously. The goal is to establish an automated way to compare different models, such as GPT-4 and open-source alternatives, in a systematic, non-manual approach.
Scraping a Website for Question Answering
This section details the process of using the LangChain library to scrape a website for information that will be used to answer user questions. The speaker explains the use of a text splitter and a web-based loader to gather content, emphasizing the importance of splitting content into manageable chunks due to context size limitations in models like GPT-3.5. The process of creating a vector store database and generating embeddings for semantic identification of documents is also discussed.
Generating Test Cases for the RAG System
The paragraph describes the complexity of testing a RAG system for text generation tasks due to the subjective nature of the output. The speaker introduces the concept of automatically generating test cases using the GPT-4 model and the Giskard library. The process involves creating a knowledge base from the scraped documents and using it to generate a set of test cases with corresponding questions, reference answers, and context documents. This automated approach saves significant time and effort compared to manual test case creation.
Building the RAG System and Prompt Template
The speaker outlines the process of building a simple RAG system that uses the scraped and embedded documents to answer user questions. A prompt template is created to structure the input for the GPT-3.5 model, including variables for context and question. The speaker also discusses the creation of a chain in the LangChain library, which involves components like a map, the prompt, and the model invocation. The focus is on preparing the system for validation with the generated test cases.
Integrating and Testing the RAG System
In this part, the speaker explains how to integrate the components of the RAG system, including the vector store retriever and the GPT-3.5 model, to answer questions. The process involves passing the question and context to the model through the chain components. The speaker also discusses the use of a parser to clean the model's output and the role of the itemgetter function. The paragraph concludes with a test of the chain to ensure it works correctly and returns clean, formatted strings.
Evaluating the RAG System's Performance
The speaker presents the evaluation process of the RAG system using the Giskard library. The evaluation involves running the test cases through the system and comparing the generated answers with the reference answers. The results are analyzed in terms of correctness, component performance, and recommendations for improvement. The speaker highlights that Giskard uses GPT-4, the same model that generated the test cases, to judge the system's answers, emphasizing the importance of this step in refining the system.
Automating Tests and Integrating with pytest
The final paragraph discusses the automation of the testing process and integration with pytest, a popular Python testing library. The speaker demonstrates how to create a test suite with Giskard and run it against the LangChain chain. The automation allows for repeated testing without manual intervention, which is crucial for continuous improvement and deployment readiness. The speaker also shows how to integrate these tests with pytest, enabling tests to run directly from a notebook and ensuring the model passes before deployment.
Keywords
RAG Application
LLM (Large Language Model)
Test Cases
GPT-4
Open-source Model
Continuous Testing
Giskard
Vector Store
Knowledge Base
Embeddings
Highlights
The speaker discusses the challenges of testing a large language model (LLM) based system, emphasizing the need for robust testing methodologies.
The code of a simple RAG system is presented to demonstrate potential testing approaches for LLM applications.
The importance of creating test cases that can continuously evaluate the system is stressed, to ensure the reliability of the LLM's outputs.
The speaker introduces the concept of using an automated approach to compare different models, such as GPT-4 and open-source models, in a systematic manner.
The use of open-source tools and libraries, like Giskard and LangChain, is advocated for implementing robust testing of RAG applications.
The process of scraping a website to gather information for the LLM to use is explained, along with the necessity of splitting content for effective context management.
The significance of using a vector store database to generate embeddings for semantic identification of documents is highlighted.
The concept of automatically generating test cases using GPT-4 is introduced, showcasing a potential method for evaluating text generation systems.
The speaker presents a method for building a knowledge base from the scraped documents, which will be used to test the system's ability to answer questions.
An overview of how to structure a prompt for the LLM to answer questions using the knowledge base is provided.
The process of creating a chain in LangChain to integrate the prompt, retriever, and model for answering questions is detailed.
The evaluation of the system is performed using Giskard, which measures the accuracy of the LLM's responses compared to reference answers.
Component analysis is used to identify strengths and weaknesses in the system, providing targeted areas for improvement.
The concept of creating a test suite for automated testing is introduced, allowing for repeated evaluation of the system with different test cases.
The integration of the testing process with the pytest library is discussed, enabling direct testing from a notebook environment.
The speaker emphasizes the importance of automating the testing process to ensure system reliability before deployment.
The video concludes with a call to action for viewers to engage with the content and provide feedback on the presented testing methodologies.
Transcripts
How can you test a RAG application? This is a question that, unfortunately, not a lot of people are trying to answer right now. They build this huge system that's supposed to trust the results of an LLM, and they have no clue how they should structure that system so they can actually test it, so they can actually evaluate the system to ensure the results are good. That's the question I want to answer today. I'm going to show you the code of a simple RAG system, and I'm going to show you one way you can think about, and implement, an evaluation for that system: one way you can create test cases that you can use to test your system continuously. Even better, I'm going to show you a way to evaluate different models doing the same work. Imagine you built this RAG application: I want you to have an automated way to test whether GPT-4 is better than an open-source model, and to do that systematically, in a way that does not involve you trying different things by hand. Because so far, what I've seen is that most people just do the entire integration, keep a couple of pet examples, try those examples, and that's it; that's the extent of their testing. So hopefully, by the end of this video, you'll have a better approach, and you'll have the tools, all of them open source, that you can use to implement robust testing for your RAG application.

Before I keep going: if you like this type of content, give me a like below; that tells the algorithm that I should keep making this type of video. All of the code I'm showing you is going to be linked down below, so you can follow along, install it on your computer, and use it. This is a notebook; I'm going to do everything in a notebook, and it's a very simple one.
The first thing you see in the first cell is just loading the environment variables into the notebook so I have access to them. I'm creating this OpenAI API key and reading it from an environment variable. I created that environment variable off camera; it comes from a .env file that I'm not going to show you, because my key is in there, but you are going to need to set it. I'm going to be using Giskard, an open-source library that's going to help me evaluate my RAG application, and Giskard uses that OpenAI API key environment variable to do its job, so make sure you set it. For my RAG application I'm going to be using GPT-3.5 because it's cheaper; you can change this to an open-source model if you want, or you can just use GPT-4, it doesn't really matter. That's what this variable is for: later on, when I create my model, I'm going to use this variable to select GPT-3.5.
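Here's a minimal sketch of that setup cell, assuming a .env file holds the key; the variable names are illustrative, not copied from the video:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file that holds the key

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]  # Giskard reads this variable too
MODEL = "gpt-3.5-turbo"  # cheaper than GPT-4; swap in another model if you like
```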
All right, let's get into what really matters. My RAG application is going to answer questions from a website, or rather, it's going to answer any question using the information from a website. I teach a class called Building Machine Learning Systems That Don't Suck, and I have a website with a ton of information on it: testimonials, information about the program, different characteristics like how many hours it takes to finish and how many assignments there are, who the program is for, the stuff you will learn, the syllabus, how much the program costs. Again, just a ton of information about the program. What I want to do is build a RAG system by scraping this website: I'm going to gather all of the information on it, store it, and then answer any questions from the user using this content. That's the setup for this app.
To scrape the website, and to build the whole RAG application, I'm going to be using LangChain. You don't have to use LangChain; you could use LlamaIndex if you wanted to, and that's fine; LangChain is just the one I prefer. In this cell you can see how easy it is: here is the URL of my website, ml.school, and here is what's happening. I'm importing a couple of libraries and creating a text splitter. The splitter is just a class that tells LangChain how I want to split the content I'm scraping off the website. It's a ton of content; let's say I'm going to scrape ten pages of it. This splitter is a RecursiveCharacterTextSplitter, and it tells LangChain that I want chunks no longer than 1,000 characters, with an overlap of 20 characters between them. What happens is that the splitter goes through all the content and grabs the first 1,000 characters; those become one chunk. Then it moves to the second 1,000 characters with a 20-character overlap, meaning it takes the last 20 characters of the first chunk, starts there, grabs another 1,000 characters, and keeps doing that on and on.

Now, why do I need to split the content at all? Because my RAG system requires sending context to the model. I'm going to tell the model: hey, answer this user question using the following context. I do not want to send the entire website as the context, because I would probably blow past the context size; there is a limited number of characters I can send. By splitting my website into smaller chunks, I have a way to send only a few of them at a time to answer any question. That's important whenever you're using a model that constrains how much context you can send. I recorded a video, it's on my channel, that goes into a lot of detail about how the context size works, how these models treat it, and how this splitting and the RecursiveCharacterTextSplitter work; it's going to be linked somewhere here, and if not, you can find it on my channel.

So I'm defining my splitter, and now I'm going to use a WebBaseLoader. A WebBaseLoader is just a class that, behind the scenes, uses Beautiful Soup to go to that URL and scrape all of its content. It's very simple: I set up the loader with the URL, then call the load_and_split function, passing the text splitter I just created so the loader knows exactly how I want to split the content. Then I print out all of the documents I get out of my website. You can see the first document starts with "Building machine learning systems that don't suck", and if I go all the way to the end, it finishes with "we'll use this time to". Now look at the second document: it starts with "use this time to discuss the first principles behind building". See how there's an overlap there? That's my text splitter applying the 20-character overlap. This is working. Now, how many documents do I have? Let me try len(documents) here: I have 10 different documents from my website, which is awesome.
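Here's a sketch of that scraping cell as described: a 1,000-character splitter with a 20-character overlap, plus the web loader. Import paths vary across LangChain versions, and the exact URL is assumed:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader

# Chunks of at most 1,000 characters, with 20 characters shared between neighbors.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)

loader = WebBaseLoader("https://www.ml.school/")  # uses Beautiful Soup underneath
documents = loader.load_and_split(text_splitter)

print(len(documents))                   # 10 documents in the video
print(documents[0].page_content[:100])  # first chunk of the page content
```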
What we need to do right now is load all of those documents into a database, and that database is going to help us find the individual documents that are most relevant to answer any question. I talked about this in that other video, but this database is going to be a vector store, and for this particular example I'm just going to use a vector store that lives in memory. In my other video I also used Pinecone, which is a full-fledged vector store, but here in-memory is fine; that's where I'm going to store all of these documents.

There is something very important about a vector store database: when I store the documents, I'm also going to generate embeddings for each one of them. What is an embedding? I'm not going to go too deep into this, but an embedding is basically an identifier for a document, a semantic identifier; it's like coordinates in space. Depending on what the document talks about, we generate different coordinates for it. Imagine a document that talks about cars and automobiles: its location is going to be over here. But if a document talks about boats, maybe its embedding points over there. Anything related to cars goes one way; anything related to boats goes another. The reason this matters is that later on, if we want to answer a question about cars, about an Audi or a Tesla, we can find documents in that section of the space, in the location where all of the automobile documents are stored. That's what embeddings give us.

In order to load this data into a vector store, we need to specify a class that takes care of generating those embeddings, those locations in space; in this case I'm using the OpenAIEmbeddings class, as you can see here. When I create this DocArrayInMemorySearch (again, just a vector store that keeps everything in memory to keep things simple), I say: hey, create this database from the list of documents I have, and use this OpenAIEmbeddings class to generate the embeddings you need to store them. That's what's happening on this line. After I run it, I have my database with all of my documents inside and embeddings generated for each one. That means that if I have a query, I can find all of the documents that are similar to it: if I'm asking about BMWs and how fast they can go, the documents that come back from the database will all be related to cars, BMWs, Audis, that type of stuff. After doing this, I have my vector store.
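That cell looks roughly like this, assuming the documents list from the previous step; class names follow LangChain, but import paths differ between versions:

```python
from langchain_community.vectorstores import DocArrayInMemorySearch
from langchain_openai import OpenAIEmbeddings

# The embeddings class assigns each chunk its "coordinates in space".
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
vectorstore = DocArrayInMemorySearch.from_documents(documents, embeddings)

# Sanity check: semantically related chunks should come back first.
vectorstore.similarity_search("How much does the program cost?", k=2)
```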
Next, I'm going to create a knowledge base. This is key, because it's where we start entering the territory of how to test the system. I have all of these documents, and I'm going to build a RAG application that answers questions using them. How do I test that? The steps we're about to go through will make sense in a second: we're going to automatically generate a bunch of test cases. You could do that manually, but it's a lot of work.

Here's the thing: if you're trying to test a classification system, something that classifies, say, patients into sick or healthy, it's pretty simple, because you know the ground truth; you know whether a patient is sick or not. You look at the response from the system, and if it matches the ground truth, the system got it right; if not, it's wrong. That's it; you're just comparing the output with the correct label. The problem with a RAG system used for text generation is that the output is really subjective; it's really hard to compare. If I give you a document and ask my RAG system to summarize it, how do you know whether that summary truly reflects what the page says? It's much less clear how you can test these systems.

So that's the challenge, and to start we need to generate a bunch of test cases. Before we generate them, I'm going to create what we call a knowledge base. The knowledge base is just going to contain all of the documents we have, all the knowledge, the same documents I just stored in the database. In order to create that knowledge base, I need a pandas DataFrame, just a table structure where I organize all of those documents. You can see here I'm creating that DataFrame from the documents we loaded into the vector store; nothing fancy. I'm putting them in a column called "text", because that's the input I need to create my knowledge base. I'm printing out the 10 documents I have, indexed 0 through 9, and you can see the content right there in that column.
Here is where we really start. I'm going to use Giskard, a library that's going to help me evaluate my RAG system. Giskard has a class called KnowledgeBase that wraps all of these documents, and the reason Giskard needs this knowledge base is that it's going to help me generate automatic test cases; we'll see them in just a second. So I wrap my DataFrame in this KnowledgeBase class, and from here on out I'm going to use that knowledge base for everything else.
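In code, the DataFrame and the knowledge base look roughly like this; the "text" column name matches what Giskard's KnowledgeBase expects by default, though the API may differ slightly by version:

```python
import pandas as pd
from giskard.rag import KnowledgeBase

# One row per chunk, all under a "text" column.
df = pd.DataFrame([d.page_content for d in documents], columns=["text"])
knowledge_base = KnowledgeBase(df)
```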
Let's generate test cases; this is key. By the way, if you wanted to create your own test cases, you could do that too. What is a test case? A test case is going to be a sample question, what the answer should look like, and the document or documents where the system should find that answer. So if I ask, "What is the price of the class?", the sample answer should say the class costs $450, and it should come together with the document, say document seven, where the price is specified. We could do that manually, but it would be a ton of work; generating sample test cases by hand takes a ton of effort, as you may imagine. What Giskard does for us behind the scenes is use that OpenAI environment variable I told you to set at the beginning, connect to GPT-4, and use GPT-4, with specific prompts that Giskard maintains, to automatically generate test cases for my knowledge base.

Here is how that looks in code. I'm using the generate_testset function from Giskard. I pass the knowledge base (all of the content we're using to power our RAG system), and I specify how many test cases I want: 60 in this case; that's the number of questions I want to generate automatically. If you want 100, set 100, or 120; it doesn't matter, you can generate many, many test cases. The larger your knowledge base, the more content you have, and the more test cases you can generate. Then I specify a description for the agent, which helps in the generation of test cases. After I run this (and it's going to take a minute to finish), remember that this is connecting to GPT-4 using your API key; you have to understand that it's going to use your API key to connect to GPT-4 and generate all of these test cases.
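The generation call looks roughly like this; parameter names follow Giskard's RAG toolkit, and the agent description string is my paraphrase, not the exact one from the video:

```python
from giskard.rag import generate_testset

# Calls GPT-4 with your OPENAI_API_KEY behind the scenes: it costs money
# and takes a minute or two for 60 questions.
testset = generate_testset(
    knowledge_base,
    num_questions=60,
    agent_description="A chatbot answering questions about the Machine Learning School program",
)
```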
After doing this, I'm printing a few test cases out here just so you can see them, and I'm also saving everything to a file that we'll open in a moment. In the notebook I print three of them: the question number and the question itself, the reference answer, and the reference context. The first automatically generated question was "What does the Machine Learning Systems course offer?". The reference answer is what GPT-4 thinks a good answer would be: "The Machine Learning Systems course offers 18 hours of live, interactive sessions; it is a practical, hands-on..." and so on. The reference context tells you which document or documents answer the question; for this one it's document zero, so the first document should answer it. Now look at the second question: "Who is the instructor of the machine learning program?". That's a test case GPT-4 came up with; Giskard asked GPT-4 to generate these questions, and this is one I can use to test my system. Reference answer: the instructor of the program is Santiago; that's me. Reference context: there are two documents that can be used to answer this question, document five and document nine. And it goes on and on. In this cell I'm just saving the test set to a JSON Lines file.
all of the questions that were generated
automatically by gizar so this is great
these are my test cases now I can use
this to test my system okay so this is
just this is a very valuable step that
we don't have to go through manually
which takes a ton of time and you think
about it let's see uh let me see
question look at this hello I'm
considering enrolling in the machine
learning school program this is
simulating a user asking my system a
question which is great that's exactly
the type of test case that I need so you
get the question you get the reference
answer right what what the correct
answer should
be and you get the context at some point
there we go you get the context again
it's just that the list of documents
containing that answer or the list of
documents that the system should use to
answer that question this is awesome I
have 60 test cases now next step is to
run those test cases next step is to
actually validate my system but I need
to build the system first because I
don't don't have a system right now
Let's prepare the prompt. This is going to be my simple chain, my simple RAG system, and it works like this: I grab a question from the user, hopefully find the context in my database, my vector store, put them together, and ask the model to answer the question. If the model cannot answer it, we say "I don't know". That's what my prompt does. It's a very simple prompt for building a RAG system; by the way, if you really want to build a RAG system for something serious, there are much better prompts that help the model answer well. I'm creating a PromptTemplate, a class from LangChain that allows me to parameterize a prompt. You can see I have two variables here: the context variable and the question variable, and whenever I execute this prompt, or use it as part of a bigger RAG chain, I'll have to pass values for those two variables. I create the prompt template from the text I put in here, and I print what the template looks like after formatting it with the two variables: I pass "here is some context" for context and "here is a question" for question. This is what I get: "Answer the question based on the context below. If you can't answer the question, reply 'I don't know'. Context: here is some context. Question: here is a question." So that works.
that's cool let's now create the rack
chain uh of course I'm not spending a
ton of time there is a ton of um ideas
and steps that we have to go through in
order to come up with this rack system
I'm not going to go through all of them
right now because I'm assuming you only
care about evaluating these rack systems
but again the video that I linked before
in my channel goes through all of that
uh all of those ideas in order for you
to get here so let me try to explain
what's happening here in this rack chain
um first of all I'm going to be creating
the model and like I told you before I'm
using GPT 3.5 model that's the model
that's going to be answering the
questions from my knowledge base
Okay so just initializing my model here
with the chat open AI class from Lan
chain I'm passing the API
key and I'm passing the name of the
model which is GPT 3.5 turbo I could be
using GPT 4 here as well if I wanted to
that will actually be very interesting
test because right now what's going to
happen at the end of this is that I'm
going to have GPT
3.5 answering questions and those
questions will be evaluated by gp4
because gp4 was the one generating the
test cases in the first place that's
just just the way it it happens here all
right so I'm going to create my chain
okay and a chain is what the name says
it's just like a string of components
where the input or the output of one
component will become the input of the
next component in that chain so that's
how you build here in L chain and that's
one of the reasons I like it a lot I'm
going to start with the first component
here in my chain and is this map that
you see or dictionary that you see and
notice that there are two keys on this
dictionary the first key is context and
the second key is question and the
reason I have this map here is because
the second component is the prompt that
we created let me scroll up to that
prompt remember that prompt requires
two variables so the input to that
prompt is two variables context and
question or is it's a map with two
variables inside because of that the
first component of this chain is a map
that again is going to get fed into the
prompt which is the second component now
let's see where these values are coming
from the first value is the context
where is the cont context coming from
well obviously it has to come from my
Vector store my Vector store contains
all of the documents they're stored
right there and some of those documents
are going to be the context that I need
to send to the model to answer a
particular question how do we know which
documents well we need to pass to the
vector store we need to pass a question
and tell the vector store give me any
questions that are simil ilar or give me
any documents that are similar to this
question remember how embeddings work if
I tell the vector store give me what the
price of the course is the vector store
should look through all of those
embeddings in space and return any
embeddings that are around the location
that talks about prices and costs right
so if that such location exists any
documents that are very similar to that
Center Point are going to get uh
returned back to me and hopefully the
those documents will answer the question
that I asked which is how much does it
cost the way I I I do that or or or I
sort of like accomplish that here in
code is by taking the vector store that
we created and generating a retriever
from that Vector store that retriever uh
let's do this let's do this so so maybe
maybe this is going to make it a little
bit clearer okay so I'm going to create
a retriever and I'm going to say hey
just the vector store just uh give me a
retriever okay and let's see what we can
do with that retriever okay so if if you
do you probably know this but if you use
the function there this is going to
return all of the functionality of that
retriever okay so look at this what do
we get here these are all of the
functions that we can call from that
retriever here that's a bunch of stuff
so let's see the gets um get prompts get
relevant documents okay so that sounds
like a that sounds cool uh there is
invoke as well okay so let's do the get
relevant documents let's try this out
Let's do let's commment this out here
and let's do
Retriever get relevant
documents uh look at this so what is the
machine learning school okay top K1 I
don't I'm not going to pass that let's
see what happens when I do this
did that even work let's go up oh this
is awesome okay so when I called get
relevant documents on a
retriever and I pass a string what's
going to happen is exactly what you're
imagining right now the retriever will
return the top four documents in this
case the top four documents that are
related to that question the top four of
them are going to come back and that is
exactly what we need to ACC accomplish
here as part of the Lang chain chain in
this case we are using the as retriever
here but we could be using just a
retriever it doesn't matter just the
retriever variable that we use here and
we are passing the question and this
item getter I'm going to let uh you
figure that out but the item getter is
just a function from the operator
package and the item getter is just
basically going to grab the question out
of the function that you apply this two
so in other words or in English uh
what's going to happen is that I'm going
to when I invoke that chain I'm going to
be invoking that chain you can see it
here I'm going to be invoking that chain
with a variable called question right or
with an attribute it's GNA I'm going to
pass a dictionary with an attribute
inside that's called question this item
getter is going to grab the value of
that question and it's going to pass the
value of that question to the vector
store retriever that we created right
here okay just to make it clear let's
just
do retriever if I can spell retriever
here okay so it's going to pass that
question to the Retriever and we already
know that what's what this is going to
do is return the relevant documents that
is what's going to happen so now the
context the context here will have a m
or a list of relev documents so this
same list that you see here that is the
list that we're are going to be passing
to that context variable okay the second
one is pretty straightforward I'm saying
I also need to an attribute called
question let's just put the same value
that we invoked this chain with okay so
the same value of question here is going
to just go here and that is my first
component of the chain unfortunately the
most complicated one to understand
because everything else is going to be
The next component of the chain is the prompt, which is just the prompt we defined before; we inject the context and the question into it. The output of that prompt, a well-formatted prompt, goes into our model, so now we're invoking our GPT-3.5 (it always takes me a second to say GPT-3.5). We take that prompt, invoke the model with it, and the model returns an answer. Now, in this particular case, we can look at the model in action. I'll add another line and invoke the model with "tell me a joke"... oh, of course that fails, because I haven't executed this cell yet. Let me execute it and try again: model.invoke("tell me a joke"), and I get a joke back from GPT-3.5. It's a bad joke, obviously. Notice that the output is not clean string text: it comes wrapped in an AIMessage, and the reason is that this is a chat model, so it works with system messages, human messages, and, in this case, an AI message, a message coming from the AI. I don't want that; I want clean strings. So I'm going to add a parser, a StrOutputParser, which makes that wrapper go away, so the output of the chain is actually a string. Let me remove this test. That explains why you see prompt, then model, then StrOutputParser: just to clean that class out and get clean, beautiful strings.

And then here is the test of tests, where I invoke my chain just to make sure it works. I invoke the chain and pass a question, "What is the machine learning school?", and look at the answer: it's beautiful, just a string. And just to prove I didn't lie (I need you to trust me), when I invoke the chain without the parser, look what happens: AIMessage, horrible; we don't want that. Let me re-execute it with the parser: beautiful, just a clean string. That is what we need.
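Put together, the chain looks roughly like this in LangChain's expression language; import paths are version-dependent:

```python
from operator import itemgetter

from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model=MODEL)

chain = (
    {
        # itemgetter pulls the question out of the input dictionary;
        # the retriever turns it into a list of relevant documents.
        "context": itemgetter("question") | vectorstore.as_retriever(),
        "question": itemgetter("question"),
    }
    | prompt
    | model
    | StrOutputParser()  # unwrap the AIMessage into a clean string
)

chain.invoke({"question": "What is the Machine Learning School?"})
```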
All right: we have our knowledge base, we created test cases, and we have a chain, a RAG system. Now we need to test that RAG system: how good is it? That is what happens next. To do this evaluation, we're still going to use Giskard, because Giskard takes care of running every single test case through my chain, taking the answer, and evaluating it: is it a good answer or not? That's the tricky part. Remember, this is not a classification model; you cannot just compare strings and say, yes, this string is exactly like that string. You have to use a model to look at two answers and say: yes, I think they're hitting the same points; I think both of these answers are answering the same question. That is what Giskard does behind the scenes.

So how do I use this? Giskard has a function called evaluate; very simple. That function requires the test set, which we already created; the knowledge base, which is the original data (where is the data coming from?); and a function that is going to call the model. In this case I'm calling it the answer function, but you can call it whatever you want. The function is very simple: it receives a question and an optional history, in case you want to enable conversation history for your chat application; I'm not enabling it here, just to keep things simple. The goal of that function is to answer the question, and internally all I do is invoke the chain, passing the question. Within the evaluate function, Giskard is going to repeatedly call my function, passing the different questions it needs to evaluate. You call this evaluate function and it gives you back a report.

When you run this, it takes a while, and remember, it's using GPT-4 behind the scenes. Without looking at the source code, I imagine what happens is this: it goes through all of the test cases; grabs the first one; sends that question to my chain, my RAG system; grabs the answer; and then uses GPT-4 to compare the answer from my chain with the reference answer we generated before. If the two are similar, if the answer looks correct, I get a point; if not, I don't. At the end, we can determine how accurate my system is: how many questions did it get right? That's what it should be doing behind the scenes.
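The call itself is short; this sketch follows the signatures in Giskard's RAG toolkit:

```python
from giskard.rag import evaluate

def answer_fn(question: str, history=None) -> str:
    # History is ignored: this chain doesn't keep conversation state yet.
    return chain.invoke({"question": question})

# Giskard calls answer_fn for every test question, then has GPT-4 judge
# each answer against the reference answer.
report = evaluate(answer_fn, testset=testset, knowledge_base=knowledge_base)
```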
So what is in that report? If you're working in a notebook, you can just display the report and see what it looks like; you can also open it as a web page. I'm displaying the report here, but I'm going to go to the web page version because it looks a little better; I opened this report after running it earlier. Here's what you get. First, there is a UMAP representation of my knowledge base. The more questions you have and the larger your knowledge base, the more interesting this gets: it shows you where the false, or incorrect, answers are located. If my knowledge base were bigger, this would tell me which areas of it are not well covered or are having problems; remember, I only have 10 documents here, which is why there are so few points. You also get a component analysis, which we'll talk about in a second: it gives a score for every single component of my RAG system. There are some recommendations, correctness by topic (I have only one topic on my website; this can get really complex with a larger knowledge base), and the overall correctness score, which is 73.33%. That's how good my system is right now.

Let's talk about the component analysis. I get an individual score for each component of a RAG system. The first one is the generator, and if you hover your mouse over it, you can see what that component is about: in this case, the large language model used in the chain to generate the answers. Depending on what each test case looks like, Giskard tries to evaluate each of these components separately. Now, in my simple chain I don't have a rewriter and I don't have a router. A rewriter would be a component in your chain that rewrites the question: when the user asks something that doesn't look quite right, you could have a component that rewrites the question in a way that's easier to answer, so it becomes more relevant. I don't have a rewriter, so obviously I'm not doing great on the types of questions that should be rewritten; I'm not doing too hot there. The retriever is what gets the most relevant documents from my vector store, so I should work a little on those embeddings, on how the similarity gets computed and how I fetch the relevant documents. This breakdown is great because it tells you exactly what you should be focusing on.

Let's go down a little. Here are my recommendations. You can save the report to HTML; that's the HTML document I showed you. You can also compute the correctness by question type, and that's what you get here: complex questions, 90% correct; conversational questions, 50% correct. That makes sense: I did not include a history in my chain. Remember, the ChatOpenAI model supports a conversation, it supports keeping context, and I did not use that, so I'm sure that by using it I can improve the conversational aspect of my RAG system, which I did not implement. Questions with distracting elements: only 50%, so questions generated with distracting elements did not score well, and I'll have to do better there. Double questions, simple questions, situational questions: 100%. This is gold, because it tells me how my system is doing and where I should focus to fix it.

By the way, there are no topics here, but Giskard has the ability to automatically generate topics based on your documents. If I had a bigger knowledge base, Giskard could recognize and generate different topics and then give you scores per topic, so you'd know, say, that anything related to price the LLM handles great, while some other topic it doesn't. I can also get the failures: if you want to know exactly which questions the system failed, you can get the list of failures and do whatever you want with it. Let me look at one: you can't read the whole question here, but it's "what does the machine learning systems course..." blah blah blah, with a reference answer and a conversation history. Look at this conversation history: see how there is conversation for some of the questions? I'm not supporting that right now, so I'm not surprised the system isn't doing great on those.
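For reference, here are the report helpers mentioned above; the names follow Giskard's RAG toolkit, so verify them against your installed version:

```python
display(report)                        # inline view in a notebook
report.to_html("report.html")          # the standalone web page version
report.correctness_by_question_type()  # e.g. conversational: 50%, simple: 100%
report.failures                        # the test cases the system got wrong
```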
All of that is awesome: if you stop the video right now, you already have a ton of value just from doing this. But there is more. This is great for running an evaluation of your system one time: just do it once, see how it's doing. But I actually want to automate this. I want to do it every time I push a change, or every time I'm ready to make a deployment: just run a test suite with all of my test cases. And beyond the auto-generated test cases, you can add your own as well; you can fix them, you can do whatever you want with them. The key is that I want to automate my tests. How do we do that? Let's take a look.

Here I'm just loading the test set from the JSON Lines file into memory; very simple. Then I can create a test suite: take the test set and generate a test suite from it. You give it a name, which will be referenced later, so whenever you run multiple test suites you know exactly which one is which. One line creates the test suite for me, and then I can run it. In order to run this test suite, I'm going to wrap my chain in a Giskard Model, a class that provides all of the information Giskard needs to run my tests. Look at this: the Giskard Model requires a prediction function, the model that's going to be answering, or solving, the test suite; the type of model, in this case text generation (you can also use Giskard for classification and that type of thing); a name and a description (always specify these two parameters; they help the judging model make decisions); and the feature name I care about, in this case "question"; that's the feature holding the question we need to answer.

Now look at this prediction function. I call it a batch prediction function. It's very similar to the answer function we created before to produce the evaluation report, but in this case I'm answering questions in batches: when running the test suite, Giskard is not going to go question by question, which would take a long time; it does this in batches. And what's cool about LangChain is that I can invoke a chain with a batch of inputs, and that's exactly what's happening here: this is my chain, and now I'm passing a batch, an array of questions; same thing as invoke, but in batches, so we can send multiple questions to the model at the same time without waiting for one answer before sending the next question. That makes this really fast. Very similar to before: I receive a DataFrame, go through all of the question values in it, and pass them as an array of maps with one attribute called "question".

With this model, I can now take the test suite and run it, passing the Giskard model. My test suite runs, and it says it succeeded with about 62%; that's the metric I get back. That is awesome, because now I can automate the process of running this test suite. And obviously, I can pull the metric and the result out of the results object, so I can automate something like: before deploying the model, make sure the test suite passed; if it didn't pass, don't deploy the model. That would be the way to automate this. One more thing: notice I'm displaying the results of the test here; you can see the test passed, the metric was 61.667, which is a pass, the name of the test suite, all of that good stuff.
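Here's a sketch of that automation step; giskard.Model and to_test_suite follow Giskard's API, while the name and description strings are illustrative:

```python
import giskard
import pandas as pd

def batch_prediction_fn(df: pd.DataFrame):
    # chain.batch answers many questions concurrently instead of one by one.
    return chain.batch([{"question": q} for q in df["question"].values])

giskard_model = giskard.Model(
    model=batch_prediction_fn,
    model_type="text_generation",
    name="Machine Learning School bot",                           # illustrative
    description="Answers questions about the ml.school program",  # illustrative
    feature_names=["question"],
)

test_suite = testset.to_test_suite("Machine Learning School test suite")
results = test_suite.run(model=giskard_model)
print(results.passed)  # gate a deployment on this flag
```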
The final thing I have to show you is how to integrate this with pytest. Why pytest? Because if you're not using pytest, you're not doing it correctly; pytest, in my opinion, is the best unit testing library there is for Python, so of course I jumped all over this when I saw you could integrate with it. In my example I'm using something I don't see many people using, because, guess what, people who use notebooks aren't thinking about testing their code (they should, but they're not): I'm using the ipytest library, which allows me to run pytest tests directly from my notebook, and it's great. So I install ipytest (not pytest, ipytest), and then I can use a cell magic, as you can see here: %%ipytest. That makes the cell runnable as tests: if I run this cell, it runs all the test cases inside, just as if I were running them with pytest, which is awesome.

Look at this code; it's very simple. I have only one test. I have a fixture that returns the dataset: it loads the test set, my 60 test cases, from the hard drive and turns that test set into a dataset, which it returns. Then I have a model fixture that returns the Giskard model I created before; I could also create it here inside, but I decided to reference the one outside. And then I have a single test that receives both fixtures, the dataset and the model, and uses a function called test_llm_correctness. There are a bunch of functions inside Giskard that you can use to test different aspects of a system; in this particular case I just care about whether the LLM is correct. I pass the model, I pass the dataset, and, very importantly, I pass a threshold. That threshold indicates how high my results need to be in order to declare the test successful. In this particular case the test passes because my threshold is under the 62% metric I'm getting. I'm not going to run it on screen because it takes a little time to answer all 60 questions, but trust me, when I run this it succeeds. If I set that threshold to 70% or 80%, the tests fail.

With this, you can see how to integrate it into your system: if you're using pytest, you can now write unit tests for your LLM application and evaluate it automatically, not just by calling John Doe or Mary Black and asking them to try a few questions and see if it works, which is what I've been seeing, and which is bad. With this, you can actually do it automatically.
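As a sketch, the notebook cell looks like this; test_llm_correctness is the Giskard test named in the video, though the exact import path and the dataset conversion are assumptions to verify against your version:

```python
# %%ipytest  <- cell magic, after `pip install ipytest` and `ipytest.autoconfig()`
import pytest

import giskard
from giskard.rag import QATestset
from giskard.testing.tests.llm import test_llm_correctness

@pytest.fixture
def dataset():
    testset = QATestset.load("test-set.jsonl")   # the 60 generated cases
    return giskard.Dataset(testset.to_pandas())  # assumed conversion path

@pytest.fixture
def model():
    return giskard_model  # the wrapped chain from the previous step

def test_chain_correctness(dataset, model):
    # Passes only if the judged correctness clears the threshold (62% > 50%).
    test_llm_correctness(model=model, dataset=dataset, threshold=0.5).assert_()
```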
Hopefully this makes sense and helps you. If you got all the way to the end, please like this video; it helps me understand whether this type of content is useful for you. I'll see you in the next one. Bye-bye.