Enhance RAG Chatbot Performance By Refining A Reranking Model
Summary
TL;DR: This video outlines a workflow that uses Labelbox to enhance the retrieval step of a custom chatbot. It demonstrates how a reranker model, fine-tuned with human-in-the-loop context evaluation, can improve document retrieval. The process involves embedding documents into vector space, fine-tuning the model with human annotations, and testing the model's performance. The video highlights the importance of human expertise in refining AI responses, showcasing an effective technique for improving chatbot accuracy.
Takeaways
- The script demonstrates a workflow for improving a retrieval process using the Labelbox platform.
- A custom chatbot interacts with an internal corpus of documents through a retrieval method to find relevant chunks of information.
- A reranker model is used to enhance retrieval performance by re-ranking initial search results, which may not always be contextually relevant.
- A human-in-the-loop step is introduced to evaluate text chunks against search queries, using human judgment to refine the reranking model.
- The script shows a baseline retrieval experiment that uses a single, consistent Transformer model to embed documents into vector embeddings.
- As an example, the extensive NFL rulebook is embedded into a vector index for efficient retrieval.
- The script details how to create a training dataset for fine-tuning the reranker by extracting the top chunks for each query and sending them for human review.
- The Labelbox SDK is used to format and submit text queries and response chunks to the Labelbox platform for human review.
- Pre-labels are generated using a model such as OpenAI's GPT-4 to help annotators label data efficiently.
- Annotation projects are set up in Labelbox so human annotators can review and adjust the pre-labels, improving the relevance judgments for retrieved chunks.
- After human review, the annotations are exported in JSON Lines format to fine-tune the reranker, incorporating human expertise into the model's learning process.
- The fine-tuned reranker is then used to re-rank query-response pairs, aiming to improve the context provided to the language model for generating responses.
Q & A
What is the primary goal of using Labelbox in the described workflow?
-The primary goal of using Labelbox in the workflow is to improve the retrieval process by fine-tuning a re-ranker model with human-in-the-loop feedback, enhancing the retrieval of relevant information from an internal corpus of documents.
What is a 'chunk' in the context of this script?
-In the context of this script, a 'chunk' refers to a segment of text retrieved from the internal corpus of documents, which is used as context for the language model to generate a response.
Why is a re-ranker model used in the retrieval process?
-A re-ranker model is used to improve retrieval performance by re-evaluating the initial top results and potentially promoting more contextually relevant information that might be ranked lower by the initial retrieval algorithm.
What role does the human loop play in the fine-tuning process of the re-ranker model?
-The human-in-the-loop step involves expert annotators who evaluate the relevance of each text chunk to the search query, providing feedback that is used to fine-tune the re-ranker model so it aligns with human expertise.
How does the script suggest improving the baseline retrieval performance?
-The script suggests improving baseline retrieval performance by using a re-ranker model that has been fine-tuned with human feedback, which helps in retrieving more contextually relevant information.
What is the significance of the NFL rules PDF in the provided example?
-The NFL rules PDF serves as an extensive document corpus for the example, which referees or a chatbot would need to be familiar with to answer queries related to NFL rules effectively.
What is the purpose of pre-processing in the context of this script?
-Pre-processing in this context involves loading documents into a consistent Transformer model to embed them into vector embeddings, which is a foundational step for both experiments and the retrieval process.
How does the script handle the limitation of token length in LLMs?
-The script addresses the token length limitation by selecting only the top two documents as context for the LLM model, which helps in managing the input size while still providing relevant information.
What is the process of exporting annotations and ground truth data from Labelbox?
-The process involves using the export section of the labeling project in Labelbox, which provides the code needed to export all ground truth data. This data is then formatted into JSON Lines format suitable for fine-tuning the re-ranker model.
How does the fine-tuning process of the re-ranker model work?
-The fine-tuning process involves feeding the model labeled data in JSON Lines format, containing positive and negative examples. The model learns to score the relevance of responses, with the goal of improving the ranking of contextually relevant information.
What is the final step in the workflow after fine-tuning the re-ranker model?
-The final step is to replicate the experiment using the same LLM and embedding models but now with the fine-tuned re-ranker model to rank all query-response pairs, aiming to provide improved context for generating responses.
Outlines
Introduction to Labelbox for Retrieval Process Improvement
The video introduces the use of the Labelbox platform to enhance a retrieval process. It explains the workflow in which a custom chatbot uses a retrieval method to search an internal corpus of documents for relevant text chunks, which are then used as context for a language model to generate responses. It outlines how a reranker model can improve retrieval performance by surfacing contextually relevant documents that do not initially rank high in the search results, and introduces a human-in-the-loop process to evaluate chunks and fine-tune the reranking model with human expertise.
Setting Up the Labelbox Platform for Human-in-the-Loop Annotation
The script proceeds to demonstrate the setup for a training dataset on the Labelbox platform. It describes the process of extracting top chunks from initial data for human review and the use of the Labelbox SDK to format and submit text queries and response chunks. The video shows how to use the platform's catalog to manage tasks and how to prepare data for human annotators, including the use of model predictions to provide pre-labels and reduce the workload for annotators.
Utilizing Model Predictions and Annotation for Fine-Tuning
The script explains how model predictions can assist human annotators in the labeling process on the Labelbox platform. It details the creation of an annotation project with a specific ontology for assessing the relevance of text chunks. Pre-labels generated by an AI model are used to expedite annotation. The video also covers how to export annotations and ground truth data in JSON Lines format, which is needed for fine-tuning the reranking model.
Fine-Tuning and Testing the Re-Ranking Model
The final part of the script discusses the fine-tuning process of the re-ranking model using the exported annotations and ground truth data. It illustrates how to select a model, input sample query and response pairs, and utilize the fine-tuned model to rank query-response pairs. The script concludes with a demonstration of how the re-ranking model, when combined with an LLM model, can improve the context provided for generating responses, thereby enhancing the overall retrieval process.
Keywords
Labelbox
Retrieval method
LLM (Large Language Model)
Re-ranker model
Human-in-the-loop
Vector embeddings
PDF
Chunk size
Model Foundry
JSON Lines format
Fine-tuning
Highlights
Introduction to the workflow using the Labelbox platform to improve the retrieval process.
Explanation of the custom chatbot and the retrieval method to enhance response generation.
Importance of fine-tuning a reranker model to improve retrieval performance.
Use of a human-in-the-loop process to evaluate and improve the relevance of retrieved text chunks.
Description of the baseline experiment for retrieval using a Transformer model for vector embeddings.
Introduction of an example using an NFL rulebook PDF to demonstrate the retrieval process.
Details on defining chunk size and storing it in a vector database for improved retrieval.
Steps to test the baseline performance of retrieval with no reranking.
Creation of a training dataset for fine-tuning the reranker model.
Explanation of the process to extract top five chunks for each query and submit them to the Labelbox platform.
Overview of the Labelbox platform and the process of importing text queries and response chunks.
Description of the ontology needed for the reranker model, focusing on relevant and non-relevant classifications.
Utilization of Model Foundry to generate pre-labels and assist annotators in the labeling process.
Steps to set up the annotation project in the Labelbox platform.
Details on exporting annotations and ground truth data for fine-tuning the reranker model.
Explanation of fine-tuning the reranker model using JSON Lines data and sample queries.
Analysis of the fine-tuned reranker model's performance on sample data.
Replication of the experiment using the same LLM and embedding models with the fine-tuned reranker model.
Comparison of baseline responses with responses improved by the fine-tuned reranker model.
Conclusion highlighting the effectiveness of the human-in-the-loop process in improving retrieval and response quality.
Transcripts
Hello, let me show you a quick workflow for using the Labelbox platform to improve a retrieval process. Typically, it works like this: the user asks a question, it goes through a custom chatbot, and a retrieval method looks through your internal corpus of documents to retrieve the top few chunks. Those chunks are added as context to your LLM so it can generate a response. The way we're going to use Labelbox is to fine-tune a reranker model. A reranker can help improve your retrieval performance because many studies have shown that contextually relevant information might not appear in the top five results of your initial retrieval algorithm; it can be stuck at lower levels of relevance. So we're going to use a human-in-the-loop process to evaluate each chunk of text as it pertains to the search query, and then use those judgments to fine-tune the reranking model so it can pull context from documents ranked far lower and use it to formulate a better response. Let me show you how this works.

I'm going to start off by running this pre-processing notebook to show you a baseline experiment for retrieval. I load in some documents, and as you can see, I use one consistent Transformer model to embed my documents into vector embeddings; we keep this model the same for both experiments. For the sake of our example today, we upload a PDF containing all of the NFL rules, a document every NFL referee probably has to be familiar with. It's quite extensive: the rulebook runs to 245 pages. Next we formulate our splitting strategy by defining a chunk size; in many retrieval use cases this is another set of parameters you can tune to improve retrieval. We save the result as a vector index, and you could save it into your vector DB as well.
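As a rough sketch of what that pre-processing step might look like in code (the PDF file name, chunk size, embedding model, and the use of pypdf and FAISS are my assumptions, not the exact stack shown in the video):

```python
# Minimal sketch of the pre-processing step, assuming pypdf,
# sentence-transformers, and faiss. Model name and chunk size are illustrative.
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer
import faiss

reader = PdfReader("nfl_rulebook.pdf")  # hypothetical file name
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Naive fixed-size splitting strategy; chunk size is a tunable parameter.
chunk_size = 500
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# One consistent embedding model for both the baseline and reranked runs.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(chunks, convert_to_numpy=True)

# Store the embeddings in a vector index (a vector DB would also work).
index = faiss.IndexFlatIP(embeddings.shape[1])
faiss.normalize_L2(embeddings)
index.add(embeddings)
```

Fixed-size character splitting is the simplest strategy; sentence- or section-aware splitters are common alternatives worth tuning.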
We can test this out to see how well it performs. Here's a sample query a user might ask the chatbot, and you can see that the most relevant result consists of this chunk of text right here. Let's evaluate the baseline retrieval performance, with no reranking, for our NFL documents using this embedding model and a baseline LLM. Here's our list of queries we want to ask the chatbot, and because of token limitations in many LLMs, we take just the top two documents and use them as context for the LLM. Here's the model name; we got it off Hugging Face. Here's a sample question and the result generated from that context. Now we run it across our whole list of questions, get a corresponding response for each based on the retrieved context, and save everything to a CSV file that we can use later on.
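Continuing the sketch, the baseline loop might look like the following. It reuses `embedder`, `index`, and `chunks` from the previous snippet, and the prompt template and `google/flan-t5-base` model are illustrative placeholders rather than the video's actual choices:

```python
# Sketch of the baseline: retrieve top-2 chunks, feed them as LLM context.
from transformers import pipeline
import pandas as pd
import faiss

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embedder.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(q)
    _, ids = index.search(q, k)
    return [chunks[i] for i in ids[0]]

llm = pipeline("text2text-generation", model="google/flan-t5-base")  # illustrative

queries = ["How many points is a touchdown worth?"]  # sample query list
rows = []
for query in queries:
    context = "\n\n".join(retrieve(query, k=2))  # top two chunks only
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    answer = llm(prompt, max_new_tokens=128)[0]["generated_text"]
    rows.append({"query": query, "response": answer})

pd.DataFrame(rows).to_csv("baseline_responses.csv", index=False)
```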
Okay, so that's our baseline experiment. Let's move on to creating the training dataset for fine-tuning our reranker model. Here we extract the top five chunks for each query in our initial dataset, and these response chunks are the data we'll send over to the Labelbox platform for the human-in-the-loop review process. To send it to Labelbox, we import the Labelbox SDK, format our pandas DataFrame a little, and create the asset objects so we can submit all of our text queries and response chunks to the Labelbox platform as a task. There are no errors, as you can see, so let's head over to the Labelbox platform. I'm in the Labelbox catalog, and my job was successful: we have 354 user queries, each with its top five most relevant chunks, all loaded in as text assets within the Labelbox platform. This is the platform where we'll do the human-in-the-loop review.
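A minimal sketch of that upload with the Labelbox SDK might look like this. The dataset name, global keys, and the way the query and chunks are packed into one text asset are assumptions; recent SDK versions accept raw text in `row_data` for text assets, while older ones require a URL to a hosted .txt file:

```python
# Sketch: submit query + top-5 chunk assets to Labelbox for human review.
import labelbox as lb

client = lb.Client(api_key="YOUR_API_KEY")  # placeholder key
dataset = client.create_dataset(name="reranker-training-data")

assets = []
for i, query in enumerate(queries):
    top5 = retrieve(query, k=5)  # from the earlier sketch
    body = f"QUERY: {query}\n\n" + "\n\n".join(
        f"CHUNK {j + 1}: {c}" for j, c in enumerate(top5)
    )
    # Raw text in row_data works for text assets in recent SDK versions.
    assets.append({"row_data": body, "global_key": f"query-{i}"})

task = dataset.create_data_rows(assets)
task.wait_till_done()
print(task.errors)  # None means the job succeeded
```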
Before moving further, it might make sense to show you what a reranker model looks like and what kind of data it accepts for fine-tuning. I'll head over to the Cohere website, and you can see we need to upload our data in JSONL (JSON Lines) format. This is very relevant to our ontology: for each chunk we'll have to flag it as relevant or not relevant. That's all we need; the ontology is fairly simple, two options for each chunk of text.
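For reference, one line of that JSON Lines training file looks roughly like the dictionary below. The field names follow Cohere's rerank fine-tuning documentation as I understand it, so treat them as an assumption and verify against the current docs before uploading:

```python
# One training example, serialized as a single line of the .jsonl file.
# Field names ("relevant_passages", "hard_negatives") are assumed from
# Cohere's rerank fine-tuning docs; the texts are illustrative.
import json

example = {
    "query": "How many feet in bounds constitute a catch?",
    "relevant_passages": [
        "A player makes a catch when he secures the ball and touches both feet in bounds ..."
    ],
    "hard_negatives": [
        "A touchdown counts six points for the scoring team ..."
    ],
}
print(json.dumps(example))
```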
Back in the Labelbox platform, before I create the annotation project with the ontology we just learned about, let's use Model Foundry to take a first pass. Your expert annotators are trained for efficiency and accuracy, but we can take some weight off their shoulders by generating a set of pre-labels they can rely on. We select all the data within this dataset and click the 'Predict with Foundry' option. As you can see, there are different options for passing your data through an initial set of foundation models, and you can choose any of them to generate pre-labels based on your prompt.

Let me show you how this works. Here I have an OpenAI GPT-4 model that I've configured. As you can see, I uploaded the ontology, with its two responses per chunk, and for each chunk I wrote a prompt asking whether the chunk is relevant to the query, or negative if it doesn't contain the information needed to answer it. With this prompt and ontology I can generate some previews from my OpenAI model. After a few seconds the previews are generated, and for each query-chunk pair we get a pre-label reflecting whether GPT-4 judges the chunk relevant or not relevant to the user's question. If I'm satisfied with these responses, I can submit the job and apply these classifications across the entire dataset, so my annotators have some pre-labels to work with.
Once the model job is finished, I can click on each row of my dataset back in the catalog and check out the predictions. Now I have pre-labels for every chunk within each row of my dataset, and once I send this over to my annotators, they're going to save a lot of time. So let me show you how to set up that annotation project.

We head over to the Annotate component of the Labelbox platform and create a new project. We're working with text, so we select the text modality, give it a name, maybe 'demo' with today's date, and confirm the project. Now all we need to do to finalize the configuration is add the data and set up the ontology and labeling experience. Let's set up the ontology first: we head over to Settings, edit the standard labeling editor, and choose the ontology we want to use, in our case the retrieval ontology. This gives a preview of what the ontology looks like and the guardrails for how annotators can annotate each data row. As you can see, we have five global classifications, one for each chunk, and following the guideline Cohere provided earlier, each retrieved chunk gets two options: relevant or not relevant. This is the human-in-the-loop process we've defined, so let's save it. All that's left is to add the data: we head over to the catalog, select all the data rows from the dataset we uploaded earlier, and add them to this annotation project. We choose the project we just created, include the model predictions from our GPT-4 model to give our annotators a head start, and we could also set a priority. Once everything is configured, we submit this list of data rows to our annotation project as a job.
Back in our annotation project, we can confirm that the data rows have indeed been added. You can see we have a data row, and now our expert team of annotators, who are familiar with the rules of the NFL, can start the labeling process. Because we added pre-labels to our Labelbox workflow, the results of the GPT-4 model have been automatically populated into the labeling editor. As an annotator, this saves me a lot of time: instead of starting from scratch, I can simply review and adjust each of these pre-labels, use hotkeys to speed up my labeling task, and hit submit to move on to the next data row. Once my labeling is finished, that human-in-the-loop process is complete, so let's export all of our annotations and ground truth data to fine-tune the model. If you remember from earlier, we're going to need it in JSON Lines format.
Heading over to the export section of our labeling project, it gives us the code we need to export all of our ground truth data. Let me show you how that works. We paste in that chunk of code with our project ID, initialize the Labelbox client, and then break down the JSON and convert our labels into a DataFrame for easier visualization. We can also save it as a CSV and convert it into the JSON Lines format that open-source libraries or Cohere use to fine-tune a reranking model. One line of that JSON contains the query, your positive examples as a list, and your negative examples as a list. We write this out to a JSON Lines file, and we're ready to start the reranking fine-tuning process.
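A sketch of that export-and-convert step is below. The `export_v2` call matches the Labelbox SDK's export API in recent versions (older and newer releases differ), and the structure of the exported JSON depends on your ontology, so `parse_label` is a hypothetical helper standing in for the ontology-specific parsing:

```python
# Sketch: export ground truth from Labelbox and write reranker JSONL.
import json
import labelbox as lb

client = lb.Client(api_key="YOUR_API_KEY")
project = client.get_project("YOUR_PROJECT_ID")

export_task = project.export_v2(
    params={"data_row_details": True, "label_details": True}
)
export_task.wait_till_done()

with open("reranker_train.jsonl", "w") as f:
    for row in export_task.result:
        # Hypothetical helper: recover the query and the per-chunk
        # relevant / not-relevant answers from the label payload.
        query, positives, negatives = parse_label(row)
        f.write(json.dumps({
            "query": query,
            "positive_examples": positives,
            "negative_examples": negatives,
        }) + "\n")
```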
We head over to the next notebook, and the first step is to select the model. You can use Cohere, but in this case we're using an open-source model we found on Hugging Face. From this model's documentation, we can pass in sample query-response pairs, say the same query with two different answers, and it tells us which one is more relevant; the lower the score, the less relevant the response. That's how this model works. Let's feed in the JSON Lines file containing our labels to fine-tune this reranker model. As you can see from the log files, across a few epochs our loss starts to decrease, and our learning rate decreases as well.
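One way to fine-tune an open-source reranker of this kind is with the `CrossEncoder` class from sentence-transformers, sketched below. The base model, epochs, and batch size are assumptions rather than the video's exact settings:

```python
# Sketch: fine-tune a cross-encoder reranker on the exported JSONL labels.
import json
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", num_labels=1)

train_samples = []
with open("reranker_train.jsonl") as f:
    for line in f:
        row = json.loads(line)
        for pos in row["positive_examples"]:
            train_samples.append(InputExample(texts=[row["query"], pos], label=1.0))
        for neg in row["negative_examples"]:
            train_samples.append(InputExample(texts=[row["query"], neg], label=0.0))

loader = DataLoader(train_samples, shuffle=True, batch_size=16)
model.fit(train_dataloader=loader, epochs=2, warmup_steps=100)
model.save("finetuned-reranker")
```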
Now we can test our model on some sample data. Here's a quick query, 'How many points is a touchdown worth?' under certain conditions, along with all of its positive and negative examples. From our human-in-the-loop process, this data row was given three positive examples and two negative examples. The key thing to note is that what matters is the relative order of the scores, not their absolute values. As expected, the three positive examples all receive relatively high scores and the two negative examples receive very low scores, which shows that our model is fine-tuned to the examples we provided.
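Checking a fine-tuned cross-encoder on a sample query might look like this; the candidate texts are illustrative, and the point is the relative ordering of the scores, not their absolute values:

```python
# Score each candidate against the query and inspect the relative ordering.
query = "How many points is a touchdown worth?"
candidates = [
    "A touchdown counts six points for the scoring team.",   # positive
    "The try after a touchdown may add one or two points.",  # positive
    "Each half begins with a kickoff.",                      # negative
]
scores = model.predict([[query, c] for c in candidates])
for c, s in sorted(zip(candidates, scores), key=lambda x: -x[1]):
    print(f"{s:+.3f}  {c}")
```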
The next step is fairly simple: we want to replicate our experiment, with the same LLM and the same embedding model, but rank all of our query-response pairs with the reranker model we've fine-tuned. We load in our queries again with all of their labels and responses, and we can take a look at the score distributions for relevant versus non-relevant labels. As you can see, the non-relevant scores are typically negative, while the histogram of relevant scores sits higher and is typically positive.

Okay, let's put it all together. With the reranking method, we now retrieve the top 20 documents for each query and then use our fine-tuned reranking model to pick the top two results as context for the same LLM. We initialize the same embedding model from Hugging Face and retrieve the top 20 responses for each query. So for each query we have a list of 20 responses, and we use the fine-tuned reranker to extract only the top two. We then bring in our LLM, the same baseline model as before, and for each query we feed in the top context produced by our fine-tuned reranker. We save this as a CSV file.
file and then we can just do a simple
inner join to compare the responses for
each question
the updated response right from our fine
tun ranker model in our initial
response and just taking a look at
initial pass of our
responses uh you can see here that some
responses are better right you know this
ranker method is certainly not a Magic
Bullet uh some results are definitely
better uh others are a little bit
worse uh but as you can see overall this
ranker method does work to match the
pattern of your data he uses the human
Loop process to incorporate real human
expertise so that you can't get a better
answer such as this one below here right
uh how many Feats in bounce to
constitute a catch in the NFL right it
everyone who watches NFL knows that is
two feet not
20 so hopefully this video shows you how
you can use the label boox platform to
incorporate human loot process to
fine-tune a ranker model as part of your
retrieval process while this is not a
Magic Bullet this is definitely a
technique which can help improve your
customized chat bots in other retrieval
processes thanks