Retrieval Augmented Generation - Neural NebulAI Episode 9
Summary
TLDR: This episode explores retrieval augmented generation (RAG), a technique for enhancing large language models' context with relevant external information so they produce more accurate responses. It breaks down the RAG architecture and discusses techniques like prompt engineering for natural-language-to-SQL translation that allow querying non-vector data stores. It also covers vector storage fundamentals such as choosing embedding models and vector sizes for performance and accuracy. The strengths of RAG include rapid proof-of-concept development, while challenges involve planning contextual data retrieval and updates. Overall, RAG combines the strengths of neural networks with robust data retrieval.
Takeaways
- Retrieval augmented generation (RAG) enhances LLMs with external contextual information during inference.
- RAG retrieves relevant information from documents to provide context to LLMs.
- RAG allows expanding context beyond LLM limitations in a cost-effective way.
- Manual implementation of RAG builds deeper understanding compared to abstraction libraries.
- Start simple: use Kendra for quick proofs of concept to understand data and access patterns.
- Telemetry for user requests, retrievers, and LLMs enables accuracy and performance optimization.
- Choosing optimal storage and indexes depends on data structure, query access patterns, and updates.
- Smaller embedding vectors currently provide better semantic search than larger ones.
- Order of operations: get a usable prototype, then focus on optimizing based on real user data.
- RAG's speed and expandable context make it an essential technique to evaluate.
Q & A
What is retrieval augmented generation (RAG)?
-RAG is the ability to enhance an LLM's context through external information at inference time. This allows the LLM to generate more accurate and relevant responses by supplementing its existing knowledge.
What are some benefits of using RAG?
-Benefits of RAG include lower cost compared to fine-tuning models, ability to quickly update information by modifying the external data source, and flexibility to combine semantic and non-semantic retrieval techniques.
What type of external data can be used with RAG?
-Many types of external data can be used with RAG including structured data like CSVs or databases as well as unstructured data that can be embedded like documents or webpages.
What are some best practices when implementing RAG?
-Best practices include optimizing retrieval, carefully selecting the chunks of data to return, using clear and annotated prompts, and avoiding overloading the context window with too much irrelevant information.
How can embeddings be used with RAG?
-Embeddings allow unstructured text data to be represented as numeric vectors that can be efficiently searched. Tools like Cohere or Hugging Face can generate quality embeddings optimized for semantic search.
What are some strengths of Kendra for RAG?
-Kendra simplifies RAG by automatically handling data indexing, embeddings, retrieval APIs, and more. This makes it fast to get started even though it offers less customization.
What indexing algorithms work well for RAG?
-Approximate nearest neighbor algorithms like HNSW often provide the best performance for semantic similarity search used in RAG.
How can PostgreSQL be used for RAG?
-The PostgreSQL extension PG Vector enables efficient vector similarity search within PostgreSQL databases, providing a SQL interface for retrieval.
What data should be considered when choosing a vector store?
-The schema, query patterns, size, rate of change and other statistics about the data should drive whether something like PG Vector, MongoDB or a dedicated store like Pinecone is appropriate.
How can RAG systems be optimized?
-Continuous telemetry around query performance, accuracy, and user satisfaction can identify areas to optimize including prompts, chunking strategies, indexes, and more.
Outlines
Introducing Retrieval Augmented Generation
The hosts Rand and Clayton introduce the concept of retrieval augmented generation (RAG). They explain how RAG enhances LLMs by allowing them to incorporate external contextual information at inference time. They discuss RAG architecture, value propositions, applications, use cases, strengths and weaknesses.
Example Use Case: Employee Handbook
Clayton provides an example use case of using RAG with an employee handbook to answer questions about PTO. He explains how passing the handbook as context allows the LLM to search it and find answers, contrasting with asking ChatGPT directly.
Tradeoffs of Fine-tuning vs. RAG
Rand discusses tradeoffs of fine-tuning vs using RAG. Fine-tuning can improve accuracy but is expensive computationally. RAG is often more cost-effective and can achieve good accuracy by optimizing retrieval rather than the LLM.
Non-Vector and Multi-Modal Retrieval
Clayton and Rand discuss non-vector retrieval, using structured data and SQL queries as an alternative to vector similarity search. Rand shows an example prompt engineering SQL queries from natural language. They also mention multi-modal retrieval combining vector search and SQL.
Prompt Engineering for SQL Translation
Rand dives into an example prompt that translates natural language questions into SQL queries over a conference session database. He explains each part of the detailed prompt and how it guides the LLM to generate usable SQL.
Comparing Vector Quality and Context Sizes
The hosts analyze tradeoffs between semantic vector spaces like Titan vs. Cohere embeddings. Smaller output vectors currently enable better semantic search, and vector quality also depends on the embedding model's training data. Larger LLM context windows can reduce the need to optimize retrieval.
Kendra vs. Custom Retrieval Architectures
Clayton and Rand discuss strengths and limitations of using fully managed Kendra vs building custom retrieval architectures. Kendra simplifies initial POCs while custom allows more control and incremental enhancement.
Understanding Embeddings, Indexes and Algorithms
Rand provides background on semantic embeddings, storage indexes like KNN and HNSW, and similarity algorithms like cosine distance. He relates them to implementations in various databases for vector search.
Access Pattern Driven Development
Rand recommends optimizing vector storage based on analyzing query access patterns rather than guessing up front. Starting with something simple like Kendra can reveal patterns to optimize for production systems.
Key Takeaways on Retrieval Augmented Generation
To conclude, Rand and Clayton share key strengths like fast POCs and challenges like architecting retrieval properly. They recommend trying RAG manually before abstraction libraries.
Keywords
retrieval augmented generation (RAG)
prompt engineering
embeddings
context window
semantic retrieval
access pattern driven development
Kendra
Postgres + PG Vector
OpenSearch
reward modeling
Highlights
RAG is the ability to enhance the LLM's context through external information
Retrieval augmented generation and prompt engineering are often far less expensive and also sometimes more accurate
Retrieval augmented generation again doesn't guarantee accuracy, but because you're passing it in the context of the inference of the model you are more likely to get a correct result
I always do PG Vector, and then I do access pattern driven development and I look at the slow query log and then I adjust the HNSW index based on the slow query log
The way that I think of it is access pattern driven development
I took the exact same path that you took Clayton: I was at first thinking oh man, Kendra, it's so expensive, I don't have a ton of control, but the ease of use of being able to just turn it on and crawl some documents, I don't have to worry about embeddings, retrieval queries, annotation of the schema, or continuous crawling, Kendra just manages all of that for me and that really does let me get started a lot faster
If I had to predict the way most customers will evolve, they may start with Kendra just to do the proof of concept, as you're saying, and then evolve into multiple other data stores that allow for customization of the retrieval
Getting started is an art and optimizing is a science
Telemetry in your applications is so important, because if you're only measuring and optimizing for retrieval you could go and make the best retrieval in the world but your responses on the generative side still suck; your retrieval may be wicked fast and super performant, you can put all the documents in the world in it, but the actual output is still incorrect
There are real world examples where it makes sense to do both retrieval augmented generation and continuous pre-training and parameter-efficient fine-tuning and even full fine-tuning
Customers have been still exploring and are a little bit hesitant to go and index everything in the whole wide world
The way that LLMs work in real time is that they will take the context, the amount of information, the text that you're providing to them, and use the tokens in that text to generate responses to the tokens that you've passed in
Prior to gen AI existing, vector search databases were not optimized for large embeddings or vectors of that size; they were actually optimized for much smaller vectors
Retrieval augmented generation is like having a diary where I record what I had for lunch every day, whereas pre-training is I'll remember all the good meals that I had in my life and I can remember I like pizza, I like pasta, I like these things, but if you were to ask me what I had on December 25th of 1996 I would not be able to tell you
There are techniques like low-rank adaptations (LoRA) or QLoRA that can improve that, but the reality is retrieval augmented generation and prompt engineering are often far less expensive and also sometimes more accurate
Transcripts
welcome to neural nebula where we
unravel the Mysteries around artificial
intelligence clear up misconceptions and
explore its transformative impact on
Industries its various use cases and
important considerations you need as you
embark on an AI journey in today's
episode we'll dive into retrieval
augmented generation we'll break down
the architecture discuss its value for
contextual usage and applications and
use cases along with its strengths and
weaknesses my name is Randall Hunt VP of
strategy and innovation at Caylent and
joining me today is Clayton Davis our
director of cloud native applications
development hi Clayton hey Randall how
you doing today doing well uh and then
apologies if my voice is a little
crackly during this uh recovering from a
cold but hopefully we'll be able to get
through this with no issues but with
that said let's Jump Right In so
retrieval augmented generation commonly
referred to as rag
um there's probably a bunch of fun
musical jokes to make about that but you
know it's like rag time blues no rag is
the ability to enhance the llms context
through external information so that
external the way that llms work in real
time is that they will go and take the
context the the amount of information
the text that you're providing to them
use the tokens in that text in order to
generate uh responses to uh the tokens
that you've passed in now the llm by
itself does not have the ability to go
out and enhance its context with
additional information so external
tooling around the llm has to be built
in order to pass in that additional
context at the time of
inference um so Clayton does that kind of
jibe with your understanding of retrieval
augmented generation do you have any kind
of concrete examples you could throw in
uh no definitely I mean I think that
that's perfect and with the the context
Windows continually getting larger I
think rag fits more and more into like
the the first thing you should try
because you can do so much with it but I
mean I think I think the simplest um the
simplest use case and the one I I often
use is that you know if you internal to
a company and you're trying to figure
out how many days of PTO you have you
can't just ask chat GPT hey
how many days of PTO do I have at Caylent
because it doesn't have any of that
information it can't go find that
information um however if I pass an
additional context like our uh employee
handbook and I pass in the whole context
of our employee handbook to that and say
hey based on this employee handbook how
many days of PTO do I have uh now all a
sudden it has that extra context to be
able to go and search that information
and find that right and so um you know
it's a it's a very dumb down use case
and there's a lot of kind of tooling
around that I think that that that helps
enable it but at at its I think at its
Basics right if I copy and pasted the
entire employee handbook into chat GPT
and then asked it a question it's going
to be much better to to retrieve that
information for me so llms in general
the way that they are trained is by
taking huge amounts of tokens in an
unsupervised fashion and feeding them
into this these models these Transformer
networks and then we are seeing things
evolve beyond Transformers now so we're
seeing SSMs in the wild but that's
another topic and then it's asked to
predict what the next token is or what
the middle token is so uh llms can
become optimized by uh increasing the
number of tokens you're feeding in so
you could take for example the uh
employee handbook module that you were
just talking about and feed that into uh
the model as a training step and then
adjust the weight uh the problem is we
only see the emergent properties of
llms when we get to very large parameter
counts billions of of parameters in the
models and that requires thousands of
tokens per parameter so that means if
you have billions of parameters you need
trillions of tokens so to truly get uh
fine-tuning of a model it can become
quite cost prohibitive to feed in all
those additional tokens into the model
and then there are techniques like low
rank adaptations (LoRA) or QLoRA that uh can
improve that but the reality is
retrieval augmented generation and
prompt engineering are often far less
expensive and also sometimes more
accurate because feeding tokens into the
llm does not actually guarantee accuracy
it's just introducing that new
information into the llm whereas
retrieval augmented generation again it
doesn't guarantee accuracy but because
you're passing it in the context of the
inference of the model you are more
likely to get a correct result than if
you were to just do continuous
pre-training or fine-tuning so that's a
lot of the reason that retrieval
augmented generation is preferred over
other techniques that said there are
real world examples where it makes sense
to do both retrieval augmented
generation and continuous pre-training
and parameter efficient fine-tuning and
even full fine-tuning so if you are
introducing there's a popular model
that's available in Bedrock created by
meta called llama 2 uh if you are
creating uh a you know a new sort of
fine-tuned version of llama 2 and you're
introducing a lot of new vocabulary
let's say llama 2 doesn't have a lot of
exposure to say
um Hospital handwashing policies I'm
just I'm thinking of something
completely random and you wanted to
introduce all the vocabulary around that
to the model you could do a step of
continuous pre-training or parameter
efficient fine-tuning uh introducing all of
this new information into the model
introducing that new vocabulary those
new tokens to the model and then also
further enhance it with retrieval
augmented generation but the way that I
like to think about it and Clayton I'm going
to run this by you so tell me what you
think if this is because we could
probably start saying this to customers
if it works but the way that I've been
thinking about it is retrieval
augmented generation is like having a
diary where I record what I had for
lunch every day whereas pre-training is
I'll I'll remember all the good meals
that I had in my life and I can remember
oh I like I like pizza I like pasta I
like these things but if you were to ask
me what I had on December 25th of
1996 I would not be able to tell you uh
whereas if I go and look it up in my
diary I could tell you precisely oh I
had you know fig pudding and ham and all
this stuff does that make any sense no
and I I like that a lot because it's
it's um it's all it's the specifics
versus the summarization right it's kind
of where you end up but yeah know I
think that makes a lot of sense and you
can get very specific with the rag stuff
and and we've seen it with customers
already the the the real kind of
question that I think we face with
customers is with a combination of
prompt engineering and a combination of
retrieval augmented generation I feel
like we're able to solve close to 90% of
the use cases that our customers come to
us with but that last 10% is always very
interesting and we can sometimes eke out
another 5% just by doing additional
prompt engineering and then there are
some Advanced Techniques for retrieval
augmented generation things like HyDE
which is um creating a hypothetical
document from other documents and
combining sources
but I think 90% in many use cases is
actually sufficient and that's sort of
been my operating thesis so far there
are of course exceptions where that's
not the case and I think we've been
proving that out right I think customers
have come to us thinking like hey we we
need to we need to build our own model
deploy our own model run our own model
and you know we talk them off that ledge
we like hey like just give us a couple
weeks let let us try to do it with Rag
and and prompt engineering and it and it
works pretty well and the way you know
one of the ways i' I've explained to the
customers is that you know using just
prompt engineering and just RAG you're
using these things out of the box right
and so when the next one comes out it's
a lot easier to upgrade and move versus
if you're doing all this tweaking and
tuning and running it yourself you've
got a lot of a you've got a lot bigger
lift to to upgrade or hey let me just
try this new model out real quick and
see what it does um you know you can do
that with Rag and prompt engineering you
once you start fine-tuning it's going to be
a more expensive you know quick little
trial to see if it's going to work
better or not and there there are a
couple of different specific ways of
going about retrieval augmented
generation so at its core you can
imagine you know the very simplest form
of retrieval augmented generation would
be to take a single document and to
parse it and create embeddings from it
those embeddings would be sort of the
tokens that represent the document and
store it in some sort of uh embedding
search
space and then you would take the users
prompt that you were going to use to
create an inference from the large
language model you would use that prompt
to search the embedding space that you
have just created and return relevant
pieces of that file uh and instead of
returning the embeddings raw you're
returning text from that document or you
know it doesn't necessarily have to be
text but we're primarily talking about
large language models so we'll imagine
it's all text for now you're returning
the the relevant sections of that
document putting it into the context of
the inference and saying hey answer
these questions um and to your point
earlier as the context becomes larger
and larger there are different tradeoffs
to be made um for instance the Llama 2
models while they are quite efficient
and have a a very great sort of um
operational cost a great per token cost
it's much lower than that of some of the
other proprietary
models the context size is only 8K uh
and then 4K in some cases so there's not
so much context that you can put into to
the model at inference time which in the
the way that that impacts things is if
you're using a llama 2 or a model or a
llama 2 variant something like a Mistral
or a Mixtral you would then optimize your
retrieval rather than optimizing the llm
so you would optimize your retrieval by
doing additional indexing and making
sure the chunks that you're returning
are small but uh relevant to the query
that the user is making whereas uh and
this is actually always a good thing to
do always optimize your retrieval but
the impetus or the motivation to
optimize your retrieval is actually
lower the larger the context window uh
as long as cost is not a concern because
keep in mind you're paying for these
models in a in a per token way and I
think that's an important distinction is
that as you change your model not only
does your prompt engineering change your
method of retrieval changes as well and
then the techniques that you're using
for retrieval uh and even the techniques
you're using for storage and search
change yeah and there's um there's
another consideration there too like yes
cost should drive like how how many
results you going to return is it going
to be three is it going to be 10 but you
also have to know what your data is
because it may be that you know in the
employee handbook if that's all you
vectorize and you put in the database
and you're going to ask about PTO well
there's one section in the employee
handbook about PTO if you're going to
return 10 different results you're going
to start getting things that don't have
the similarity that you need and you're
going to start summarizing the wrong
things and so there is some data science
data engineering around like what am I
indexing and based on the questions I'm
asking like are there going to be 10
good similar results that I expect to
come back or is there really just one or
two and then I don't have that much data
and so you also to think about you know
what's going to be relevant you know
don't just I've got a huge context
window let's return it all like well
maybe not right because then you're
going be summarizing things that don't
make sense and there's also you know
they've done analyses on the models and
the the ability to retrieve from the
context is not perfect so you can do
what's called a needle in a haystack
search of the context that you're
providing where you can provide you know
the Claude 2 models or the open AI chat
GPT models you know huge huge amounts of
context and then say return this one
obscure fact from the context and uh it
has recency bias so things that are very
very you know close to to the bottom of
the context or very close to the top of
the context are more likely to be
returned than stuff that's in the middle
um and then it does lead to different
kinds of optimization techniques to your
point and I I think that's interesting
and it's something that we've definitely
developed quite a bit of experience with
over the last you know year now of of
messing around with these systems and
and building production quality
applications for them but we're also
starting to do work well we we have been
doing work not just with semantic
retrieval which is the the embeddings
and the searching we've been doing uh
retrieval in a non-semantic fashion as
well so basically translating say
natural language input into a structured
query language and using the results
that it returned from that SQL query to
then infer an additional answer we built
this uh project called the uh reinvent
session concierge where we took all of
the and this is in our public GitHub
repo and uh for those of you who may be
listening I I might show this a little
bit later but so you can go watch the
Youtube video but I'll also just post a
link to the code as well but the way
that this works is that we
will basically go out and query the
postgress database that we created of
all of this um all of these reinvent
and we'll query it using natural
language and we'll have the Claude 2 model
translate that natural language into a SQL
query get the results feed those into
the context of the response and say okay
you know Claude model please now
synthesize a response based on the
user's original question so that's a
very interesting use case and uh you
know we chose to use postgres for that
which is something we can talk about a
little bit later but um there are a lot
of other options too open search for
example now allows for for hybrid
queries so you can do both semantic and
lexical search
simultaneously uh and you can normalize
the scores between them as well which
can be really really powerful and I know
Clayton that you've done quite a bit of
work with open search so far and maybe
you could tell us a little bit about
that yeah no I mean I do want to touch
on one thing on your non Vector
retrieval stuff because I um I think
it's been one of the more surprising
things that that we you know we've got
50 POCs or so under our belt and and you
know we've started having to categorize
them as you know you know rag non rag
but then within rag is it Vector search
or is it not and I think a lot of people
jump straight to like oh hey the the rag
concept like you use a vector database
you do this and and it's important to
try to figure out when you use one and
when you don't and and you know
structured data just doesn't go well
into a into a vector store and so you
know you mentioned postgress but we've
done things where customers are giving
us CSVs of files or of data and you
know we're using something like PySpark
to help the LLM generate a query and
query directly in a CSV because it's
structured right just because it's not
in a database doesn't mean you need to
vectorize it right and so even like you
know customers providing us like hey you
know we've got these four or five Excel
sheets we'd like you to to do RAG on you
know put them into Kendra it's like well
no I don't think we need to put them
into Kendra I think we just need to you
know build you know train the model to
know which one to look at and then query
them with with PySpark and so you know
there's there's a lot of non Vector
retrieval stuff you can do especially
with with structured data but um but
yeah back to open search like I I
thought I think open search is great I
you know um same with PG Vector they
give you a lot of flexibility and a lot
of customization and so once you get
past the POC stage you know you can do a
lot where you're doing Vector queries
but then you can put a ton of metadata
around it to know exactly what you um
you know is it is it the text you're
doing do you need a link to where that
that document actually came from you
know we're doing one right now where
there's a a question and answer and so
we're talking about vectorizing the
question so you can search based on the
question but then as part of the
metadata you're pulling back the answer
because that's what you want to
summarize and so you know you're not
just chunking out the data you're
actually doing it in a very specific way
just because you know what data you have
and what you're expecting to ask and
what you're expecting to get back so
there's a lot of flexibility when you
when you start using you know your own
Vector stores
what's fascinating to me is that we've
sort of made this transition of forcing
ourselves to learn a new query language
every time into basically using English
as the query language and forcing
ourselves to teach in the context window
of the prompt this is how you can
interpret these different things that a
user might ask and translate that into
this sort of query um and again I know
that we don't have uh a ton of people
who will will be watching the screen but
this is what I'm showing on screen right
now as an example of one of the prompts
that we used in the reinvent session
Navigator and we we store these prompts
in Dynamo DB so that we can version them
uh by date and time and so if we improve
one we can ab test it and go back but
you can see uh this is a Claude prompt and
what it's saying is given an input
question use postgresql syntax to
generate a syntactically correct
postgres SQL query from the following
table session uh this table represents
events for a conference the table schema
will be contained within schema which
means I'm going to pass in the schema at
runtime because I'm going to derive that
and cache it uh as we change the schema
over time and then it says the query
should be read only write a query in
between uh SQL uh XML tags and you know
then I say in all caps and and that's
the craziest thing about uh some of the
these models is that they respond to
emotional language and they respond to
all caps in a way that they do not
respond to uh uh non all caps or or non
emotional language so important to note
all fields that you include in the WHERE
clause should also be included in the
SELECT clause the reason for this is
that when we return the results we want
to be able to render it into like a
pandas data frame or into a table or
something rather than just getting a raw
answer with no context uh except the
embedding field uh don't include this in
the select Clause this will Aid in the
generation of the result and then we
annotate the schema so we we have
another important note here where we're
walking through how to use PG vector and
how to use the embeddings and the Syntax
for the embeddings the only reason we
include this is that it may not be in
the base models knowledge set because it
is a fairly new extension well I mean PG
Vector is not a fairly new extension but
some of the syntax uh is new and it's
really taken off recently so we want to
reinforce that context of the model uh
and then we you know pass in the schema
and then we pass in a number of example
questions so you know one example
question is how many sessions are in the
Venetian and then we pass in the the
example generated query and this is an
example of prompt engineering where
we're saying select count from session
where venue name ILIKE Phoenician um
and we can do other more complex queries
and we try to give it a a diverse set of
things we try to introduce the idea that
queries should use ORs instead of ANDs
so they should be as inclusive as
possible as opposed to being as
exclusive as possible because we want to
return results this may differ for other
applications that we build but in the
case of the reinvent session concierge
we're trying to return uh as many
relevant results as possible
and then we you know if we go and we
look at the schema um we annotate the
table schema so we we don't and i' I've
switched to a different uh Dynamo DB
entry now where I've walked through hey
this is the schema of the table and
these are all the fields and this is
what they are talking about and I give
the type of it you know a tag topic is
an array of strings topics covered in
session for example
and you know uh the session type is a
string or it's an enum these are all
really really powerful prompt
engineering techniques that can
drastically enhance the results and then
you can put in an additional important
notes for instance I say session ID
string almost always return this field
that is a a common kind of technique
that we can use in The annotation of
this retrieval augmented generation to
say no matter what query you're running
running this is relevant information
that we need even if we don't
necessarily end up rendering it to the
end User it's still something that we
can use uh on our side to to kind of do
analytics and things like that uh but I
I like walking through this example
because I think it's illustrative of
taking natural language and querying a
non Vector store uh it's also been
enhanced a little bit with an in Vector
store at the bottom here in terms of
embeddings and that gives you that combo
of both traditional search uh lexical
search and the vector search and we've
really seen this sort of stuff start to
take off within our customers I I mean
to Clayton's point just a moment ago
there there have been some very
interesting techniques that have been
developed and uh there's a lot of
tooling now as well but I think you know
now that we've sort of explored the use
cases we've explored some of the
techniques for optimization and some of
the techniques
for the prompt engineering side of
things I think it's time to take a step
back and and go back to what uh
embeddings are and why we use them and
some of the tradeoffs that come with
embeddings talking with someone today
actually about embeddings and Incredibly
powerful um but one of the things that
I've noticed is that all the tooling out
there like Kendra or LangChain make it
so people don't actually see the
embeddings or use the embeddings anymore
they're kind of like they're they're
abstracted away right um so I think it's
important to understand what embeddings
are and why we use them and and the
power they bring because like I mean the
example you were showing right isn't
like you didn't populate that with Lang
chain that was very much more manual
right and so like once you understand
that you can get a you can you can do
RAG 2.0 almost right like because you
can get a lot more advanced with it and
I think that's a very important point
though Clayton is what we did there is a
manual implementation of retrieval
augmented generation but tools like
LangChain do make it much easier and uh that
comes at the cost of flexibility I would
actually advise anyone listening to go
and implement retrieval augmented
generation manually you know using
whatever programming language you want
do that at least once that way you will
really develop the kind of deeper
understanding of what's
happening because LangChain abstracts you
know it it it takes away a lot of the
complexity and you you sometimes miss
exactly what's happening under the hood
yeah with a vector store I think is
important right like not just a database
but doing it with with a vector store
and and embedding your own data and like
so um there's some blog posts out on the
Caylent website where I took uh I think OSHA
data right just publicly accessible data
and made like a chat bot for it right
just to just to try it out and so if you
don't have data just go find a data set
out online and use that as your data
set exactly so when it comes to
embeddings there we have a couple
choices within the AWS ecosystem um you
know what I've seen used most is
actually the hugging face embeddings um
there's also the open AI uh I think
they're called the ada-002 embeddings um
but within Bedrock you have the choice
of the Titan embeddings and these are
multimodal embeddings as well so that
means it can take both image data and
Text data and a vector similarity would
be the same so if you were to store the
text goldfish for example and a picture
of a goldfish they would both correspond
to the same rough Vector that's really
cool and exciting and it unlocks a lot
of um interesting search
techniques uh but keeping it kind of
focused on primarily large language
models our experiments have shown that
the cohere embeddings are some of the
best and the reasoning for this is uh a
number a number of different things so
first first of all uh there is the size
of the the output Vector so the Titan
embeddings for example will output a
vector of the size 1536 which means
there's 1536 numbers um in the array
that is the vector representing the
output of up to an 8K context of the
Titan embedding
now the problem with this is that it
doesn't often prior to gen existing the
vector search databases were not
optimized for large embeddings of that
size or large vectors of that size they
were actually optimized for much smaller
vectors um that problem is going away as
as people realize the the value and the
capabilities of slightly larger vectors
but the um coher embeddings are
outputting both 512 and one24 variants
which are more searchable uh we found
and then the hugging face outputs are
512 and like 700 or something uh and
then the open AI ones are also 1024 so
these these smaller embeddings um these
smaller output vectors we've actually
found that the performance of the
smaller output vectors is higher so when
you get better semantic similarity
search and there's a lot of common
benchmarks and techniques that you know
you can go read blog posts and and
academic papers about all of those but
all of the academic research currently
backs up that situation that the smaller
vectors uh are are better for search I
expect that to change so I expect as the
hardware improves and as the the large
Vector search support improves that will
not necessarily be the case and so as
you generate these embeddings you know
you have to store them somewhere
and like I said uh there's a secondary
point so we've talked about the size of
the vector as the first point but
there's also a secondary point which is
the quality of the vector
so um there's there's two things to
break down the quality of the vector
here as well and that is the set of
input tokens that go in uh if a model
was trained primarily on English for
example and had a much smaller
representation of other scripts or other
languages then it may count additional
tokens in a lower quality lower
correctness lower Precision Vector as
output if you were to put the same uh
input data in so like if you were to say
uh you know flower in English and then
Blumen or whatever in German you may
not get the same Vector depending on how
that embedding model was trained uh
we've actually seen this manifest with
some of our customers we had a customer
that was parsing a lot of
um uh East Asian languages texts things
like uh everything from Hindi to
Mandarin to Cantonese to to Japanese and
and a little bit of Russian language as
well so this customer was parsing all of
these scripts that were not well
represented or not as well represented
in the embedding that we were using
prior to switching to Cohere and we
were also getting charge for more tokens
for the non-english script so for
example the word hello in English would
charge us only one token and it would be
28 tokens if we did the same thing in
Hindi so this caused us to look at some
different embedding models and we
arrived at the coher model and we found
the output and the performance of that
was actually excellent so we've been
using cohere um for a couple of
different customers for the embedding
side now uh most of that still I would
say in the the proof of concept phase
although because once you generate all
of these embeddings you're you're going
to pay a fee to basically take your
entire data set send it into these
embedding models and then uh store the
output vectors so customers have been
still exploring and and a little bit
hesitant to go and index everything in
the whole wide world uh that said you
know we're we're in January of 2024 now
and we are starting to see customers
move into production with that which is
exciting because it's um it's really
letting us play with Vector storage at
scale um have have you seen this Clayton
the the kind of differences in
embeddings and then also I guess the
Comon the comparison to
Kendra yeah we haven't done a ton of uh
trialing other embeddings right like we've
done a ton of POCs right it's usually
pick one and get it to work and then you
come back and you can try different
things um we've done a lot with Kendra um
Kendra is almost uh it's it's like we
mentioned with LangChain uh it
abstracts a lot of the things away from
you um I you know early on I was not a
fan of Kendra but I have since become
more of a Kendra Fanboy as as time has
gone on just it it makes I mean we
talking about customers and Vector
stores and and POCs and showing the
power of of of
gen AI Kendra is is the quickest and
easiest way to do it it does come at a
little bit higher cost than some of the
other things but like you it's got so
many connectors you just point it at
something and say go it's going to index
it for you um it's you know the
retrieval is all done via an API clear
via an API it's all like it's it's magic
right but it once again it's out of the
box um which has its pros and cons right
you know the things that you were
showing with the uh with the the the
reinvent search like that does not work
with Kendra you could like your tool would
be much worse if you used Kendra um
however I would argue that if you
started with Kendra kind of as a proof
of concept you would have been able to
see that like hey this is going to work
here are the next steps we need to take
to make this better and that's that's
what we're seeing a lot with customers
right like you know let's start with
Kendra let's get it done in a week or
two weeks um and then now we understand
your data we understand what you're
trying to do and we can better say like
hey I think these are the steps you need
to take to get to the the MVP of what
you're trying to do next right to get it
in front of clients or whoever your
customers are going to be I agree
entirely and and that I I took the exact
same path that you took Clayton is I was
at first thinking oh man Kendra you know
it's it's so expensive I don't have a
ton of control but the ease of use of
being able to just turn it on crawl some
documents I don't have to worry about
embeddings I don't have to worry about
uh the retrieval queries or or
annotation of the schema or you know
continuous crawling anything Kendra just
manages all of that for me and that
really does let me get started a lot
faster I if I had to predict the way
most customers will evolve they may
start with Kendra just to do the proof of
concept as you're saying and then evolve
into multiple other data stores that
allow for customization of the retrieval
um and and kind of reach these multi-
retrieval scenarios that that's my
prediction I'm wrong all the time though
so we'll see what happens um I agree I
like I I think one of the one of the
biggest learnings we've had going
through these POCs is that um it it's
not like vector store isn't the answer
to everything with rag right and I I
think going into these we thought uh we
thought Vector store like was like
everything's got to be Vector Store
everything's going to be similarity
search and so we we we we went for that
um but as we're going through these
things we're seeing like hey you know
your data is structured or semi
structured um you know let's try to
let's try to be more creative at how
we're getting that data out or hey some
of your data is structured and some of
it's not so instead of shoving
everything into Kendra let's shove some
of it into Kendra but let's also take
some of it and and try to just query it
directly and so like to your multi
retrieval point like I I do definitely
think that's the way things are going to
head um as people understand Vector
stores and the data that they that they
need to answer these questions they're
try to ask
absolutely so I want to walk through
some of the algorithmic and index
differences very briefly and then I
think we can talk um about all the
different options and data stores and
how they Implement these things and then
I think we'll sort of close everything
out but uh you know the the core sort of
algorithms that we're using right now
are like Euclidean distance that's the
traditional sort of search for
um uh vectors and then there's things
like inner products and there's things
like cosine uh cosine distance or cosine
similarity and the way that I tell
people to think about the different uh
versions of these algorithms is some of
them work on non-normalized data which
means they take into account both the
amplitude like it's a true vector and
that it's taking into account both the
the um amplitude and the direction and
then others really only care about the
direction and less about the amplitude
now the algorithm that you use to search
things uh you can sometimes use
different algorithms on the same kind of
index but I think cosine similarity is
really what uh a lot of the industry has
arrived at in terms of how they're
searching and then in terms of indexing
you know people used to use um IVFFlat
which is like an inverted flat file um
where you're you're basically
it's an inverted index so this is a very
common full text search technique it's
been around for a long time but then
another technique that people have been
using more recently and the technique is
actually quite old I think it's from
probably the 60s but or 70s um the and
it's a statistical technique it's called
K nearest neighbor uh but optimized
versions of that uh like approximate K
nearest neighbor are how a lot of
indexes are being searched these days is
and there's a lot of tuning that you can
do with that algorithm and with that
index to if you have a good
understanding of what your underlying
data is or if you've done some
statistical analysis on your data a
K nearest neighbor search can be
extremely efficient and extremely
powerful and accurate and then more
recently uh I think since a couple years
ago people have started using HNSW which
is I think it's hierarchical or
highly navigable small world um someone
can correct me in the comments I guess
because I can't remember all the the
definitions but HNSW is actually a
really really powerful uh
search implementation that um is in my
opinion like a much better version of K
nearest neighbor for generating correct
Vector search results and semantic
results
um now
HNSW was not available in postgres prior to
I think 2023 I I I think it yeah I think
it in 2023 is when uh HNSW got added to
PG vector and the way that that came
about is in my opinion a very
interesting story so Supabase and AWS
the two companies they collaborated with
the person who maintains PG Vector I
pardon me I can't remember their name
but uh they worked together basically to
get an HNSW implementation into postgres
uh which really um made postgres sing
in terms of vector search performance so
uh open search also got a lot of really
great uh additions into the ability to
tune these these K nearest neighbor indexes
so so if you think about it you know
stores like Pinecone Pinecone is a
specialized Vector store that uses HNSW
I think it also has a couple of other
options PG Vector has HNSW it has
IVFFlat or
IVF um it has uh
you know a couple other kind of tuning
options and size options and then open
search uses K nearest neighbor and then
Kendra we don't really know what it's using
although from our experience it seems
pretty clear it's using some form of K
nearest neighbor um but that's sort of
abstracted away from you you don't have
to care about what the underlying
algorithm is and then
mongodb also uses K nearest neighbor and
uh
one of the interesting differences
between mongodb Atlas and its
implementation of vector search versus
Amazon's document DB and its
implementation of vector search is that
document DB uses the HNSW search so it
actually outperforms the KNN search from
mongodb which honestly is like one of
the few times I've seen doc DB really
get something right where MongoDB
might have gone down the wrong path um
so I me I'm very interested and excited
to see how that develops but now that
we've covered sort of all the stores and
all of their algorithms and by the way
that was not by any means an exhaustive
store uh covering of all the different
algorithms and techniques there's a lot
a lot of different things out there so
we will with after covering all of those
like with with gen AI in general it's a lot
of trial and error test repeat um do you
recommend similar for trying to choose
the way you index stuff different stores
like do you recommend trying a couple or
or like I guess more specifically the
the way you do your index searching
right like is there a right is there a
wrong is there a you know test a few of
them I think that's a great question
Clayton the way that I think of it is
access pattern driven development so if
I know in advance what my access
patterns are going to be I can do a and
so so the ideal world right if
everything is going perfectly which by
the way in our business it never does um
but if if everything is going perfectly
prior to getting started I know all of
my access patterns and I know the
statistical Telemetry information about
all of my data I know you know how how
often I am I adding new data what is the
size of my data what's the average chunk
size what's the average change data all
of those sorts of things if I have all
of that information I can mathematically
derive the correct index and the correct
technique since that's never actually
happened in real life in the history of
Consulting um what we typically do is if
we have the data available we'll do some
statistical analysis of the data to
understand uh you know we can do things
like n-grams or there's all kinds of
different techniques and and our data
science team would be better equipped to
really talk through all of the
techniques there but there are ways that
we can say oh this data set is going to
be very amenable to approximate K nearest
neighbor or this data set is actually
perfect for HNSW with you know this
kernel size and you know this amount of
of top K and this amount of top P um
those are just sort of parameters that
you can put in um so that is often what
we'll start with is we'll say well no
actually often what we start with is
Kendra because then we don't care about
it and we get to see some of the access
patterns uh of real world usage so uh
we'll start with like a a KNN or an HNSW
style index in a traditional data store
and then sometimes it depends on the
customer's existing data store right you
don't want to introduce a brand new
database into your architecture just for
the purposes of vector storage um so if
you have an existing postgres
deployment PG Vector is probably the way
that you want to go because you already
have the organizational muscle memory
around maintaining and running postgres if
you have an existing MongoDB cluster maybe
the mongodb KNN is the way you want to
go because you already have that
existing mongodb experience um there are
exceptions to that rule so Greenfield
will often pull in a brand new database
for people just because it's it's being
created new there's not a ton of stuff
that we have to do um but my my
preference my personal preference not
this is not a
Caylent endorsed opinion this is just a
Randall opinion I always do PG vector
and then I do access pattern driven
development and I look at the slow
query log and then I adjust the HNSW
index based on the slow query
log awesome and then last question is so
for people listening that that you know
you you you did tell them to go and do
this by themselves right and go to do it
there isn't a wrong answer when you're
POCing right like you can use any of
these they're going to work one may just
work better than another and and you I
you know it matters more when you get to
production but from a POC standpoint
like pick one use it move forward it's
it's still going to work and and I've said
this before in other episodes but I
think getting started is an art and
optimizing is a science so you know
getting that canvas
from a a blank page into oh that's a
Taylor Swift thing right is blank space
baby yeah going from Blank Space baby
into a real world like this is a thing
that's working you can then go and
optimize and you can programmatically
optimize which means you can store an AB
test and you can compare the full thing
and that's why Telemetry in your
applications is so important because if
you're only measuring and optimizing for
retrieval you could go and make the best
retrieval in the world but your
responses on the the generative side
still suck and it's because your your
retrieval may be um wicked fast super
performant you know you can put all the
documents in the world in it but the
actual output is still incorrect so
maybe you're chunking it wrong maybe all
this is wrong so you really need to have
continuous Telemetry from you know in
user generated request to llm to
retriever back to LM back to end user
result and then tie that all together
and it takes time um and you want to
look at like your tm95 and your tm99
sort of results from both a performance
and from an accuracy standpoint yeah
it's fun I mean this is that's the fun
part for me is once we have something
working and real users are are playing
around with it we get to really dive
deep into the data and optimize it that
that optimization component will
sometimes like rewrite an entire prompt
because we realized oh you you know we
90% of this goes
unused um but yeah I mean we we've
really had a a great time experimenting
with this and there's one last technique
that happens in retrieval augmented
generation that can be very helpful and
this is possible in both Kendra and in
um open search and it's called reward
modeling um well it's not really reward
modeling it's more of a filter that's
applied after the search um reward
modeling is more of a training technique
but you're basically up voting or down
voting answers and fragments of
documents that are returned based on the
users input query and this can improve
the overall semantic search performance
over time for all of your
users um so we'll talk about that in
another episode because I think reward
modeling on both the search side and on
the training side is worth diving in
deeper but I just want to close this out
by saying that you know there's a lot of
strengths to retrieval augmented
generation it is absolutely a technique
that you should investigate it is
absolutely a technique that is
worthwhile learning and understanding um
I think some of the challenges that come
with it are understanding the costs both
the embedding cost the the chunking the
um the keeping the models up to date and
uh establishing recency in some of the
retrieval augmented generation and then just
making sure the prompts and the
retrievers have accurate context and
that you're not not uh
overloading with irrelevant information
Clayton do do you have kind of like the
the pros and cons or strength and
challenges that you've seen in in your
work with the retrieval augmented
generation yeah I mean I think the
biggest strength is the is the time to
POC right you can go from idea to POC very
quickly with with retrieval augmented
Generations um and one of the the
biggest challenges that I think was a
surprise to me as we've been doing all
these POCs is
um you need to think and architect about
how you're going to retrieve it whether
it's Vector search whether it's you're
just going to store a CSV whether it's
going to be a normal database but like
the the way you retrieve your data
matters because you for rag to work you
do have to get the right data into the
context window and shoving it on into a
vector store may not guarantee you're
getting the right data and so um
thinking about how your data is is it
structured is it unstructured to your
point how often does it update how do
you feed that updated data in like
there's a lot of there's a lot of
architecture around around the data um
but you know from a POC standpoint like
take a snapshot of it don't worry about
updating the data you know just make
sure it works but you know make sure
you're retrieving it in the most
efficient way possible to get the
correct data into
the amazing Clayton thanks again so much
for joining us I always enjoy chatting
with you my friend and with that uh I
will close us out so uh this concludes
our episode of Neural NebulAI we hope we
leave you with something to think about
as you pursue innovating with AI if you
enjoyed the show please leave us a
rating and review on the platform you're
listening or watching on and don't
forget to subscribe so you never miss an
episode thanks for listening we'll see
you next
time