Retrieval Augmented Generation - Neural NebulAI Episode 9

Neural NebulAI
29 Feb 2024 · 47:43

Summary

TL;DR: This episode explores retrieval augmented generation (RAG), a technique to enhance large language models' context with relevant external information for more accurate responses. It breaks down RAG architecture and discusses techniques like prompt engineering for natural language to SQL translation that allow querying non-vector data stores. It also covers vector storage fundamentals like optimal embedding models and vector sizes for performance and accuracy. The strengths of RAG include rapid proof of concept development, while challenges involve planning contextual data retrieval and updates. Overall, RAG combines the strengths of neural networks with robust data retrieval.

Takeaways

  • 😊 Retrieval augmented generation (RAG) enhances LLMs with external contextual information during inference.
  • 🔎 RAG retrieves relevant information from documents to provide context to LLMs.
  • 💡 RAG allows expanding context beyond LLM limitations in a cost effective way.
  • 📚 Manual implementation of RAG builds deeper understanding compared to abstraction libraries.
  • 😮 Start simple - use Kendra for quick proofs of concept to understand data and access patterns.
  • 📊 Telemetry for user requests, retrievers and LLMs enables accuracy and performance optimization.
  • 🔍 Choosing optimal storage and indexes depends on data structure, query access patterns and updates.
  • ⬆️ Smaller embedding vectors currently provide better semantic search than larger ones.
  • 🚦 Order of operations - get a usable prototype, then focus on optimizing based on real user data.
  • 👍 RAG's speed and expandable context make it an essential technique to evaluate.

Q & A

  • What is retrieval augmented generation (RAG)?

    -RAG is the ability to enhance an LLM's context through external information at inference time. This allows the LLM to generate more accurate and relevant responses by supplementing its existing knowledge. (A minimal code sketch of this flow appears just after this Q&A list.)

  • What are some benefits of using RAG?

    -Benefits of RAG include lower cost compared to fine-tuning models, ability to quickly update information by modifying the external data source, and flexibility to combine semantic and non-semantic retrieval techniques.

  • What type of external data can be used with RAG?

    -Many types of external data can be used with RAG including structured data like CSVs or databases as well as unstructured data that can be embedded like documents or webpages.

  • What are some best practices when implementing RAG?

    -Best practices include optimizing retrieval, carefully selecting the chunks of data to return, using clear and annotated prompts, and avoiding overloading the context window with too much irrelevant information.

  • How can embeddings be used with RAG?

    -Embeddings allow unstructured text data to be represented as numeric vectors that can be efficiently searched. Tools like Cohere or Hugging Face can generate quality embeddings optimized for semantic search.

  • What are some strengths of Kendra for RAG?

    -Kendra simplifies RAG by automatically handling data indexing, embeddings, retrieval APIs, and more. This makes it fast to get started even though it offers less customization.

  • What indexing algorithms work well for RAG?

    -Approximate nearest neighbor algorithms like HNSW often provide the best performance for semantic similarity search used in RAG.

  • How can PostgreSQL be used for RAG?

    -The PostgreSQL extension PG Vector enables efficient vector similarity search within PostgreSQL databases, providing a SQL interface for retrieval.

  • What data should be considered when choosing a vector store?

    -The schema, query patterns, size, rate of change and other statistics about the data should drive whether something like PG Vector, MongoDB or a dedicated store like Pinecone is appropriate.

  • How can RAG systems be optimized?

    -Continuous telemetry around query performance, accuracy, and user satisfaction can identify areas to optimize including prompts, chunking strategies, indexes, and more.
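
To make the first answer above concrete, here is a minimal sketch of the RAG flow in Python. The embed, vector_store.search, and llm.generate calls are hypothetical stand-ins for whatever embedding model, vector store, and LLM client you actually use; this illustrates the pattern, not the episode's implementation.

```python
def answer_with_rag(question: str, vector_store, embed, llm, k: int = 3) -> str:
    """Minimal retrieval augmented generation loop (illustrative only)."""
    # 1. Embed the user's question into the same vector space as the documents.
    query_vector = embed(question)

    # 2. Retrieve the k most similar chunks from the external data source.
    chunks = vector_store.search(query_vector, top_k=k)

    # 3. Assemble a prompt that supplies the retrieved chunks as context.
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"Question: {question}"
    )

    # 4. Ask the LLM to generate an answer grounded in that context.
    return llm.generate(prompt)
```

Keeping k small and the chunks tightly scoped is what the best-practice answer above means by not overloading the context window with irrelevant information.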

Outlines

00:00

😊 Introducing Retrieval Augmented Generation

The hosts Rand and Clayton introduce the concept of retrieval augmented generation (RAG). They explain how RAG enhances LLMs by allowing them to incorporate external contextual information at inference time. They discuss RAG architecture, value propositions, applications, use cases, strengths and weaknesses.

05:00

😃 Example Use Case: Employee Handbook

Clayton provides an example use case of using RAG with an employee handbook to answer questions about PTO. He explains how passing the handbook as context allows the LLM to search it and find answers, contrasting with asking ChatGPT directly.

10:01

🤓 Tradeoffs of Fine-tuning vs. RAG

Rand discusses tradeoffs of fine-tuning vs using RAG. Fine-tuning can improve accuracy but is expensive computationally. RAG is often more cost-effective and can achieve good accuracy by optimizing retrieval rather than the LLM.

15:02

💡 Non-Vector and Multi-Modal Retrieval

Clayton and Rand discuss non-vector retrieval, using structured data and SQL queries as an alternative to vector similarity search. Rand shows an example prompt engineering SQL queries from natural language. They also mention multi-modal retrieval combining vector search and SQL.

20:03

📝 Prompt Engineering for SQL Translation

Rand dives into an example prompt that translates natural language questions into SQL queries over a conference session database. He explains each part of the detailed prompt and how it guides the LLM to generate usable SQL.
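
The sketch below paraphrases the kind of prompt described in this segment; it is illustrative, not the team's actual DynamoDB-versioned prompt. The schema and the example question/query pairs are assumed to be injected at runtime.

```python
# Illustrative natural-language-to-SQL prompt template (paraphrased from the
# episode's description, not the production prompt).
PROMPT_TEMPLATE = """Given an input question, use PostgreSQL syntax to generate a
syntactically correct, READ ONLY SQL query against the `session` table.
The table schema is contained within the <schema></schema> tags below.

IMPORTANT: every field used in the WHERE clause must also be included in the
SELECT clause, except the `embedding` field, which must never be selected.
Prefer OR over AND so the query is as inclusive as possible.
Write the query between <sql></sql> tags.

<schema>
{schema}
</schema>

Example question: How many sessions are in the Venetian?
Example query: <sql>SELECT COUNT(*), venue_name FROM session WHERE venue_name ILIKE '%venetian%'</sql>

Question: {question}
"""

def build_sql_prompt(schema: str, question: str) -> str:
    # The schema is derived from the live database, cached, and injected here.
    return PROMPT_TEMPLATE.format(schema=schema, question=question)
```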

25:04

🤔 Comparing Vector Quality and Context Sizes

The hosts analyze tradeoffs between different semantic vector spaces like Titan vs. Cohere embeddings. Smaller vectors can enable better search but may have lower quality. Larger context windows for LLMs can reduce the need to optimize retrieval.

30:05

🛠 Kendra vs. Custom Retrieval Architectures

Clayton and Rand discuss strengths and limitations of using fully managed Kendra vs building custom retrieval architectures. Kendra simplifies initial POCs while custom allows more control and incremental enhancement.

35:06

🔍 Understanding Embeddings, Indexes and Algorithms

Rand provides background on semantic embeddings, storage indexes like KNN and HNSW, and similarity algorithms like cosine distance. He relates them to implementations in various databases for vector search.
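
For reference, the two similarity measures named here follow the standard definitions (not anything specific to the episode). For vectors $a$ and $b$:

$$d_{\text{Euclidean}}(a, b) = \sqrt{\textstyle\sum_i (a_i - b_i)^2}, \qquad \text{cosine\_similarity}(a, b) = \frac{a \cdot b}{\lVert a \rVert\,\lVert b \rVert}, \qquad \text{cosine\_distance}(a, b) = 1 - \text{cosine\_similarity}(a, b)$$

Cosine similarity looks only at direction and ignores magnitude, which is the normalized vs. non-normalized distinction discussed in this segment.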

40:08

📈 Access Pattern Driven Development

Rand recommends optimizing vector storage based on analyzing query access patterns rather than guessing up front. Starting with something simple like Kendra can reveal patterns to optimize for production systems.

45:08

🏁 Key Takeaways on Retrieval Augmented Generation

To conclude, Rand and Clayton share key strengths like fast POCs and challenges like architecting retrieval properly. They recommend trying RAG manually before abstraction libraries.

Keywords

💡 retrieval augmented generation (RAG)

A technique for enhancing LLMs' context through external information at inference time. This allows the LLM to generate more accurate, contextual responses by supplementing its existing context with relevant external data passed in by additional tooling at query time. The video discusses RAG's architecture, use cases, and optimizations.

💡 prompt engineering

The practice of carefully designing and structuring the prompts provided to LLMs in order to guide them towards the desired output. This includes techniques like emotional cues, annotations, explanations, and examples. The video covers how prompt engineering can be cheaper and sometimes more accurate than fine-tuning models.

💡 embeddings

A vector representation of text data that allows for semantic similarity search. The video compares different embedding models like Cohere and discusses search optimization techniques like approximate nearest neighbors.
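
A small, self-contained illustration of why embeddings enable semantic search: documents and queries become vectors, and relevance becomes a vector computation. The toy 4-dimensional vectors below are made up; real embedding models output hundreds to thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (closer to 1.0 = more similar)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" for two document chunks (real models output 512-1536+ dimensions).
doc_vectors = {
    "pto_policy":     np.array([0.9, 0.1, 0.0, 0.2]),
    "expense_policy": np.array([0.1, 0.8, 0.3, 0.0]),
}
query_vector = np.array([0.85, 0.15, 0.05, 0.1])  # imagined embedding of "How much PTO do I get?"

# Rank document chunks by semantic similarity to the query.
ranked = sorted(doc_vectors,
                key=lambda name: cosine_similarity(query_vector, doc_vectors[name]),
                reverse=True)
print(ranked[0])  # -> "pto_policy"
```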

💡 context window

The amount of text context an LLM can process at inference time. Larger context windows allow more external data to be incorporated via RAG. Smaller windows require better optimized retrieval.
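
A hedged sketch of respecting a context window when assembling retrieved chunks: keep only the most relevant chunks that fit the budget. The 1.3 tokens-per-word ratio is a rough heuristic of my own; a real implementation would use the target model's tokenizer.

```python
def fit_chunks_to_context(chunks: list[str], max_context_tokens: int,
                          reserved_tokens: int = 1000) -> list[str]:
    """Keep the most relevant chunks (assumed pre-sorted) that fit the token budget."""
    budget = max_context_tokens - reserved_tokens  # leave room for instructions + answer
    selected, used = [], 0
    for chunk in chunks:
        approx_tokens = int(len(chunk.split()) * 1.3)  # crude words-to-tokens estimate
        if used + approx_tokens > budget:
            break
        selected.append(chunk)
        used += approx_tokens
    return selected

# An 8K-context model (like the Llama 2 variants discussed) leaves far less room
# for retrieved chunks than a 100K+ context model, so retrieval must be tighter.
top_chunks = fit_chunks_to_context(["chunk one ...", "chunk two ..."], max_context_tokens=8192)
```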

💡 semantic retrieval

Retrieving text via semantic similarity searches using embeddings rather than traditional keyword searches. This allows incorporating conceptual relevance into the search process. The video discusses combining semantic and non-semantic retrieval.

💡 access pattern driven development

An iterative development approach that optimizes data stores and indexes based on empirical understanding of how users access the data. This allows matching the implementation to real-world usage telemetry.

💡 Kendra

Amazon's fully managed intelligent search service with built-in document ingestion and retrieval APIs. The video discusses how starting with Kendra can accelerate initial RAG proofs of concept due to its ease of use.
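
A sketch of using Kendra as the retriever step of a RAG pipeline, assuming an index already exists and has crawled your documents. The index ID is a placeholder, and the field names follow the Kendra Retrieve API as I understand it, so verify against current boto3 documentation.

```python
import boto3

kendra = boto3.client("kendra", region_name="us-east-1")

def retrieve_passages(question: str, index_id: str = "REPLACE-WITH-YOUR-INDEX-ID", k: int = 5):
    """Fetch relevant passages from a Kendra index to feed into an LLM prompt."""
    response = kendra.retrieve(IndexId=index_id, QueryText=question, PageSize=k)
    return [
        {
            "title": item.get("DocumentTitle"),
            "uri": item.get("DocumentURI"),
            "text": item.get("Content"),  # excerpt that goes into the LLM's context
        }
        for item in response.get("ResultItems", [])
    ]

passages = retrieve_passages("How many days of PTO do I have?")
```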

💡 Postgres + PG Vector

An open-source relational database with the PG Vector extension for adding vector similarity search. The video advocates for this as a customizable RAG storage option.
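
A minimal pgvector sketch using psycopg: create a table with a vector column, add an HNSW index with cosine distance, and retrieve the closest chunks. Table and column names are illustrative, and the 1024-dimension size assumes a Cohere-style embedding model.

```python
import psycopg  # psycopg 3; requires the pgvector extension installed in PostgreSQL

DDL_STATEMENTS = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    """CREATE TABLE IF NOT EXISTS documents (
           id        bigserial PRIMARY KEY,
           content   text,
           embedding vector(1024))""",
    # HNSW index for approximate nearest neighbor search using cosine distance.
    """CREATE INDEX IF NOT EXISTS documents_embedding_idx
           ON documents USING hnsw (embedding vector_cosine_ops)""",
]

def search(conn: psycopg.Connection, query_embedding: list[float], k: int = 3) -> list[str]:
    """Return the k chunks whose embeddings are closest to the query embedding."""
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    # "<=>" is pgvector's cosine distance operator; smaller means more similar.
    rows = conn.execute(
        "SELECT content FROM documents ORDER BY embedding <=> %s::vector LIMIT %s",
        (vector_literal, k),
    ).fetchall()
    return [row[0] for row in rows]

with psycopg.connect("postgresql://localhost/ragdb") as conn:
    for stmt in DDL_STATEMENTS:
        conn.execute(stmt)
    # conn.execute("SET hnsw.ef_search = 100")  # tune recall vs. speed from real query patterns
    results = search(conn, query_embedding=[0.1] * 1024)
```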

💡 OpenSearch

An open-source search engine that allows hybrid semantic + keyword search. The video mentions its approximate k-NN vector search capabilities.
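
A sketch of an approximate k-NN query with the opensearch-py client, assuming an index named "documents" whose mapping includes a knn_vector field called "embedding" and has k-NN enabled. Host, index, and field names are placeholders; hybrid lexical + semantic scoring would layer a normal query clause and score normalization on top of this.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def knn_search(query_vector: list[float], k: int = 3) -> list[str]:
    """Approximate k-NN search against a knn_vector field named 'embedding'."""
    body = {
        "size": k,
        "query": {"knn": {"embedding": {"vector": query_vector, "k": k}}},
    }
    response = client.search(index="documents", body=body)
    return [hit["_source"]["content"] for hit in response["hits"]["hits"]]

chunks = knn_search([0.1] * 1024)
```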

💡 reward modeling

A technique for improving search quality over time by having users label good and bad result chunks. This allows search algorithms to learn from user feedback.

Highlights

RAG is the ability to enhance the LLM's context through external information

Retrieval augmented generation and prompt engineering are often far less expensive and also sometimes more accurate

Retrieval augmented generation, again, doesn't guarantee accuracy, but because you're passing it in the context of the inference of the model, you are more likely to get a correct result

I always do PG Vector, and then I do access pattern driven development: I look at the slow query log and then I adjust the HNSW index based on the slow query log

The way that I think of it is access pattern driven development

I took the exact same path that you took, Clayton: I was at first thinking, oh man, Kendra, it's so expensive, I don't have a ton of control. But the ease of use of being able to just turn it on and crawl some documents, I don't have to worry about embeddings, I don't have to worry about the retrieval queries or annotation of the schema or continuous crawling, Kendra just manages all of that for me, and that really does let me get started a lot faster

If I had to predict the way most customers will evolve, they may start with Kendra just to do the proof of concept, as you're saying, and then evolve into multiple other data stores that allow for customization of the retrieval

Getting started is an art and optimizing is a science

Telemetry in your applications is so important because if you're only measuring and optimizing for retrieval, you could go and make the best retrieval in the world but your responses on the generative side still suck. Your retrieval may be wicked fast, super performant, you can put all the documents in the world in it, but the actual output is still incorrect

There are real world examples where it makes sense to do both retrieval augmented generation and continuous pre-training and parameter efficient fine-tuning and even full fine-tuning

Customers have still been exploring and are a little bit hesitant to go and index everything in the whole wide world

The way that LLMs work in real time is that they will take the context, the amount of information, the text that you're providing to them, and use the tokens in that text in order to generate responses to the tokens that you've passed in

The problem with this is that, prior to generative AI existing, the vector search databases were not optimized for large embeddings or vectors of that size; they were actually optimized for much smaller vectors

Retrieval augmented generation is like having a diary where I record what I had for lunch every day, whereas pre-training is: I'll remember all the good meals that I had in my life, and I can remember, oh, I like pizza, I like pasta, I like these things, but if you were to ask me what I had on December 25th of 1996, I would not be able to tell you

There are techniques like low-rank adaptations (LoRA) or QLoRA that can improve that, but the reality is retrieval augmented generation and prompt engineering are often far less expensive and also sometimes more accurate

Transcripts

play00:11

welcome to neural nebula where we

play00:12

unravel the Mysteries around artificial

play00:14

intelligence clear up misconceptions and

play00:15

explore its transformative impact on

play00:17

Industries its various use cases and

play00:19

important considerations you need as you

play00:21

embark on an AI journey in today's

play00:23

episode we'll dive into retrieval

play00:24

augmented generation we'll break down

play00:26

the architecture discuss its value for

play00:28

contextual usage and applications and

play00:30

use cases along with its strengths and

play00:32

weaknesses my name is Rand hunt VPS

play00:34

strategy and Innovation at Kalin and

play00:36

joining me today is Clayton Davis our

play00:38

director of cloud native applications

play00:40

development hi Clayton hey Randall how

play00:43

you doing today doing well uh and then

play00:45

apologies if my voice is a little

play00:46

crackly during this uh recovering from a

play00:49

cold but hopefully we'll be able to get

play00:51

through this with no issues but with

play00:53

that said let's Jump Right In so

play00:56

retrieval on mention generation commonly

play00:58

referred to as rag

play01:00

um there's probably a bunch of fun

play01:02

musical jokes to make about that but you

play01:04

know it's like rag time blues no rag is

play01:08

the ability to enhance the llms context

play01:14

through external information so that

play01:18

external the way that llms work in real

play01:21

time is that they will go and take the

play01:25

context the the amount of information

play01:27

the text that you're providing to them

play01:29

use the tokens in that text in order to

play01:32

generate uh responses to uh the tokens

play01:37

that you've passed in now the llm by

play01:41

itself does not have the ability to go

play01:43

out and enhance its context with

play01:46

additional information so external

play01:49

tooling around the llm has to be built

play01:51

in order to pass in that additional

play01:53

context at the time of

play01:55

inference um so clayon does that kind of

play01:59

jum with your understanding of retrieval

play02:01

augment generation do you have any kind

play02:02

of concrete examples you could throw in

play02:05

uh no definitely I mean I think that

play02:07

that's perfect and with the the context

play02:08

Windows continually getting larger I

play02:10

think rag fits more and more into like

play02:13

the the first thing you should try

play02:15

because you can do so much with it but I

play02:16

mean I think I think the simplest um the

play02:20

simplest use case and the one I I often

play02:21

use is that you know if you internal to

play02:24

a company and you're trying to figure

play02:25

out how many days of PTO you have you

play02:28

can't just ask chat GPT hey

play02:30

how many days of PTO do I have a Kalin

play02:32

because it doesn't have any of that

play02:33

information it can't go find that

play02:34

information um however if I pass an

play02:37

additional context like our uh employee

play02:40

handbook and I pass in the whole context

play02:42

of our employee handbook to that and say

play02:44

hey based on this employee handbook how

play02:46

many days of PTO do I have uh now all a

play02:48

sudden it has that extra context to be

play02:50

able to go and search that information

play02:51

and find that right and so um you know

play02:54

it's a it's a very dumb down use case

play02:56

and there's a lot of kind of tooling

play02:58

around that I think that that that helps

play03:00

enable it but at at its I think at its

play03:02

Basics right if I copy and pasted the

play03:04

entire employee handbook into chat GPT

play03:06

and then asked it a question it's going

play03:07

to be much better to to retrieve that

play03:09

information for me so llms in general

play03:12

the way that they are trained is by

play03:14

taking huge amounts of tokens in an

play03:18

unsupervised fashion and feeding them

play03:21

into this these models these Transformer

play03:23

networks and then we are seeing things

play03:25

evolved Beyond Transformers now so we're

play03:26

seeing ssms in the wild but that's

play03:28

another topic and then it's asked to

play03:30

predict what the next token is or what

play03:32

the middle token is so uh llms can

play03:35

become optimized by uh increasing the

play03:39

number of tokens you're feeding in so

play03:41

you could take for example the uh

play03:43

employee handbook module that you were

play03:45

just talking about and feed that into uh

play03:49

the model as a training step and then

play03:51

adjust the weight uh the problem is we

play03:54

only see the emergent properties of

play03:57

llms when we get to very large parameter

play03:59

counts billions of of parameters in the

play04:02

models and that requires thousands of

play04:05

tokens per parameter so that means if

play04:07

you have billions of parameters you need

play04:09

trillions of tokens so to truly get uh

play04:12

fine-tuning of a model it can become

play04:14

quite cost prohibitive to feed in all

play04:16

those additional tokens into the model

play04:19

and then there are techniques like low

play04:21

rank adaptations or qow Ros that uh can

play04:25

improve that but the reality is

play04:28

retrieval augmented generation and

play04:30

prompt engineering are often far less

play04:32

expensive and also sometimes more

play04:35

accurate because feeding tokens into the

play04:38

llm does not actually guarantee accuracy

play04:41

it's just introducing that new

play04:42

information into the llm whereas

play04:45

retrieval augmented generation again it

play04:47

doesn't guarantee accuracy but because

play04:49

you're passing it in the context of the

play04:51

inference of the model you are more

play04:53

likely to get a correct result than if

play04:55

you were to just do continuous

play04:56

pre-training or fine-tuning so that's a

play05:00

lot of the reason that retrieval

play05:02

augmented generation is preferred over

play05:04

other techniques that said there are

play05:08

real world examples where it makes sense

play05:10

to do both retrieval augmented

play05:11

generation and continuous pre-training

play05:14

and parameter efficient fine-tuning and

play05:16

even full fine-tuning so if you are

play05:19

introducing there's a popular model

play05:21

that's available in Bedrock created by

play05:23

meta called llama 2 uh if you are

play05:26

creating uh a you know a new sort of

play05:31

fine-tuned version of llama 2 and you're

play05:34

introducing a lot of new vocabulary

play05:36

let's say llama 2 doesn't have a lot of

play05:38

exposure to say

play05:41

um Hospital handwashing policies I'm

play05:44

just I'm thinking of something

play05:45

completely random and you wanted to

play05:47

introduce all the vocabulary around that

play05:50

to the model you could do a step of

play05:52

continuous pre-training or parameter

play05:54

efficient fuity uh introducing all of

play05:57

this new information into the model

play05:59

introducing that new vocabulary those

play06:01

new tokens to the model and then also

play06:04

further enhance it with retrieval

play06:05

augmented generation but the way that I

play06:08

like to think about it and C I'm going

play06:10

to run this by you so tell me what you

play06:11

think if this is because we could

play06:13

probably start saying this to customers

play06:14

if it works but the way that I've been

play06:17

thinking about it is retrievable

play06:19

augmented generation is like having a

play06:21

diary where I record what I had for

play06:23

lunch every day whereas pre-training is

play06:27

I'll I'll remember all the good meals

play06:29

that I had in my life and I can remember

play06:31

oh I like I like pizza I like pasta I

play06:34

like these things but if you were to ask

play06:36

me what I had on December 25th of

play06:41

1996 I would not be able to tell you uh

play06:45

whereas if I go and look it up in my

play06:46

diary I could tell you precisely oh I

play06:48

had you know fig pudding and ham and all

play06:51

this stuff does that make any sense no

play06:53

and I I like that a lot because it's

play06:55

it's um it's all it's the specifics

play06:58

versus the summarization right it's kind

play07:00

of where you end up but yeah know I

play07:03

think that makes a lot of sense and you

play07:04

can get very specific with the rag stuff

play07:06

and and we've seen it with customers

play07:07

already the the the real kind of

play07:10

question that I think we face with

play07:12

customers is with a combination of

play07:14

prompt engineering and a combination of

play07:16

retrieval augmented generation I feel

play07:18

like we're able to solve close to 90% of

play07:21

the use cases that our customers come to

play07:23

us with but that last 10% is always very

play07:29

resting and we can sometimes eek out

play07:32

another 5% just by doing additional

play07:35

prompt engineering and then there are

play07:37

some Advanced Techniques for retrieval

play07:38

AUM into generation things like hide

play07:41

which is um creating a hypothetical

play07:44

document from other documents and

play07:45

combining sources

play07:48

but I think 90% in many use cases is

play07:52

actually sufficient and that's sort of

play07:55

been my operating thesis so far there

play07:58

are of course exceptions where that's

play08:00

not the case and I think we've been

play08:02

proving that out right I think customers

play08:03

have come to us thinking like hey we we

play08:05

need to we need to build our own model

play08:06

deploy our own model run our own model

play08:08

and you know we talk them off that ledge

play08:10

we like hey like just give us a couple

play08:11

weeks let let us try to do it with Rag

play08:13

and and prompt engineering and it and it

play08:15

works pretty well and the way you know

play08:18

one of the ways i' I've explained to the

play08:19

customers is that you know using just

play08:21

prompt engineering and just reg you're

play08:23

using these things out of the box right

play08:25

and so when the next one comes out it's

play08:27

a lot easier to upgrade and move versus

play08:29

if you're doing all this tweaking and

play08:30

tuning and running it yourself you've

play08:32

got a lot of a you've got a lot bigger

play08:33

lift to to upgrade or hey let me just

play08:35

try this new model out real quick and

play08:37

see what it does um you know you can do

play08:40

that with Rag and prompt engineering you

play08:41

once you start F tuning it's going to be

play08:43

a more expensive you know quick little

play08:46

trial to see if it's going to work

play08:47

better or not and there there are a

play08:49

couple of different specific ways of

play08:51

going about retrieval onment to

play08:53

generation so at its core you can

play08:55

imagine you know the very simplest form

play08:58

of retrieval augment to generation would

play09:00

be to take a single document and to

play09:02

parse it and create embeddings from it

play09:05

those embeddings would be sort of the

play09:06

tokens that represent the document and

play09:09

store it in some sort of uh embedding

play09:12

search

play09:13

space and then you would take the users

play09:17

prompt that you were going to use to

play09:20

create an inference from the large

play09:21

language model you would use that prompt

play09:24

to search the embedding space that you

play09:27

have just created and return relevant

play09:29

pieces of that file uh and instead of

play09:34

returning the embeddings raw you're

play09:35

returning text from that document or you

play09:38

know it doesn't necessarily have to be

play09:40

text but we're primarily talking about

play09:41

large language models so we'll imagine

play09:44

it's all text for now you're returning

play09:46

the the relevant sections of that

play09:49

document putting it into the context of

play09:52

the inference and saying hey answer

play09:53

these questions um and to your point

play09:56

earlier as the context becomes larger

play09:58

and larger there are different tradeoffs

play10:01

to be made um for instance the Llama 2

play10:05

models while they are quite efficient

play10:08

and have a a very great sort of um

play10:11

operational cost a great per token cost

play10:14

it's much lower than that of some of the

play10:16

other proprietary

play10:18

models the context size is only 8K uh

play10:23

and then 4K in some cases so there's not

play10:26

so much context that you can put into to

play10:29

the model at inference time which in the

play10:33

the way that that impacts things is if

play10:35

you're using a llama 2 or a model or a

play10:38

llama 2 variant something like a mistol

play10:40

or a mixol you would then optimize your

play10:44

retrieval rather than optimizing the llm

play10:46

so you would optimize your retrieval by

play10:48

doing additional indexing and making

play10:50

sure the chunks that you're returning

play10:52

are small but uh relevant to the query

play10:56

that the user is making whereas uh and

play10:59

this is actually always a good thing to

play11:01

do always optimize your retrieval but

play11:03

the impetus or the motivation to

play11:05

optimize your retrieval is actually

play11:06

lower the larger the context window uh

play11:10

as long as cost is not a concern because

play11:11

keep in mind you're paying for these

play11:13

models in a in a per cocen way and I

play11:16

think that's an important distinction is

play11:18

that as you change your model not only

play11:21

does your prompt engineering change your

play11:24

method of retrieval changes as well and

play11:26

then the techniques that you're using

play11:27

for retrieval uh and even the techniques

play11:30

you're using for storage and search

play11:34

change yeah and there's um there's

play11:37

another consideration there too like yes

play11:40

cost should drive like how how many

play11:43

results you going to return is it going

play11:44

to be three is it going to be 10 but you

play11:45

also have to know what your data is

play11:47

because it may be that you know in the

play11:48

employee handbook if that's all you

play11:50

vectorize and you put in the database

play11:51

and you're going to ask about PTO well

play11:53

there's one section in the employee

play11:55

handbook about PTO if you're going to

play11:56

return 10 different results you're going

play11:58

to start getting things that don't have

play11:59

the similarity that you need and you're

play12:01

going to start summarizing the wrong

play12:02

things and so there is some data science

play12:05

data engineering around like what am I

play12:06

indexing and based on the questions I'm

play12:09

asking like are there going to be 10

play12:11

good similar results that I expect to

play12:13

come back or is there really just one or

play12:14

two and then I don't have that much data

play12:16

and so you also to think about you know

play12:19

what's going to be relevant you know

play12:20

don't just I've got a huge context

play12:21

window let's return it all like well

play12:23

maybe not right because then you're

play12:24

going be summarizing things that don't

play12:25

make sense and there's also you know

play12:28

they've done an anes on the models and

play12:31

the the ability to retrieve from the

play12:34

context is not perfect so you can do

play12:36

what's called a needle and a hstack

play12:37

search of the context that you're

play12:39

providing where you can provide you know

play12:41

the claw 2 models or the open AI chat

play12:45

GPT models you know huge huge amounts of

play12:48

context and then say return this one

play12:51

obscure fact from the context and uh it

play12:55

has recency bias so things that are very

play12:57

very you know close to to the bottom of

play12:59

the context or very close to the top of

play13:01

the context are more likely to be

play13:04

returned than stuff that's in the middle

play13:06

um and then it does lead to different

play13:09

kinds of optimization techniques to your

play13:11

point and I I think that's interesting

play13:14

and it's something that we've definitely

play13:16

developed quite a bit of experience with

play13:18

over the last you know year now of of

play13:21

messing around with these systems and

play13:23

and building production quality

play13:24

applications for them but we're also

play13:29

starting to do work well we we have been

play13:32

doing work not just with semantic

play13:34

retrieval which is the the embeddings

play13:37

and the searching we've been doing uh

play13:39

retrieval in a non-semantic fashion as

play13:43

well so basically translating say

play13:45

natural language input into a structured

play13:48

query language and using the results

play13:51

that it returned from that SQL query to

play13:53

then infer an additional answer we built

play13:56

this uh project called the uh reinvent

play14:01

session concierge where we took all of

play14:04

the and this is in our public GitHub

play14:05

repo and uh for those of you who may be

play14:07

listening I I might show this a little

play14:09

bit later but so you can go watch the

play14:11

Youtube video but I'll also just post a

play14:15

link to the code as well but the way

play14:17

that this works is that we

play14:19

will basically go out and query the

play14:23

postgress database that we created of

play14:25

all of this um all of these reinvent

play14:29

and we'll query it using natural

play14:30

language and we'll have the clog 2 model

play14:33

translate that natural language into AQL

play14:35

query get the results feed those into

play14:38

the context of the response and say okay

play14:44

you know clawed model please now

play14:46

synthesize a response based on the

play14:48

user's original question so that's a

play14:51

very interesting use case and uh you

play14:54

know we chose to use postest for that

play14:57

which is something we can talk about a

play14:58

little bit later but um there are a lot

play15:01

of other options too open search for

play15:03

example now allows for for hybrid

play15:06

queries so you can do both semantic and

play15:08

lexical search

play15:09

simultaneously uh and you can normalize

play15:12

the scores between them as well which

play15:13

can be really really powerful and I know

play15:15

cleton that you've done quite a bit of

play15:17

work with open search so far and maybe

play15:19

you could tell us a little bit about

play15:20

that yeah no I mean I do want to touch

play15:23

on one thing on your non Vector

play15:24

retrieval stuff because I um I think

play15:27

it's been one of the more surprising

play15:29

things that that we you know we've got

play15:31

50 PC's or so under our belt and and you

play15:34

know we've started having to categorize

play15:36

them as you know you know rag non rag

play15:38

but then within rag is it Vector search

play15:40

or is it not and I think a lot of people

play15:42

jump straight to like oh hey the the rag

play15:45

concept like you use a vector database

play15:46

you do this and and it's important to

play15:49

try to figure out when you use one and

play15:50

when you don't and and you know

play15:52

structured data just doesn't go well

play15:54

into a into a vector store and so you

play15:57

know you mentioned postgress but we've

play15:59

done things where customers are giving

play16:00

us csvs of file or of of data and you

play16:03

know we're using something like P spark

play16:05

to help the LM generate a query and

play16:07

query directly in a CSV because it's

play16:09

structured right just because it's not

play16:10

in a database doesn't mean you need to

play16:12

vectorize it right and so even like you

play16:14

know customers providing us like hey you

play16:16

know we've got these four or five Excel

play16:18

sheets we'd like you to to do ragon you

play16:20

know put them into Kindra it's like well

play16:22

no i' I don't think we need to put them

play16:23

into Kinder I think we just need to you

play16:25

know build you know train the model to

play16:27

know which one to look at and then query

play16:29

them with with pypar and so you know

play16:31

there's there's a lot of non Vector

play16:32

retrieval stuff you can do especially

play16:34

with with structured data but um but

play16:37

yeah back to open search like I I

play16:38

thought I think open search is great I

play16:40

you know um same with PG Vector they

play16:43

give you a lot of flexibility and a lot

play16:45

of customization and so once you get

play16:47

past the POC stage you know you can do a

play16:49

lot where you're doing Vector queries

play16:51

but then you can put a ton of metadata

play16:52

around it to know exactly what you um

play16:55

you know is it is it the text you're

play16:57

doing do you need a link to where that

play16:58

that document actually came from you

play17:00

know we're doing run right now where

play17:02

there's a a question and answer and so

play17:05

we're talking about vectorizing the

play17:06

question so you can search based on the

play17:08

question but then as part of the

play17:10

metadata you're pulling back the answer

play17:12

because that's what you want to

play17:13

summarize and so you know you're not

play17:15

just chunking out the data you're

play17:16

actually doing it in a very specific way

play17:19

just because you know what data you have

play17:20

and what you're expecting to ask and

play17:22

what you're expecting to get back so

play17:24

there's a lot of flexibility when you

play17:25

when you start using you know your own

play17:27

Vector stores

play17:29

what's fascinating to me is that we've

play17:31

sort of made this transition of forcing

play17:35

ourselves to learn a new query language

play17:37

every time into basically using English

play17:40

as the query language and forcing

play17:42

ourselves to teach in the context window

play17:44

of the prompt this is how you can

play17:46

interpret these different things that a

play17:49

user might ask and translate that into

play17:51

this sort of query um and again I know

play17:55

that we don't have uh a ton of people

play17:57

who will will be watching the screen but

play18:00

this is what I'm showing on screen right

play18:02

now as an example of one of the prompts

play18:04

that we used in the reinvent session

play18:06

Navigator and we we store these prompts

play18:09

in Dynamo DB so that we can version them

play18:12

uh by date and time and so if we improve

play18:14

one we can ab test it and go back but

play18:17

you can see uh this is a clog prompt and

play18:19

what it's saying is given an input

play18:21

question use postgressql syntax to

play18:24

generate a syntactically correct

play18:26

postgress SQL query from the following

play18:28

table session uh this table represents

play18:32

events for a conference the table schema

play18:33

will be contained within schema which

play18:35

means I'm going to pass in the schema at

play18:38

runtime because I'm going to derive that

play18:40

and cache it uh as we change the schema

play18:43

over time and then it says the query

play18:46

should be read only write a query in

play18:48

between uh SQL uh XML tags and you know

play18:53

then I say in all caps and and that's

play18:56

the craziest thing about uh some of the

play18:58

these models is that they respond to

play19:01

emotional language and they respond to

play19:02

all caps in a way that they do not

play19:05

respond to uh uh non all caps or or non

play19:10

emotional language so important to note

play19:13

all fields that you include in the wear

play19:15

Clause should also be included in the

play19:16

select Clause the reason for this is

play19:18

that when we return the results we want

play19:22

to be able to render it into like a

play19:23

pandas data frame or into a table or

play19:25

something rather than just getting a raw

play19:29

answer with no context uh except the

play19:31

embedding field uh don't include this in

play19:34

the select Clause this will Aid in the

play19:36

generation of the result and then we

play19:38

annotate the schema so we we have

play19:41

another important note here where we're

play19:43

walking through how to use PG vector and

play19:45

how to use the embeddings and the Syntax

play19:48

for the embeddings the only reason we

play19:50

include this is that it may not be in

play19:53

the base models knowledge set because it

play19:55

is a fairly new extension well I mean PG

play19:58

Vector is not a fairly new extension but

play20:00

some of the syntax uh is new and it's

play20:03

really taken off recently so we want to

play20:04

reinforce that context of the model uh

play20:08

and then we you know pass in the schema

play20:12

and then we pass in a number of example

play20:14

questions so you know one example

play20:16

question is how many sessions are in the

play20:18

Venetian and then we pass in the the

play20:20

example generated query and this is an

play20:23

example of prompt engineering where

play20:24

we're saying select count from session

play20:26

where venue name I like Phoenician um

play20:29

and we can do other more complex queries

play20:32

and we try to give it a a diverse set of

play20:36

things we try to introduce the idea that

play20:40

queries should use ores instead of ands

play20:43

so they should be as inclusive as

play20:45

possible as opposed to being as

play20:47

exclusive as possible because we want to

play20:49

return results this may differ for other

play20:51

applications that we build but in the

play20:53

case of the reinvent session coners

play20:54

we're trying to return uh as many

play20:56

relevant results as possible

play20:59

and then we you know if we go and we

play21:02

look at the schema um we annotate the

play21:06

table schema so we we don't and i' I've

play21:09

switched to a different uh Dynamo DB

play21:11

entry now where I've walked through hey

play21:15

this is the schema of the table and

play21:17

these are all the fields and this is

play21:20

what they are talking about and I give

play21:22

the type of it you know a tag topic is

play21:24

an array of strings topics covered in

play21:27

session for example

play21:28

and you know uh the session type is a

play21:31

string or it's an enum these are all

play21:34

really really powerful prompt

play21:36

engineering techniques that can

play21:37

drastically enhance the results and then

play21:40

you can put in an additional important

play21:41

notes for instance I say session ID

play21:45

string almost always return this field

play21:48

that is a a common kind of technique

play21:51

that we can use in The annotation of

play21:53

this retrieval augmented generation to

play21:55

say no matter what query you're running

play21:58

running this is relevant information

play22:00

that we need even if we don't

play22:01

necessarily end up rendering it to the

play22:03

end User it's still something that we

play22:05

can use uh on our side to to kind of do

play22:08

analytics and things like that uh but I

play22:11

I like walking through this example

play22:13

because I think it's illustrative of

play22:15

taking natural language and querying a

play22:18

non Vector store uh it's also been

play22:20

enhanced a little bit with an in Vector

play22:22

store at the bottom here in terms of

play22:23

embeddings and that gives you that combo

play22:27

of both traditional search uh lexical

play22:29

search and the vector search and we've

play22:33

really seen this sort of stuff start to

play22:36

take off within our customers I I mean

play22:39

to Clayton's point just a moment ago

play22:42

there there have been some very

play22:43

interesting techniques that have been

play22:45

developed and uh there's a lot of

play22:48

tooling now as well but I think you know

play22:52

now that we've sort of explored the use

play22:53

cases we've explored some of the

play22:56

techniques for optimization and some of

play22:57

the techniques

play22:58

for the prompt engineering side of

play23:00

things I think it's time to take a step

play23:02

back and and go back to what uh

play23:05

embeddings are and why we use them and

play23:07

some of the tradeoffs that come with

play23:10

embeddings talking with someone today

play23:12

actually about embeddings and Incredibly

play23:15

powerful um but one of the things that

play23:18

I've noticed is that all the tooling out

play23:21

there like Kindra or Lang chain make it

play23:23

so people don't actually see the

play23:24

embeddings or use the embeddings anymore

play23:26

they're kind of like they're they're

play23:28

abstracted away right um so I think it's

play23:30

important to understand what embeddings

play23:32

are and why we use them and and the

play23:34

power they bring because like I mean the

play23:36

example you were showing right isn't

play23:38

like you didn't populate that with Lang

play23:39

chain that was very much more manual

play23:41

right and so like once you understand

play23:43

that you can get a you can you can do

play23:46

right 2.0 almost right like because you

play23:48

can get a lot more advanced with it and

play23:50

I think that's a very important point

play23:51

though Clayton is what we did there is a

play23:55

manual implementation of retrieval

play23:57

augmented generation but tools like Lang

play23:59

chain do make it much easier and uh that

play24:03

comes at the cost of flexibility I would

play24:05

actually advise anyone listening to go

play24:08

and Implement retriev augmented

play24:10

generation manually you know using

play24:12

whatever programming language you want

play24:14

do that at least once that way you will

play24:17

really develop the kind of deeper

play24:19

understanding of what's

play24:21

happening because Lane chain extrap you

play24:24

know it it it takes away a lot of the

play24:26

complexity and you you some times Miss

play24:29

exactly what's happening under the hood

play24:31

yeah with a vector store I think is

play24:32

important right like not just a database

play24:34

but doing it with with a vector store

play24:36

and and embedding your own data and like

play24:39

so um there's some blog posts out on the

play24:41

can website where I took uh I think OSHA

play24:43

data right just publicly accessible data

play24:45

and made like a chat bot for it right

play24:47

just to just to try it out and so if you

play24:49

don't have data just go find a data set

play24:50

out online and use that as your data

play24:53

set exactly so when it comes to

play24:56

embeddings there we have a couple

play24:58

choices within the AWS ecosystem um you

play25:01

know what I've seen used most is

play25:04

actually the hugging face embeddings um

play25:06

there's also the open AI uh I think

play25:09

they're called the ada2 embeddings um

play25:13

but within Bedrock you have the choice

play25:15

of the Titan embeddings and these are

play25:17

multimodal embeddings as well so that

play25:19

means it can take both image data and

play25:22

Text data and a vector similarity would

play25:24

be the same so if you were to store the

play25:26

text goldfish for example and a picture

play25:28

of a goldfish they would both correspond

play25:30

to the same rough Vector that's really

play25:33

cool and exciting and it unlocks a lot

play25:35

of um interesting search

play25:38

techniques uh but keeping it kind of

play25:42

focused on primarily large language

play25:45

models our experiments have shown that

play25:47

the cohere embeddings are some of the

play25:51

best and the reasoning for this is uh a

play25:54

number a number of different things so

play25:57

first first of all uh there is the size

play26:01

of the the output Vector so the Titan

play26:04

embeddings for example will output a

play26:05

vector of the size 1536 which means

play26:08

there's 1 1536 numbers um in the array

play26:12

that is the vector representing the

play26:13

output of up to an AK context of the

play26:16

Titan embedding

play26:18

now the problem with this is that it

play26:21

doesn't often prior to gen existing the

play26:25

vector search databases were not

play26:27

optimized for large embeddings of that

play26:29

size or large vectors of that size they

play26:31

were actually optimized for much smaller

play26:33

vectors um that problem is going away as

play26:37

as people realize the the value and the

play26:39

capabilities of slightly larger vectors

play26:41

but the um coher embeddings are

play26:43

outputting both 512 and one24 variants

play26:47

which are more searchable uh we found

play26:50

and then the hugging face outputs are

play26:52

512 and like 700 or something uh and

play26:55

then the open AI ones are also one24 so

play26:57

these these smaller embeddings um these

play27:00

smaller output vectors we've actually

play27:02

found that the performance of the

play27:03

smaller output vectors is higher so when

play27:06

you get better semantic similarity

play27:08

search and there's a lot of common

play27:11

benchmarks and techniques that you know

play27:14

you can go read blog posts and and

play27:15

academic papers about all of those but

play27:18

all of the academic research currently

play27:21

backs up that situation that the smaller

play27:23

vectors uh are are better for search I

play27:28

expect that to change so I expect as the

play27:30

hardware improves and as the the large

play27:32

Vector search support improves that will

play27:35

not necessarily be the case and so as

play27:38

you generate these embeddings you know

play27:40

you have to store them somewhere

play27:42

and like I said uh there's a secondary

play27:46

point so we've talked about the size of

play27:47

the vector as the first point but

play27:49

there's also a secondary point which is

play27:51

the quality of the vector

play27:54

so um there's there's two things to

play27:57

break down the quality of the vector

play27:59

here as well and that is the set of

play28:01

input tokens that go in uh if a model

play28:05

was trained primarily on English for

play28:07

example and had a much smaller

play28:08

representation of other scripts or other

play28:11

languages then it may count additional

play28:15

tokens in a lower quality lower

play28:18

correctness lower Precision Vector as

play28:20

output if you were to put the same uh

play28:23

input data in so like if you were to say

play28:26

uh you know flower in English and then

play28:30

Bloomin or whatever in German you may

play28:33

not get the same Vector depending on how

play28:36

that embedding model was trained uh

play28:39

we've actually seen this manifest with

play28:41

some of our customers we had a customer

play28:43

that was parsing a lot of

play28:46

um uh East Asian languages texts things

play28:48

like uh everything from hindii to

play28:52

Mandarin to cantones to to Japanese and

play28:55

and a little bit of Russian language as

play28:56

well so this customer was parsing all of

play28:58

these scripts that were not well

play29:00

represented or not as well represented

play29:03

in the embedding that we were using

play29:05

prior to switching to coh here and we

play29:07

were also getting charge for more tokens

play29:09

for the non-english script so for

play29:11

example the word hello in English would

play29:14

charge us only one token and it would be

play29:16

28 tokens if we did the same thing in

play29:18

Hindi so this caused us to look at some

play29:22

different embedding models and we

play29:24

arrived at the coher model and we found

play29:26

the output and the performance of that

play29:27

was actually excellent so we've been

play29:30

using cohere um for a couple of

play29:32

different customers for the embedding

play29:34

side now uh most of that still I would

play29:38

say in the the proof of concept phase

play29:40

although because once you generate all

play29:42

of these embeddings you're you're going

play29:44

to pay a fee to basically take your

play29:46

entire data set send it into these

play29:48

embedding models and then uh store the

play29:51

output vectors so customers have been

play29:54

still exploring and and a little bit

play29:56

hesitant to go and index everything in

play29:58

the whole wide world uh that said you

play30:01

know we're we're in January of 2024 now

play30:04

and we are starting to see customers

play30:05

move into production with that which is

play30:07

exciting because it's um it's really

play30:10

letting us play with Vector storage at

play30:13

scale um have have you seen this Clayton

play30:16

the the kind of differences in

play30:19

embeddings and then also I guess the

play30:21

Comon the comparison to

play30:23

Kendra yeah we hav't done a ton of uh

play30:27

trial and athm Bets right like we've

play30:29

done a ton of PC's right it's usually

play30:31

pick one and get it to work and then you

play30:32

come back and you can try different

play30:33

things um we've done a lot with KRA um

play30:37

KRA is almost uh it's it's like we

play30:39

mentioned with Lang chain uh it

play30:41

abstracts a lot of the things away from

play30:43

you um I you know early on I was not a

play30:46

fan of Kendra but I have since become

play30:48

more of a Kendra Fanboy as as time has

play30:50

gone on just it it makes I mean we

play30:53

talking about customers and Vector

play30:55

stores and and pcc's and showing the

play30:57

power of of of

play30:58

geni Kindra is is the quickest and

play31:01

easiest way to do it it does come at a

play31:04

little bit higher cost than some of the

play31:05

other things but like you it's got so

play31:07

many connectors you just point it at

play31:09

something and say go it's going to index

play31:10

it for you um it's you know the

play31:12

retrieval is all done via an API clear

play31:15

via an API it's all like it's it's magic

play31:18

right but it once again it's out of the

play31:19

box um which has its pros and cons right

play31:23

you know the things that you were

play31:23

showing with the uh with the the the

play31:27

reinvent search like that does not work

play31:29

with KRA you could like your tool would

play31:31

be much worse if you used Kindra um

play31:34

however I would argue that if you

play31:35

started with Kindra kind of as a proof

play31:37

of concept you would have been able to

play31:38

see that like hey this is going to work

play31:41

here are the next steps we need to take

play31:42

to make this better and that's that's

play31:44

what we're seeing a lot with customers

play31:45

right like you know let's start with

play31:46

Kindra let's get it done in a week or

play31:47

two weeks um and then now we understand

play31:50

your data we understand what you're

play31:51

trying to do and we can better say like

play31:53

hey I think these are the steps you need

play31:55

to take to get to the the MVP P of what

play31:57

you're trying to do next right to get it

play31:59

in front of clients or whoever your

play32:00

customers are going to be I agree

play32:02

entirely and and that I I took the exact

play32:04

same path that you took Clayton is I was

play32:06

at first thinking oh man Kindra you know

play32:09

it's it's so expensive I don't have a

play32:11

ton of control but the ease of use of

play32:15

being able to just turn it on crawl some

play32:17

documents I don't have to worry about

play32:19

embeddings I don't have to worry about

play32:21

uh the retrieval queries or or

play32:23

annotation of the schema or you know

play32:26

continuous crawling anything Kindra just

play32:28

manages all of that for me and that

play32:31

really does let me get started a lot

play32:34

faster I if I had to predict the way

play32:37

most customers will evolve they may

play32:39

start with KRA just to do the proof of

play32:42

concept as you're saying and then evolve

play32:45

into multiple other data stores that

play32:47

allow for customization of the retrieval

play32:51

um and and kind of reach these multi-

play32:53

retrieval scenarios that that's my

play32:55

prediction I'm wrong all the time though

play32:57

so we'll see what happens um I agree I

play33:00

like I I think one of the one of the

play33:02

biggest learnings we've had going

play33:03

through these pcc's is that um it it's

play33:08

not like vector store isn't the answer

play33:10

to everything with rag right and I I

play33:11

think going into these we thought uh we

play33:14

thought Vector store like was like

play33:17

everything's got to be Vector Store

play33:18

everything's going to be similarity

play33:19

search and so we we we we went for that

play33:22

um but as we're going through these

play33:24

things we're seeing like hey you know

play33:25

your data is structured or semi

play33:27

structured um you know let's try to

play33:29

let's try to be more creative at how

play33:31

we're getting that data out or hey some

play33:33

of your data is structured and some of

play33:34

it's not so instead of shoving

play33:35

everything into Kindra let's shove some

play33:37

of it into Kindra but let's also take

play33:39

some of it and and try to just cleer it

play33:41

directly and so like TI your M to

play33:43

retrieval point like I I do definitely

play33:45

think that's the way things are going to

play33:46

head um as people understand Vector

play33:48

stores and the data that they that they

play33:50

need to answer these questions they're

play33:52

try to ask

[33:53] Absolutely. So I want to walk through some of the algorithmic and index differences very briefly, then I think we can talk about all the different options and data stores and how they implement these things, and then we'll close everything out. The core algorithms we're using right now are Euclidean distance, which is the traditional search for vectors, and then things like inner product and cosine distance or cosine similarity. The way I tell people to think about the different versions of these algorithms is that some of them work on non-normalized data, meaning they take into account both the amplitude and the direction, like a true vector, while others really only care about the direction and less about the amplitude. Now, you can sometimes use different algorithms on the same kind of index, but I think cosine similarity is really what a lot of the industry has arrived at in terms of how they're searching.
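To make the amplitude-versus-direction distinction concrete, here is a small numpy sketch of the three metrics mentioned, with illustrative values only:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the amplitude

euclidean = np.linalg.norm(a - b)            # sensitive to amplitude: ~3.74
inner_product = float(np.dot(a, b))          # also grows with amplitude: 28.0
cosine_similarity = float(
    np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
)                                            # direction only: 1.0 (identical direction)

# On vectors normalized to unit length, inner product and cosine similarity coincide,
# which is one reason many embedding models ship normalized outputs.
```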

[35:18] And then in terms of indexing: people used to use IVFFlat, which is like an inverted flat file. It's basically an inverted index, a very common full-text search technique that's been around for a long time. Another technique people have been using more recently, although the technique itself is actually quite old, I think from the 60s or 70s, is a statistical technique called k-nearest neighbor. Optimized versions of that, like approximate k-nearest neighbor, are how a lot of indexes are being searched these days, and there's a lot of tuning you can do with that algorithm and that index. If you have a good understanding of what your underlying data is, or if you've done some statistical analysis on your data, a k-nearest neighbor search can be extremely efficient, extremely powerful, and accurate. Then more recently, I think within the last couple of years, people have started using HNSW, which I think stands for Hierarchical Navigable Small World (someone can correct me in the comments, because I can't remember all the definitions). HNSW is a really powerful search implementation that is, in my opinion, a much better version of k-nearest neighbor for generating correct vector search results and semantic results.
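For a feel of what an approximate nearest-neighbor index looks like in code, here is a minimal sketch using the hnswlib library; the dimensions, parameters, and random data are stand-ins, not recommendations:

```python
import numpy as np
import hnswlib

dim, n = 384, 10_000
embeddings = np.random.rand(n, dim).astype(np.float32)  # stand-in for real document embeddings

# Build an HNSW index using cosine distance.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(embeddings, ids=np.arange(n))

# ef controls search breadth: higher is more accurate but slower.
index.set_ef(64)

query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=5)  # approximate top-5 nearest neighbors
```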

[36:52] Now, HNSW was not available in Postgres prior to, I think, 2023. I believe 2023 is when HNSW got added to pgvector, and the way that came about is, in my opinion, a very interesting story. Supabase and AWS collaborated with the person who maintains pgvector (pardon me, I can't remember their name), and they worked together to get an HNSW implementation into Postgres, which really made Postgres sing in terms of vector search performance.
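Here is a hedged sketch of what that looks like in practice with pgvector from Python; the connection string, table, and vector size are placeholders:

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # hypothetical connection string
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id bigserial PRIMARY KEY,
        content text,
        embedding vector(384)
    );
""")
# HNSW index (available in pgvector 0.5.0+, released in 2023) using cosine distance.
cur.execute("""
    CREATE INDEX IF NOT EXISTS documents_embedding_hnsw
    ON documents USING hnsw (embedding vector_cosine_ops);
""")
conn.commit()

# Query: <=> is pgvector's cosine distance operator; lower distance means more similar.
query_embedding = [0.0] * 384  # stand-in for a real query embedding
vector_literal = "[" + ",".join(map(str, query_embedding)) + "]"
cur.execute(
    "SELECT content FROM documents ORDER BY embedding <=> %s::vector LIMIT 5;",
    (vector_literal,),
)
top_chunks = [row[0] for row in cur.fetchall()]
```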

[37:31] OpenSearch also got a lot of really great additions in the ability to tune these k-nearest neighbor indexes. So if you think about it: Pinecone is a specialized vector store that uses HNSW, and I think it also has a couple of other options. pgvector has HNSW and IVFFlat, plus a couple of other tuning and size options. OpenSearch uses k-nearest neighbor. With Kendra we don't really know what it's using, although from our experience it seems pretty clear it's using some form of k-nearest neighbor; that's abstracted away from you, so you don't have to care what the underlying algorithm is. MongoDB also uses k-nearest neighbor, and one of the interesting differences between MongoDB Atlas's implementation of vector search and Amazon DocumentDB's implementation is that DocumentDB uses HNSW, so it actually outperforms the KNN search from MongoDB. Honestly, that's one of the few times I've seen DocumentDB really get something right where MongoDB might have gone down the wrong path, so I'm very interested and excited to see how that develops. Now that we've covered the stores and their algorithms, and by the way that was not by any means an exhaustive covering of all the different algorithms and techniques, there are a lot of different things out there.

[39:15] So, after covering all of those: with GenAI in general it's a lot of trial, error, test, repeat. Do you recommend something similar for choosing the way you index stuff in different stores? Do you recommend trying a couple, or, more specifically for the way you do your index searching, is there a right, is there a wrong, or do you just test a few of them?

[39:40] I think that's a great question, Clayton. The way that I think of it is access-pattern-driven development. If I know in advance what my access patterns are going to be, I can plan for them. In the ideal world, if everything is going perfectly, which in our business it never does, then prior to getting started I know all of my access patterns and I know the statistical and telemetry information about all of my data: how often I'm adding new data, what the size of my data is, what the average chunk size is, what the average rate of change is, all of those sorts of things. If I have all of that information, I can mathematically derive the correct index and the correct technique. Since that has never actually happened in real life in the history of consulting, what we typically do, if we have the data available, is some statistical analysis of the data; we can do things like n-grams, and there are all kinds of other techniques that our data science team would be better equipped to talk through. But there are ways we can say: oh, this data set is going to be very amenable to approximate k-nearest neighbor, or this data set is actually perfect for HNSW with this kernel size and this amount of top-k and top-p. Those are just parameters that you can put in. So that's often where we'll start. Well, no, actually, often what we start with is Kendra, because then we don't have to care about any of that and we get to see some of the access patterns of real-world usage. Then we'll move to a KNN- or HNSW-style index in a traditional data store, and sometimes it depends on the customer's existing data store; you don't want to introduce a brand new database into your architecture just for the purposes of vector storage. If you have an existing Postgres deployment, pgvector is probably the way you want to go, because you already have the organizational muscle memory around maintaining and running Postgres. If you have an existing MongoDB cluster, maybe MongoDB's kNN is the way you want to go, because you already have that MongoDB experience. There are exceptions to that rule: greenfield projects will often pull in a brand new database, just because everything is being created new and there isn't a ton of existing stuff to work around. But my personal preference, and this is not a Caylent-endorsed opinion, this is just a Randall opinion: I always do pgvector, then I do access-pattern-driven development, I look at the slow query log, and I adjust the HNSW index based on the slow query log.
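As a rough illustration of that tuning loop, here is a hedged sketch of adjusting pgvector's HNSW parameters after reviewing slow queries; the parameter values are placeholders, not recommendations:

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # hypothetical connection string
conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run inside a transaction
cur = conn.cursor()

# Build-time parameters: m (graph connectivity) and ef_construction (candidate list size
# while building). Larger values trade build time and memory for better recall.
cur.execute("""
    CREATE INDEX CONCURRENTLY IF NOT EXISTS documents_embedding_hnsw_v2
    ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 24, ef_construction = 128);
""")

# Query-time knob: hnsw.ef_search is how many candidates are examined per query.
# If the slow query log shows latency headroom but answers are missing context, raise it;
# if tail latency is the problem, lower it.
cur.execute("SET hnsw.ef_search = 100;")
```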

[42:21] Awesome. And then the last question: for people listening, you did tell them to go and do this themselves, right? And when you're doing a POC, there isn't a wrong answer. You can use any of these and they're going to work; one may just work better than another, and it matters more when you get to production. But from a POC standpoint, pick one, use it, move forward, it's still going to work. And I've said this before in other episodes, but I think getting started is an art and optimizing is a science. Getting that canvas from a blank page into... oh, that's a Taylor Swift thing, right? "Blank Space," baby. Yeah, going from "Blank Space," baby, into a real-world, this-is-a-thing-that's-working state means you can then go and optimize, and you can programmatically optimize, which means you can run an A/B test and compare the full thing. That's why telemetry in your applications is so important: if you're only measuring and optimizing for retrieval, you could go and make the best retrieval in the world, but your responses on the generative side still suck. Your retrieval may be wicked fast and super performant, and you can put all the documents in the world in it, but the actual output is still incorrect: maybe you're chunking it wrong, maybe something else is wrong. So you really need continuous telemetry from the end-user-generated request, to the LLM, to the retriever, back to the LLM, and back to the end-user result, and then tie that all together. It takes time, and you want to look at your TM95 and TM99 sort of results from both a performance and an accuracy standpoint.
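Here is a hedged sketch of what that end-to-end telemetry could look like: per-stage latencies are recorded for each request and then summarized at the 95th and 99th percentiles. The stage names, helper functions, and in-memory storage are assumptions for the example:

```python
import time
import statistics
from collections import defaultdict
from contextlib import contextmanager

stage_timings = defaultdict(list)  # in-memory store for the example; use a real telemetry backend in production

@contextmanager
def timed(stage: str):
    """Record wall-clock latency for one stage of the RAG pipeline."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_timings[stage].append(time.perf_counter() - start)

def handle_request(question: str) -> str:
    with timed("retrieval"):
        chunks = retrieve_context(question)   # assumed retriever, e.g. Kendra or pgvector
    with timed("generation"):
        answer = call_llm(question, chunks)   # assumed LLM call
    return answer

def percentile(values: list[float], pct: int) -> float:
    # quantiles(..., n=100) returns the 1st through 99th percentile cut points.
    return statistics.quantiles(values, n=100)[pct - 1]

# After collecting real traffic, inspect per-stage tails to see where latency and errors live.
for stage, values in stage_timings.items():
    if len(values) >= 100:
        print(stage, "p95:", percentile(values, 95), "p99:", percentile(values, 99))
```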

[44:10] Yeah, it's fun. I mean, that's the fun part for me: once we have something working and real users are playing around with it, we get to really dive deep into the data and optimize it. That optimization component will sometimes rewrite an entire prompt because we realized, oh, 90% of this goes unused. But yeah, we've really had a great time experimenting with this. And there's one last technique that happens in retrieval augmented generation that can be very helpful, and it's possible in both Kendra and OpenSearch: it's called reward modeling. Well, it's not really reward modeling, it's more of a filter that's applied after the search; reward modeling is more of a training technique. But you're basically upvoting or downvoting answers and fragments of documents that are returned based on the user's input query, and this can improve the overall semantic search performance over time for all of your users.
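As a rough sketch of that post-search filtering idea, here is a hedged example that boosts or demotes retrieved fragments using accumulated up/down votes before they go into the prompt; the vote store and the weighting are invented for illustration:

```python
from collections import defaultdict

# Accumulated user feedback per document fragment; a real system would persist this.
votes: dict[str, int] = defaultdict(int)

def record_feedback(fragment_id: str, helpful: bool) -> None:
    votes[fragment_id] += 1 if helpful else -1

def rerank(results: list[dict], top_k: int = 5) -> list[dict]:
    """Re-order retriever results by blending the search score with user feedback.

    Each result is assumed to look like {"id": ..., "score": ..., "content": ...}.
    """
    def adjusted(result: dict) -> float:
        feedback = votes[result["id"]]
        return result["score"] + 0.05 * feedback  # small nudge per net vote (arbitrary weight)

    return sorted(results, key=adjusted, reverse=True)[:top_k]
```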

[45:13] We'll talk about that in another episode, because I think reward modeling on both the search side and the training side is worth diving into more deeply. But I just want to close this out by saying there are a lot of strengths to retrieval augmented generation. It is absolutely a technique that you should investigate, and absolutely a technique that is worthwhile learning and understanding. I think some of the challenges that come with it are understanding the costs, both the embedding cost and the chunking, keeping the models up to date, establishing recency in some of the retrieval augmented generation, and then just making sure the prompts and the retrievers have accurate context and that you're not overloading them with irrelevant information. Clayton, what are the pros and cons, or the strengths and challenges, that you've seen in your work with retrieval augmented generation?

[46:09] Yeah, I think the biggest strength is the time to POC: you can go from idea to POC very quickly with retrieval augmented generation. And one of the biggest challenges, which I think was a surprise to me as we've been doing all these POCs, is that you need to think about and architect how you're going to retrieve your data, whether it's vector search, whether you're just going to store a CSV, or whether it's going to be a normal database. The way you retrieve your data matters, because for RAG to work you do have to get the right data into the context window, and shoving everything into a vector store may not guarantee you're getting the right data. So think about your data: is it structured, is it unstructured, to your point how often does it update, and how do you feed that updated data in? There's a lot of architecture around the data. But from a POC standpoint, take a snapshot of it, don't worry about updating the data, just make sure it works, and make sure you're retrieving it in the most efficient way possible to get the correct data into the context window.

[47:10] Amazing. Clayton, thanks again so much for joining us; I always enjoy chatting with you, my friend. And with that, I will close us out. This concludes our episode of Neural NebulAI. We hope we leave you with something to think about as you pursue innovating with AI. If you enjoyed the show, please leave us a rating and review on the platform you're listening or watching on, and don't forget to subscribe so you never miss an episode. Thanks for listening, and we'll see you next time.