RAG for long context LLMs

LangChain
22 Mar 2024 Β· 21:08

Summary

TL;DR: The talk explores the evolving role of Retrieval-Augmented Generation (RAG) in the context of increasing context window sizes in language models. It discusses the phenomenon of 'context stuffing' and its limitations, particularly in retrieving and reasoning over multiple facts within a large context. The speaker presents experiments and analyses that highlight the challenges of retrieving information from the start of the context and suggests that RAG will continue to evolve, potentially moving towards document-centric approaches and incorporating more sophisticated reasoning mechanisms.

Takeaways

  • πŸ“ˆ Context windows for language models (LMs) are increasing, with some proprietary models surpassing the 2 trillion token regime.
  • 🧠 The rise of larger context windows has sparked a debate on the relevance of retrieval-augmented generation (RAG) systems, questioning if they are still necessary.
  • πŸ” RAG involves reasoning and retrieval over chunks of information, typically documents, to ground responses to questions.
  • πŸ“Š Experiments show that as the context window grows, the ability to retrieve and reason about information (needles) decreases, especially for information at the start of the context.
  • πŸ€” The phenomenon of decreased retrieval performance with larger context windows may be due to a recency bias, where the model favors recent tokens over older ones.
  • 🚫 There are concerns about the reliability of long context LMs for retrieval tasks, as they may not guarantee the quality of information retrieval.
  • πŸ’‘ The future of RAG may involve less focus on precise chunking and more on document-centric approaches, using full documents or summaries for retrieval.
  • πŸ”— New indexing methods like multi-representation indexing and hierarchical indexing with Raptor provide interesting alternatives for document-centric RAG systems.
  • ♻️ Iterative RAG systems, which include reasoning on top of retrieval and generation, are becoming more relevant as they provide a more cyclic and self-correcting approach.
  • πŸ” Techniques like question rewriting and web searches can be used to handle questions outside the scope of the retrieval index, offering a fallback for RAG systems.
  • 🌐 The evolution of RAG systems is expected to continue, incorporating long-context embeddings and cyclic flows for improved performance and adaptability.

Q & A

  • What is the main topic of Lance's talk at the San Francisco meetups?

    -The main topic of Lance's talk is whether Retrieval-Augmented Generation (RAG) is becoming obsolete due to the increasing context window sizes of large language models (LLMs).

  • How has the context window size for LLMs changed recently?

    -The context window size for LLMs has been increasing, with state-of-the-art models now able to handle hundreds to thousands of pages of text, as opposed to just dozens of pages a year ago.

  • What is the significance of the 'multi-needle' test conducted by Lance and Greg Cameron?

    -The 'multi-needle' test is designed to pressure test the ability of LLMs to retrieve and reason about multiple facts from a larger context window, challenging the idea that LLMs can effectively replace RAG systems.

  • What did the analysis of GPD-4 with different numbers of needles placed in a 120,000 token context window reveal?

    -The analysis revealed that the performance or the percentage of needles retrieved drops with respect to the number of needles, and it also gets worse if the model is asked to reason on those needles.

  • What is the 'recency bias' mentioned in the talk and how does it affect retrieval?

    -The 'recency bias' refers to the tendency of models to focus on recent tokens, which makes retrieval of information from the beginning of the context window more difficult compared to information near the end.

  • What are the three main observations from the analysis of the 'multi-needle' test?

    -The three main observations are: 1) Reasoning is harder than retrieval, 2) More needles make the task more difficult, and 3) Needles towards the start of the context are harder to retrieve than those towards the end.

  • What is the 'document-centric RAG' approach mentioned in the talk?

    -The 'document-centric RAG' approach involves operating on the context of full documents rather than focusing on precise retrieval of document chunks. It uses methods like multi-representation indexing and hierarchical indexing to retrieve the right document for the LLM to generate a response.

  • How does the 'Raptor' paper from Stanford propose to handle questions that require information integration across many documents?

    -The 'Raptor' paper proposes a method where documents are embedded, clustered, and summarized recursively until a single high-level summary for the entire corpus of documents is produced. This summary is used in retrieval for questions that draw information across numerous documents.

  • What is the 'self-RAG' paper and how does it change the RAG paradigm?

    -The 'self-RAG' paper introduces a cyclic flow to the RAG paradigm, where the system grades the relevance of documents, rewrites the question if necessary, and iterates through retrieval and generation stages to improve accuracy and address errors.

  • How does the 'corrective RAG' approach handle questions that are outside the scope of the retriever's index?

    -The 'corrective RAG' approach grades the documents and if they are not relevant, it performs a web search and returns the search results to the LM for final generation, providing a fallback mechanism for out-of-domain questions.

  • What are some key takeaways from Lance's talk regarding the future of RAG systems?

    -Key takeaways include the continued relevance of routing and query analysis, the potential shift towards working with full documents, the use of innovative indexing methods like multi-representation and hierarchical indexing, and the integration of reasoning in the retrieval and generation stages to create more robust and cyclic RAG systems.

Outlines

00:00

πŸ€– Introduction to the Debate on the Relevance of RAG in Large Language Models

The speaker, Lance, introduces the topic of the debate surrounding the relevance of Retrieval-Augmented Generation (RAG) in the context of increasingly large language models (LLMs). He notes the growing size of context windows in LLMs and questions the need for a retrieval system when models can process thousands of pages. Lance discusses the phenomenon of 'context stuffing' and its implications for RAG, highlighting the importance of understanding the limitations and potential of current models in retrieving and reasoning over information.

05:04

πŸ“Š Analysis of GPT-4's Performance in Needle Retrieval

Lance presents an analysis of GPT-4's performance in retrieving 'needles' (specific facts) from a larger context. He explains the methodology of placing needles at various intervals within the context and testing the model's ability to retrieve them. The results show a decrease in retrieval performance as the number of needles increases, and a notable difficulty in retrieving needles placed earlier in the context. Lance also discusses potential reasons for this phenomenon, such as recency bias, and shares insights from others in the field.

10:04

πŸ”„ The Evolution of RAG and the Shift Towards Document-Centric Systems

The speaker discusses the evolution of RAG and the potential shift towards more document-centric systems. He questions the traditional approach of precise chunking and suggests that long context models may change the way we think about RAG. Lance introduces the idea of multi-representation indexing and the Raptor approach for document retrieval, emphasizing the importance of considering full documents and their summaries for efficient information retrieval.

15:04

πŸ’‘ Enhancing RAG with Iterative Reasoning and Adaptive Retrieval

Lance explores the concept of enhancing RAG with iterative reasoning and adaptive retrieval. He introduces the idea of self-RAG, which involves grading the relevance of documents and using this feedback to improve the generation process. The speaker also discusses the potential of using web searches as a fallback for questions outside the scope of the index, thus making RAG systems more robust and adaptable.

20:06

πŸš€ Future Directions for RAG and Large Context Models

In the concluding part, Lance outlines the future directions for RAG and the use of large context models. He emphasizes the continued relevance of query analysis, document-centric indexing, and iterative reasoning in enhancing RAG systems. The speaker also highlights the importance of balancing performance, accuracy, and latency, and suggests that we will likely see more cyclic and self-reflective RAG pipelines as we move towards more sophisticated language models.


Keywords

πŸ’‘Context Windows

In the context of the video, 'context windows' refer to the scope or size of the text that a language model (LM) can consider at one time. As the speaker mentions, these windows are getting larger for LLMs (Large Language Models), which means they can process more text at once. This is significant because it affects the model's ability to retrieve and reason over information, which is a central theme of the video.

πŸ’‘Retrieval-Augmented Generation (RAG)

RAG is a process that combines retrieval of information with the generation of text by a language model. It involves using a retrieval system to find relevant documents or information chunks and then using a language model to generate responses based on that retrieved information. The video explores whether the increasing size of context windows in LMs impacts the need for RAG and its effectiveness.

πŸ’‘Token

A 'token' in the context of language models refers to a basic unit of text, such as a word, phrase, or even a character, that the model uses to understand and generate text. The number of tokens used in pre-training a language model is indicative of the model's size and the amount of data it has been trained on. The video discusses the increase in token usage as a sign of the growing capabilities of language models.

πŸ’‘Needle in a Haystack Challenge

The 'needle in a haystack' challenge is a test used to evaluate the ability of a language model to retrieve specific pieces of information (the 'needles') from a large set of text (the 'haystack'). This is relevant to the video's discussion on the effectiveness of language models with large context windows in performing retrieval tasks. The challenge involves placing specific 'needles' within a text and asking the LM to identify and retrieve them.

πŸ’‘Recency Bias

Recency bias refers to the tendency of a language model to give more weight to tokens near the end of its context, the most recently seen information, when generating text. In the context of the video, this bias can lead the model to focus on information close to the point of generation and overlook information placed earlier in the context window.

πŸ’‘Long Context LMs

Long Context LMs are language models that are designed to handle and process large amounts of text, or a 'long context.' These models are significant because they push the boundaries of what LMs can do, such as processing hundreds to thousands of pages worth of text. The video explores the implications of these models on the future of RAG and information retrieval.

πŸ’‘Multi-Needle Challenge

The 'multi-needle challenge' is an extension of the traditional 'needle in a haystack' challenge, where multiple pieces of information ('needles') are hidden within a larger text ('haystack') and the language model is tasked with retrieving all of them. This challenge is used to test the model's ability to handle complex retrieval tasks involving multiple facts, which is crucial for understanding the capabilities and limitations of LMs in real-world scenarios.

πŸ’‘Reasoning

Reasoning in the context of the video refers to the language model's ability to not only retrieve information but also to process and make inferences from that information. It is a higher-level cognitive function that goes beyond simple retrieval, involving the model's capability to understand and use the retrieved data to generate coherent and logical responses.

πŸ’‘Latency

Latency in the context of the video refers to the delay users experience when interacting with language models, particularly in retrieval and generation tasks. As models get larger and more complex, there is a trade-off between the accuracy and performance of the model and the speed at which it can respond to user inputs.

πŸ’‘Document-Centric RAG

Document-Centric RAG is a retrieval-augmented generation approach that focuses on operating with the context of full documents rather than smaller, more specific chunks of text. This method involves retrieving entire documents that are relevant to a user's query and then using a language model to generate responses based on the full document context.

πŸ’‘Multi-Representation Indexing

Multi-Representation Indexing is a method of document indexing where different representations of a document, such as summaries or descriptive chunks, are created and indexed. This allows for efficient retrieval of the right document based on a user's query, without the need to embed the full document in the index. It simplifies the retrieval process and can improve performance.

Highlights

Context windows are getting larger for LLMs, while proprietary models are now pre-trained on over 2 trillion tokens.

State-of-the-art models a year ago processed 4,000 to 8,000 tokens (dozens of pages); recent models like GPT-4, Claude 3, and Gemini handle hundreds of thousands to a million tokens (hundreds to thousands of pages).

The phenomenon of larger context windows has sparked a debate on whether RAG (retrieval-augmented generation) is becoming obsolete.

RAG involves reasoning and retrieval over chunks of information, typically documents, to ground responses in the retrieved content.

Experiments were conducted with Greg Cameron to pressure test the capabilities of LLMs in multi-needle scenarios, which mimic RAG use cases.

Results show that as the number of needles (facts) increases, the performance of retrieval drops, especially when reasoning is involved.

There is a tendency for models to have better retrieval for information closer to the end of the context window, indicating a potential recency bias.

The talk discusses the limitations of context stuffing in long-context LLMs, emphasizing that there are no retrieval guarantees.

The future of RAG may involve a shift from precise chunking to more document-centric approaches, using full documents or summaries for retrieval.

Multi-representation indexing is introduced as a method for document retrieval, using document summaries for indexing and retrieval, then passing full documents to the LM.

Raptor, a hierarchical document summarization and indexing approach, is presented as a solution for integrating information across many documents.

Self-RAG is a cyclic flow approach that involves grading document relevance and performing question rewriting or further iterations to improve accuracy.

CRAG (Corrective RAG) is a method that uses web searches as a fallback when questions are outside the domain of the retriever.

The talk emphasizes the importance of query analysis, routing, and construction in RAG systems, regardless of the LLM context length.

The future of RAG is likely to see more cyclic flows and document-centric indexing, moving away from a naive prompt-response paradigm.

The discussion on recency bias highlights the need for careful consideration of information retrieval mechanisms in LLMs.

The talk concludes with the assertion that RAG is not dead but will evolve alongside improvements in long context LLMs.

Transcripts

00:04

Hi, this is Lance from LangChain. This is a talk I gave at two recent meetups in San Francisco called "Is RAG Really Dead?" Since a lot of people weren't able to make those meetups, I figured I'd record it and put it on YouTube to see if it's of interest to folks.

We all recognize that context windows are getting larger for LLMs. On the x-axis you can see the tokens used in pre-training, which is of course also getting larger: proprietary models are somewhere over the 2 trillion token regime (we don't quite know where they sit), and it runs all the way down to smaller models like Phi-2, trained on far fewer tokens. But what's really notable is the y-axis: about a year ago, state-of-the-art models were on the order of 4,000 to 8,000 tokens, which is dozens of pages. Then Claude 2 came out with a 200,000 token model, I think late last year, GPT-4 went to 128,000 tokens, which is hundreds of pages, and now we're seeing Claude 3 and Gemini come out with million-token models, which is hundreds to thousands of pages.

Because of this, people have been wondering: is RAG dead? If you can stuff many thousands of pages into the context window of an LLM, why do you need a retrieval system? It's a good question, and it sparked a lot of interesting debate on Twitter. It's maybe worth first grounding on what RAG is. RAG is really the process of reasoning and retrieval over chunks of information that have been retrieved. It starts with documents that are indexed so they're retrievable through some mechanism, typically semantic similarity search or keyword search. Retrieved documents are then passed to an LLM, and the LLM reasons about them to ground its response to the question in the retrieved content. That's the overall flow, but the important point is that it's typically multiple documents and it involves some form of reasoning.

02:05

One of the questions I asked recently is: if long-context LLMs can replace RAG, they should be able to perform multi-fact retrieval and reasoning from their own context really effectively. So I teamed up with Greg Cameron to pressure test this. He had already done some really nice needle-in-a-haystack analyses focused on single facts, called needles, placed in a haystack of Paul Graham essays. I extended that to mirror the RAG use case by using multiple facts, so I call it multi-needle. I built on a funny needle-in-a-haystack challenge published by Anthropic, where they placed pizza ingredients in the context and asked the LLM to retrieve that combination of ingredients. I riffed on that: I split the pizza ingredients into three different needles, placed the three ingredients at different places in the context, and then asked the LLM to recover all three from the context.

So the setup is: the question asks what secret ingredients are needed to build the perfect pizza, and the needles are the ingredients, figs, prosciutto, and goat cheese, placed in the context at specified intervals. The way the test works is that you set the percentage of the context at which to place the first needle, and the remaining two are placed at roughly equal intervals in the remaining context after the first. It's all open source, by the way; the link is below. Once the needles are placed, you ask the question, prompt the LLM with the context and the question, and it produces an answer. The framework then grades the response on two things: one, are all the specified ingredients present in the answer, and two, if not, which ones are missing.
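
As a rough, hedged illustration of the mechanics just described (the speaker's actual harness is open source and linked under the video), here is a self-contained toy sketch in Python; `insert_needles`, `grade`, and the hard-coded model reply are illustrative stand-ins, not the real evaluation code:

```python
# Toy sketch of the multi-needle setup: place needles at increasing depths,
# prompt a model, and grade which needles made it into the answer.
def insert_needles(haystack: str, needles: list[str], first_depth: float) -> str:
    """Place the first needle at `first_depth` (0-1) of the context and spread
    the remaining needles at roughly equal intervals after it."""
    n = len(needles)
    step = (1.0 - first_depth) / n
    positions = [first_depth + i * step for i in range(n)]
    text = haystack
    # Insert from the deepest position first so earlier offsets stay valid.
    for needle, frac in sorted(zip(needles, positions), key=lambda p: -p[1]):
        idx = int(len(text) * frac)
        text = text[:idx] + f" {needle} " + text[idx:]
    return text

def grade(answer: str, needles: list[str]) -> dict:
    """Check which needles appear in the answer (simple substring match)."""
    missing = [n for n in needles if n.lower() not in answer.lower()]
    return {"all_present": not missing, "missing": missing}

if __name__ == "__main__":
    needles = ["figs", "prosciutto", "goat cheese"]
    haystack = "essay text " * 5000                     # stand-in for Paul Graham essays
    context = insert_needles(haystack, needles, first_depth=0.1)
    prompt = context + "\n\nWhat are the secret ingredients needed to build the perfect pizza?"
    # answer = call_llm(prompt)                         # the model under test would go here
    answer = "The secret ingredients are figs and goat cheese."  # fake reply for illustration
    print(grade(answer, needles))                       # {'all_present': False, 'missing': ['prosciutto']}
```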

03:57

I ran this analysis with GPT-4 and came up with some fun results. On the left, what this looks at is different numbers of needles placed in a 120,000 token context window for GPT-4. I'm asking GPT-4 to retrieve either one, three, or ten needles, and I'm also asking it to do reasoning on those needles, which is what you see in the red bars: green is just retrieving the ingredients, red is retrieval plus reasoning, and the reasoning challenge here is simply to return the first letter of each ingredient. We find basically two things. The performance, the percentage of needles retrieved, drops with respect to the number of needles; that's intuitive, you place more facts and performance gets worse. But it also gets worse if you ask it to reason: if you say "just return the needles," it does a little better than if you say "return the needles and tell me the first letter of each." So that's the first observation: more facts is harder, and reasoning is harder than retrieval alone.

The second question we asked is: where in the context are the needles we're missing? We know, for example, that retrieval of ten needles is around 60%, so where are the missing needles? On the right you can see results telling us which specific needles the model fails to retrieve, as you go from 1,000 tokens up to 120,000 tokens on the x-axis and from needle one, placed at the start of the document, to needle ten, placed at the end. At a 1,000 token context length you can retrieve them all, so smaller context means better retrieval. But as I increase the context window, I see increasing failure to retrieve needles, shown in red here, toward the start of the document.

This is an interesting result, and it matches what Greg saw in the single-needle case as well. The way to think about it: if you read a book and I asked you a question about the first chapter, you might have forgotten it. The same kind of phenomenon appears to happen here with retrieval, where needles toward the start of the context are forgotten, or not well retrieved, relative to those at the end. This is an effect we see with GPT-4 and it's been reproduced quite a bit; I ran nine different trials here, and Greg has also seen it repeatedly with a single needle, so it seems like a pretty consistent result.

There's an interesting point related to this. I put it on Twitter, a number of folks replied, and someone sent me a paper that mentions recency bias as one possible reason: the most informative tokens for predicting the next token tend to be close to where you're doing your generation, so there's a bias to attend to recent tokens, which is obviously not great for the retrieval problem we saw here. So again, the results show that reasoning is a bit harder than retrieval, more needles is more difficult, and needles toward the start of your context are harder to retrieve than those toward the end. Those are the three main observations from this, and they may indeed be due to this recency bias.

07:42

Overall, what this tells you is to be wary of just context stuffing with long-context LLMs: there are no retrieval guarantees. There are also some results that came out just today suggesting that the single-needle case may be misleadingly easy: it's retrieving only a single needle, with no reasoning. And, as a tweet I show here points out, in a lot of these needle-in-a-haystack challenges, including mine, the facts we look for are very different from the background haystack of Paul Graham essays, which may be an interesting artifact; they note that if the needle is more subtle, retrieval is worse. So when you see very strong needle-in-a-haystack results put up by model providers, you should be skeptical. You shouldn't necessarily assume you'll get high-quality retrieval from these long-context LLMs, for several reasons: you need to think about retrieval of multiple facts, about reasoning on top of retrieval, and about the subtlety of the needle relative to the background context. Many of these challenges involve a single needle, no reasoning, and a needle that is very different from the background, all of which may make the challenge easier than a real-world fact-retrieval scenario. I just want to lay out those cautionary notes.

That said, I think it's fair to say this will certainly get better, and it's also fair to say that RAG will change. It's not a great joke, but the musician Frank Zappa made the point that "jazz isn't dead, it just smells funny." I think the same goes for RAG: RAG is not dead, but it will change. That's the key point here.

09:29

As a follow-up on that: RAG today is focused on precise retrieval of relevant document chunks. It typically means taking documents and chunking them in some particular way, often with fairly idiosyncratic chunking methods where things like chunk size are picked almost arbitrarily, embedding the chunks, and storing them in an index. Then you take a question, embed it, and do k-nearest-neighbor similarity search to retrieve relevant chunks; you're often setting a k parameter, the number of chunks to retrieve, you'll often do some filtering or post-processing on the retrieved chunks, and then you ground your answer in those retrieved chunks. So it's very focused on precise retrieval of just the right chunks.
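
For readers less familiar with that flow, here is a minimal sketch of the chunk-embed-index-retrieve loop just described, with a toy hash-based `embed` standing in for a real embedding model and a commented-out placeholder where the LLM call would go; it illustrates the pattern, not a production implementation:

```python
# Minimal "precise chunking" RAG loop: chunk, embed, index, then kNN retrieve.
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy embedding: hash word counts into a fixed-size vector (placeholder only)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(doc: str, chunk_size: int = 200) -> list[str]:
    """Split a document into fixed-size word chunks (the 'arbitrary' parameter)."""
    words = doc.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def build_index(docs: list[str]) -> list[tuple[str, list[float]]]:
    return [(c, embed(c)) for d in docs for c in chunk(d)]

def retrieve(index: list[tuple[str, list[float]]], question: str, k: int = 4) -> list[str]:
    """Return the k chunks whose embeddings are closest to the question embedding."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: -sum(a * b for a, b in zip(q, item[1])))
    return [text for text, _ in ranked[:k]]

docs = ["... your documents here ..."]
index = build_index(docs)
top_chunks = retrieve(index, "What does the talk say about context windows?")
# answer = call_llm(f"Answer using only this context:\n{top_chunks}\n\nQuestion: ...")
```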

10:11

Now, in a world where you have very long context models, there's a fair question to ask: is this really the most reasonable approach? On the left of the spectrum, closer to today, is the notion that I need the exact relevant chunk. You can risk over-engineering, you can have higher complexity and sensitivity to odd parameters like chunk size and k, and you can indeed suffer lower recall because you're only picking very precise chunks and you're beholden to particular embedding models. So going forward, as long-context models get better and better, you should certainly question the current, very precise chunking RAG paradigm. But on the flip side, I think just throwing all your docs into context probably won't be the preferred approach either. You'll suffer higher latency and higher token usage; I should note that today a 100,000 token GPT-4 call is roughly $1 per generation. I spent a lot of money on LangChain's account on that multi-needle analysis, and I don't want to tell Harrison how much, so it's not great. You can't audit retrieval, and security and authentication become issues: if, for example, different users need access to different retrieved documents or chunks, in the context-stuffing case you can't enforce that as easily. So there's probably some Pareto-optimal regime somewhere in the middle. I put this out on Twitter and I think some reasonable points were raised: inclusion at the document level is probably pretty sane, since documents are self-contained chunks of context.

11:48

So what about document-centric RAG: no chunking, just operating on the context of full documents? If you think forward to a RAG paradigm that's document-centric, you still have the problem of taking an input question and routing it to the right document; that doesn't change. So a lot of the methods we think about for query analysis (taking an input question and rewriting it in a way that optimizes retrieval), for routing (taking a question and routing it to the right database, be it a relational database, a graph database, or a vector store), and for query construction (for example text-to-SQL, text-to-Cypher for graphs, or text-to-metadata-filters for vector stores) are all still relevant in a world with long-context LLMs. You're probably not going to dump your entire SQL database into the LLM; you're still going to have SQL queries and graph queries. You may be more permissive with what you extract, but it still makes sense to store the majority of your structured data in those forms. Likewise with unstructured data like documents: as we said before, it still probably makes sense to store them independently, but simply aim to retrieve full documents rather than worrying about idiosyncratic parameters like chunk size.
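
As a hedged sketch of the query analysis, routing, and query construction steps described above, the snippet below classifies a question and constructs the matching kind of query; `classify_with_llm` and the returned query strings are hypothetical placeholders (in practice an LLM prompt or lightweight classifier would do the routing, and real text-to-SQL or text-to-Cypher chains would build the queries):

```python
# Route a question to the right store, then construct the right kind of query.
def classify_with_llm(question: str) -> str:
    """Placeholder router: a real system would prompt an LLM to pick a datasource."""
    q = question.lower()
    if "how many" in q or "average" in q:
        return "sql"
    if "connected to" in q or "related to" in q:
        return "graph"
    return "vectorstore"

def route_and_construct(question: str) -> dict:
    route = classify_with_llm(question)
    if route == "sql":
        return {"route": "sql", "query": f"-- text-to-SQL output for: {question}"}
    if route == "graph":
        return {"route": "graph", "query": f"// text-to-Cypher output for: {question}"}
    return {"route": "vectorstore", "query": question}   # document-centric retrieval

print(route_and_construct("How many orders were placed last month?"))
print(route_and_construct("What does the handbook say about vacation policy?"))
```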

13:07

Along those lines, there are a lot of methods out there (we've built a few of them) that are well suited to document retrieval. One I want to flag is what we call multi-representation indexing; there's a really nice paper related to this called Dense X Retrieval, on proposition indexing. The main point is simply this: you take your raw document and produce a representation of it, such as a summary, and you index that summary. Then at retrieval time you ask your question, embed it, and use the high-level summary just to retrieve the right document, and you pass the full document to the LLM for final generation. It's a nice trick because you don't have to worry about embedding full documents: you can use descriptive summarization prompts to build good summaries, and the problem you're solving is just "get me the right document," which is an easier problem than "get me the right chunk." There are also different variants of this, which I share below; one is called the parent document retriever, where you could in principle retrieve using smaller chunks but then return full documents. Either way, the point is preserving full documents for generation while using representations like summaries or chunks for retrieval. That's approach one, and I think it's really interesting.
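
A minimal sketch of that summary-in, full-document-out pattern follows; `summarize`, `embed`, and the two in-memory dictionaries are toy placeholders (a real setup would use an LLM summarization prompt, a proper embedding model, and a vector store paired with a document store, which is roughly what LangChain's multi-vector and parent-document retrievers provide):

```python
# Multi-representation indexing: retrieve by summary, generate from the full document.
def summarize(doc: str) -> str:
    return doc[:300]                                   # placeholder: call an LLM in practice

def embed(text: str) -> list[float]:
    return [float(len(text)), float(text.count(" "))]  # placeholder embedding

def cosine(a: list[float], b: list[float]) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5) or 1.0
    return num / den

documents = {
    "doc-1": "full text of document one ...",
    "doc-2": "full text of document two ...",
}
summary_index = {doc_id: embed(summarize(text)) for doc_id, text in documents.items()}

def retrieve_full_document(question: str) -> str:
    q = embed(question)
    best_id = max(summary_index, key=lambda doc_id: cosine(q, summary_index[doc_id]))
    return documents[best_id]          # the full document, not a chunk, goes to the LLM

context = retrieve_full_document("What is document two about?")
# answer = call_llm(f"Context:\n{context}\n\nQuestion: ...")
```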

14:27

Approach two is the idea behind RAPTOR, a cool paper that came out of Stanford fairly recently. It addresses the problem of questions that need to integrate information across many documents. What the approach does is take documents, embed them, cluster them, and then summarize each cluster, and it does this recursively until you end up with a single very high-level summary for the entire corpus of documents. You take this abstraction hierarchy, so to speak, of document summarizations and index all of it, and you use that index for retrieval. So if you have a question that draws on information across numerous documents, there is probably a summary present in the index that captures the answer. It's a nice trick for consolidating information across documents. The paper actually reports this with document chunks or slices as the leaves, but I showed, in a video and a notebook, that it works across full documents as well.

That's a nice segue: to do this, you need to think about long-context embedding models, because you're embedding full documents, and that's a really interesting thing to track. Hazy Research put out a really nice blog post on this using the Monarch Mixer, a newer architecture that handles longer context; they have a 32,000 token embedding model available on Together AI that's absolutely worth experimenting with. I think this is a really interesting trend: long-context embeddings play really well with this idea, where you take full documents, embed them with long-context embedding models, and build these document summarization trees really effectively. So this is another nice trick for working with full documents in the long-context LLM regime.
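
Here is a hedged, self-contained sketch of that recursive build; real RAPTOR embeds the texts and clusters them (the paper uses Gaussian mixture clustering), whereas here fixed-size grouping stands in for clustering and `summarize` stands in for an LLM summarization call:

```python
# RAPTOR-style summarization tree: cluster, summarize, recurse, then index every level.
def summarize(texts: list[str]) -> str:
    return " / ".join(t[:60] for t in texts)      # placeholder for an LLM-written summary

def cluster(texts: list[str], size: int = 3) -> list[list[str]]:
    return [texts[i:i + size] for i in range(0, len(texts), size)]   # stand-in for real clustering

def build_tree(leaves: list[str]) -> list[str]:
    """Return every node: the leaves plus all summary levels up to a single root."""
    all_nodes, level = list(leaves), list(leaves)
    while len(level) > 1:
        level = [summarize(group) for group in cluster(level)]
        all_nodes.extend(level)
    return all_nodes

docs = [f"full text of document {i}" for i in range(10)]
index_entries = build_tree(docs)   # 10 leaves, 4 + 2 intermediate summaries, 1 root summary
# Embed and index `index_entries`; cross-document questions tend to match the higher-level summaries.
```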

16:20

One other thing I'll note: I think there's also going to be a move away from single-shot RAG. In today's RAG we typically chunk documents, embed them, store them in an index, do retrieval, and then do generation, but there's no reason you shouldn't do reasoning on top of the generation, or on top of the retrieval, and feed back if there are errors. There's a really nice paper called Self-RAG that reports this; we implemented it using LangGraph and it works really well. The idea is simply to first grade the relevance of your documents relative to your question; if they're not relevant, you rewrite the question (you can do many things here, but in this case we do question rewriting) and try again. We also grade for hallucinations and for answer relevance. It moves RAG from a single-shot paradigm to a cyclic flow in which you do various gradings downstream, and this is all still relevant in the long-context LLM regime. In fact, you should absolutely take advantage of increasingly fast and performant LLMs to do these gradings, and frameworks like LangGraph let you build these flows, which gives you a more performant, self-reflective RAG pipeline.

I did get a lot of questions about latency here, and I completely agree there's a trade-off between performance, accuracy, and latency. I think the real answer is that you can opt for very fast models for the grading steps; inference on Groq is very fast, GPT-3.5 Turbo is very fast, and these are fairly easy grading challenges, so very fast LLMs work fine. You can also restrict the flow to a single turn of cyclic iteration to bound the latency. So I think it's a really cool approach and still relevant as we move toward longer context: it's building reasoning on top of RAG in the generation and retrieval stages.
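
The speaker implements this with LangGraph; purely to make the control flow concrete, here is a plain-Python sketch of one such loop in which the retriever, generator, graders, and question rewriter are toy placeholders rather than real LLM calls:

```python
# Self-reflective RAG loop: grade retrieval, rewrite the question if needed,
# then grade the generation for groundedness before returning it.
def retrieve(question: str) -> list[str]:
    corpus = ["doc: context windows are growing", "doc: pizza needs figs"]
    return [d for d in corpus if any(w in d for w in question.lower().split())]

def grade_relevance(question: str, doc: str) -> bool:
    return True                                   # placeholder: a fast LLM grader in practice

def generate(question: str, docs: list[str]) -> str:
    return f"Answer grounded in {len(docs)} document(s)."   # placeholder LLM call

def grade_hallucination(answer: str, docs: list[str]) -> bool:
    return bool(docs)                             # placeholder: True means grounded in docs

def rewrite_question(question: str) -> str:
    return question + " (rephrased)"              # placeholder query rewriter

def self_reflective_rag(question: str, max_turns: int = 2) -> str:
    for _ in range(max_turns):                    # cap iterations to bound latency
        docs = [d for d in retrieve(question) if grade_relevance(question, d)]
        if not docs:
            question = rewrite_question(question) # nothing relevant: rewrite and retry
            continue
        answer = generate(question, docs)
        if grade_hallucination(answer, docs):
            return answer                         # grounded answer: done
    return "No grounded answer found."

print(self_reflective_rag("What did the talk say about context windows?"))
```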

18:27

A related point: one of the challenges with RAG is that a question may ask about something outside the scope of your index, and that's always a problem. A really cool paper called CRAG, or corrective RAG, came out a couple of months ago that basically does a grading step just like we talked about before, and then, if the documents are not relevant, kicks off a web search and returns the search results to the LLM for final generation. It's a nice fallback for cases where the question is outside the domain of your retriever. So again, a nice trick: overlay reasoning on top of RAG. I think this trend continues, because it makes RAG systems more performant and less brittle to out-of-domain questions. We also showed that this particular approach works really well with open-source models: I ran it with Mistral 7B, running locally on my laptop using Ollama. So it's a really nice approach and I encourage you to look into it. And this is all independent of the LLM's context length: it's reasoning you can add on top of the retrieval stage that can improve overall performance.
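
A hedged sketch of that corrective fallback follows; `grade`, `web_search`, and `generate` are placeholders (a real version would use an LLM grader, a search API, and the generating model, and, per the talk, runs fine with open-source models such as Mistral 7B via Ollama):

```python
# Corrective-RAG fallback: if nothing retrieved is relevant, search the web instead.
def grade(question: str, doc: str) -> bool:
    return question.lower().split()[0] in doc.lower()      # placeholder relevance check

def web_search(question: str) -> list[str]:
    return [f"web result for: {question}"]                  # placeholder for a search API call

def generate(question: str, context: list[str]) -> str:
    return f"Answer to {question!r} grounded in {len(context)} source(s)."  # placeholder LLM

def corrective_rag(question: str, retrieved: list[str]) -> str:
    relevant = [d for d in retrieved if grade(question, d)]
    if not relevant:                    # out-of-index question: fall back to web search
        relevant = web_search(question)
    return generate(question, relevant)

print(corrective_rag("Who won the game last night?", retrieved=["doc about RAG pipelines"]))
```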

19:42

So the overall picture looks like this. The problem of routing your question to the right database and/or the right document remains in place: query analysis is still quite relevant, routing is still relevant, and query construction is still relevant. In the long-context regime there's less emphasis on document chunking, and working with full documents is probably more Pareto-optimal, so to speak. There are some clever tricks for indexing documents, like the multi-representation indexing and the hierarchical indexing with RAPTOR that we talked about, which are two interesting ideas for document-centric indexing. And then there's reasoning in generation and post-retrieval: reasoning on the retrieval itself to grade it, and on the generations themselves to check for hallucinations. Those are all interesting and relevant parts of a RAG system that I think we'll see more and more of as we move away from a naive prompt-response paradigm toward a flow paradigm. We're already seeing that in code generation, and it will probably carry over to RAG as well: RAG systems that have a cyclic flow to them, operate on documents, use long-context LLMs, and still use routing and query analysis, with reasoning pre-retrieval and reasoning post-retrieval.

Anyway, that was my talk. Feel free to leave any comments on the video and I'll try to answer any questions, but that's about it. Thank you.

