Is RAG Really Dead? Testing Multi Fact Retrieval & Reasoning in GPT4-128k
Summary
TLDR: Lance from LangChain shares an interesting analysis he has been working on recently, called multi-needle in a haystack. It explores how well large language models (LLMs) can retrieve specific facts from long contexts. He notes that Gemini 1.5 and Claude 3 can handle context lengths of up to a million tokens, which has sparked debate about whether LLMs can replace external retrieval systems (RAG) entirely. Greg Kamradt's needle-in-a-haystack analysis tested how well GPT-4 and Claude retrieve a fact across different context lengths and positions within the document. Lance extended this work by adding multi-needle retrieval and evaluation, and his experiments show that as context length grows, LLMs become worse at retrieving information placed toward the front of the document. He also highlights the benefits of LangSmith as an evaluation tool, including logging every run and making audits easy. Finally, Lance discusses the limitations of using LLMs for retrieval plus reasoning, and stresses that retrieval and reasoning are separate problems.
Takeaways
- 📈 **Multi-needle retrieval analysis**: Lance discusses an analysis of large language models (LLMs) called multi-needle retrieval, aimed at understanding how well models retrieve specific pieces of information from long contexts.
- 🔍 **Effect of context length**: As context lengths grow, such as the million-token contexts reported for Gemini 1.5 and Claude 3, LLM performance at retrieving facts placed near the start of the document degrades.
- 📚 **Importance of document position**: In long contexts, LLMs are more likely to retrieve information from the latter half of a document than from the first half.
- 💡 **Challenge of multi-needle retrieval**: When multiple facts (needles) must be retrieved from the context, as in Google's reported 100-needle retrieval, performance drops as both context length and the number of needles increase.
- 📊 **Experiment design**: By injecting multiple "needles" (key facts) at different positions and varying the context length, Lance evaluated LLM retrieval performance under different conditions.
- 🛠️ **Tools and resources**: Lance shows how to use Greg's open-source repository and LangSmith to set up experiments, run tests, and log results.
- 📝 **Auditing and verification**: LangSmith records every step of an experiment, including the context, question, answer, and LLM generation, which makes auditing and verification easy.
- 📉 **Degradation pattern**: As context length increases, retrieval performance drops, and the failures cluster around needles placed near the start of the document.
- 🧠 **Relationship between reasoning and retrieval**: Reasoning builds on successful retrieval; if retrieval is poor, reasoning performance suffers as well.
- 💰 **Cost considerations**: Long-context tests are expensive, but with a carefully designed experiment, meaningful research can be done on a reasonable budget.
- 📝 **Data sharing**: All data and tools are open source and available in Greg's repository, so other researchers can reproduce and verify the results.
- ⚠️ **LLM limitations**: Lance emphasizes that when considering replacing a traditional retrieval system (RAG) with an LLM, it is important to understand the limitations and challenges of long-context retrieval.
Q & A
What is the name of the analysis project Lance is discussing?
-The analysis is called "multi-needle in a haystack".
Why did the reports of up to one-million-token context lengths from Gemini 1.5 and Claude 3 raise so many questions?
-Because a million-token context can hold hundreds or even thousands of pages, which raises the question of whether large language models (LLMs) could replace traditional retrieval systems (RAG) entirely.
What question does Greg Kamradt's "Needle in a Haystack" analysis try to answer?
-It tries to answer how well LLMs can retrieve a specific fact from their context as a function of context length and where the fact is placed within the document.
In multi-needle retrieval and evaluation, what are the "needles"?
-The "needles" are the specific pieces of information or facts that need to be retrieved from the context.
What capability did Google's recently reported 100-needle retrieval demonstrate?
-It demonstrated retrieving 100 unique needles in a single query, that is, retrieving 100 distinct facts from the context in one turn.
What functionality did Lance add to Greg's open-source repository?
-Lance added multi-needle retrieval and evaluation, allowing multiple needles to be injected into the context and performance to be evaluated.
What are the advantages of LangSmith as an evaluation tool?
-LangSmith logs all runs, orchestrates the evaluation for you, and is very useful for auditing.
In Lance's analysis, why are needles placed at the beginning of the document harder to retrieve?
-Likely because, when handling long contexts, language models are weaker at remembering or retrieving information from the earlier part of the document.
Why does Lance argue that even with a long context, multi-needle retrieval cannot guarantee retrieving every fact?
-Because retrieval performance drops as the context length and the number of needles grow, and the model may "forget" or fail to retrieve information placed near the start of the document.
What method did Lance use to verify that the needles were correctly placed in the context?
-He used LangSmith's detailed logging and searched for specific keywords (such as "secret ingredient") to confirm that all the needles were actually present in the context and placed where expected.
In Lance's study, why can reasoning performance be limited by retrieval performance?
-Because in scenarios that require reasoning, all the relevant facts must first be retrieved correctly before effective reasoning can happen. If retrieval misses some of the needed information, reasoning performance naturally suffers.
Roughly how much did the analysis cost, and how does Lance suggest testing within a budget?
-The analysis cost roughly $2, with the long-context tests accounting for most of it. He suggests that for personal research a single pass without replicates is enough, which makes it possible to run many tests on a reasonable budget, especially when focusing on mid-range context lengths.
Outlines
😀 Introducing the multi-needle retrieval analysis
Lance from LangChain introduces an interesting analysis he has been working on, called "multi-needle in a haystack". It explores how large language models (LLMs) retrieve specific facts from large amounts of text. As context lengths grow, with Gemini 1.5 and Claude 3 handling up to a million tokens, people have started asking whether external retrieval systems are still needed. Greg Kamradt's "needle in a haystack" analysis addressed part of this by testing how context length and the placement of a fact within the document affect LLM retrieval. Lance extends that work by adding multi-needle retrieval and evaluation, to better mirror the multi-fact retrieval that real applications need.
📈 Implementing and evaluating multi-needle retrieval
Lance walks through how to use LangSmith as the evaluator for multi-needle retrieval. He shows how to set up a LangSmith project, create a dataset, and add a question-answer pair. He then explains how to run experiments with Greg's open-source repository, including setting the context lengths, the document depth percentage, and the number of needles to inject. Results are logged in LangSmith, including the needles retrieved, context length, insertion percentages, and model name, which makes auditing and verification straightforward.
🔍 Analyzing the experimental results
Lance shares multi-needle retrieval results using GPT-4. The experiments use separate eval sets for 1, 3, and 10 needles and test context lengths from 1K to 120K tokens. Retrieval degrades as the context grows, especially as the number of needles increases. Interestingly, the model is more likely to miss needles placed in the first half of the document. Lance also shows an experiment that requires retrieval plus reasoning, asking the model to return the first letter of each secret ingredient; as the number of needles grows, both retrieval and reasoning performance drop.
💡 Limitations of long-context retrieval
Lance emphasizes the limits of long-context retrieval: with multiple needles there is no retrieval guarantee, especially as the number of needles and the context size increase. The model also tends to fail on needles near the front of the document, which matters when important information appears early. He notes that prompting may matter, and that retrieval and reasoning are separate problems, with retrieval being a precondition for effective reasoning.
📚 Open-source tooling and cost considerations
Lance notes that all the tools and data are open source and can be found in the viz section of Greg's repository. He encourages people to experiment, since long-context LLMs are very promising. He also discusses cost: long-context tests are expensive, but a well-designed study can run a good number of trials within a reasonable budget.
Keywords
💡Multi-needle retrieval
💡Context length
💡Gemini 1.5
💡Claude 3
💡Needle in a Haystack analysis
💡LangSmith
💡Long-context LLMs
💡Retrieval failure
💡Reasoning
💡Cost
💡Dataset
Highlights
Lance from LangChain introduces an interesting analysis called multi-needle in a haystack, exploring how well LLMs retrieve specific facts across different context lengths and fact placements.
Gemini 1.5 and Claude 3 recently reported context lengths of up to a million tokens, sparking a debate about whether external retrieval systems can be replaced entirely.
Greg Kamradt's needle-in-a-haystack analysis tested GPT-4 and Claude on retrieving a fact across different context lengths and document positions.
The study found that for longer contexts, LLMs perform poorly at retrieving facts placed near the start of the document.
For RAG systems, you typically need to retrieve multiple facts and reason over different parts of a document.
Google reported that Gemini 1.5 can retrieve 100 unique needles in a single turn.
Lance added multi-needle retrieval and evaluation to Greg's open-source repository.
Using LangSmith as the evaluator logs every run, which makes auditing and analysis easy.
With LangSmith you can create datasets and run experiments, with all the information recorded and stored.
Results show that multi-needle retrieval performance degrades as context length grows, especially for needles placed toward the front of the document.
For long-context retrieval, LLMs tend to forget or fail to retrieve facts from the earlier part of the document.
The experiments also tested reasoning on top of retrieval and found that reasoning accuracy drops as the number of needles increases.
Retrieval is not guaranteed: as the number of needles and the context size increase, not every fact may be retrieved.
Improved prompting may be key to better retrieval.
Retrieval and reasoning are separate problems, and retrieval is a precondition for effective reasoning.
Even a well-designed study can be done within a reasonable budget, though long-context tests are the costly part.
All data and tools are open source and available in Greg's repository for further analysis and use.
Transcripts
Hi, this is Lance from LangChain. I want to talk about a pretty fun analysis that I've been working on for the last few days called multi-needle in a haystack. We're going to talk through these graphs, what they mean, and some of the major insights that came out of this analysis, but first I want to set the stage.

Context lengths for LLMs have been increasing. Most notably, Gemini 1.5 and Claude 3 have recently reported context lengths of up to a million tokens, and this has provoked a lot of questions: if you have a million tokens, which is hundreds or maybe thousands of pages, can you replace RAG altogether? Why would you need an external retrieval system if you can plumb huge amounts of context directly into these LLMs? It's a really good question and a really interesting debate.

To help address this, Greg Kamradt recently put out an analysis called Needle in a Haystack, which attempts to answer the question: how well can these LLMs retrieve specific facts from their context, with respect to things like how long the context is or where the fact is placed within the context? He did a pretty influential analysis on GPT-4 and also Claude. On the x-axis are different context lengths, going from 1K all the way up to 120K in the case of GPT-4, and on the y-axis are different placements of the fact within the document, from the start of the document to the end. He injects the needle into the context at different places, varies the context length, and each time asks a question that needs the fact (the needle) to answer, then scores whether the LLM gets it right or wrong.

What he found is that the LLM, at least in this case GPT-4, fails to retrieve facts placed toward the start of documents in the regime of longer contexts. That was the punchline, and it's a nice way to characterize retrieval performance with respect to these two important parameters. But for RAG you really want to retrieve multiple facts. This was testing single-fact retrieval, whereas typically for RAG systems you're chunking a document, retrieving some number of chunks, maybe three to five, and reasoning over disparate parts of a document using similarity search and chunk retrieval.
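For contrast, this is roughly what that conventional RAG retrieval step looks like, sketched here with LangChain components. This is a minimal sketch rather than the setup used in the video; the package paths assume the split-package LangChain layout and may differ by version, and `document.txt` is just a placeholder file name.

```python
# Rough sketch of the conventional RAG retrieval being contrasted with context stuffing:
# chunk a document, index the chunks, and pull back the top 3-5 by similarity.
# Requires langchain-text-splitters, langchain-openai, langchain-community, and faiss-cpu.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

with open("document.txt") as f:          # any long document (placeholder path)
    text = f.read()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.create_documents([text])

vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})   # retrieve ~3-5 chunks

docs = retriever.invoke("What are the secret ingredients needed to build the perfect pizza?")
for d in docs:
    print(d.page_content[:100])
```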
To map the idea of RAG onto this approach, you would really need to be able to retrieve various facts from a context, maybe three needles, five needles, ten needles. Google recently reported 100-needle retrieval: what they show is the ability to retrieve 100 unique needles in a single turn with Gemini 1.5. They test a large number of points, varying the context length on the x-axis and showing the recall, the number of needles returned, on the y-axis.

This kind of analysis is really interesting, because if we're talking about removing RAG from our workflows and relying strictly on context stuffing, we really need to know what the retrieval recall is with respect to the context length and the number of needles. Does this really work well? What are the pitfalls?

So I recently added the ability to do multi-needle retrieval and evaluation to Greg's open-source repo. Greg open-sourced this whole analysis, and what I did was add the ability to inject multiple needles into a context and then characterize and evaluate the performance. The flow is laid out here, and it's pretty simple: all you need is three things. You need a question, you need to know your needles, and you need an answer. That's the way this is structured.

A fun toy question, derived from an interesting Claude 3 needle-in-a-haystack analysis, is related to pizza ingredients. They reported some funny results with Claude trying to find a needle about pizza ingredients in a sea of other context, and Claude 3 kind of recognized it was being tested; it was a funny tweet that went around. But it's actually a fun challenge. We can take the question, "What are the secret ingredients needed to build the perfect pizza?", and the answer, "The secret ingredients are figs, prosciutto, and goat cheese", and parse that out into three separate needles: the first needle is figs, the second is prosciutto, the third is goat cheese.

The way this analysis works is that we take those needles and partition them into the context at different locations. We pick the placement of the first needle, and the other two are then allocated at roughly equal spacing, depending on how much context is left after you place the first one. We then pass that context to an LLM along with the question, have the LLM answer the question given the context, and then we evaluate. In this toy example the answer only contains figs, so the evaluation returns a score of one (for figs) and also tells us which needle it retrieved. That's the overall flow, and that's what goes on when you run this analysis based on what I added to Greg's repo.
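To make that flow concrete, here is a minimal sketch of the placement and scoring logic just described. This is not the actual code from Greg's repo: the even-spacing rule and the substring-based scoring are assumptions based on the description above, and the keywords list is just the toy pizza example.

```python
# Minimal sketch of the multi-needle flow described above (not the repo's actual code).
# Assumptions: the first needle goes at a chosen depth, the rest are spaced roughly
# evenly through the remaining context, and scoring simply checks which needle
# keywords show up in the model's answer.

def insert_needles(haystack: str, needles: list[str], first_depth_pct: float) -> str:
    """Insert needles into the haystack: first at first_depth_pct, the rest evenly after it."""
    first_pos = int(len(haystack) * first_depth_pct / 100)
    step = (len(haystack) - first_pos) // len(needles)
    positions = [first_pos + i * step for i in range(len(needles))]
    context = haystack
    # Insert from the back so earlier insertion points are not shifted.
    for pos, needle in reversed(list(zip(positions, needles))):
        context = context[:pos] + " " + needle + " " + context[pos:]
    return context

def score_answer(answer: str, keywords: list[str]) -> int:
    """Count how many needle keywords appear in the model's answer."""
    return sum(1 for kw in keywords if kw.lower() in answer.lower())

needles = [
    "Figs are one of the secret ingredients needed to build the perfect pizza.",
    "Prosciutto is one of the secret ingredients needed to build the perfect pizza.",
    "Goat cheese is one of the secret ingredients needed to build the perfect pizza.",
]
context = insert_needles("some long background text " * 200, needles, first_depth_pct=50)
print(len(context.split()))  # the stuffed "haystack" with all three needles inside
print(score_answer("The secret ingredients are figs and goat cheese.",
                   ["figs", "prosciutto", "goat cheese"]))  # -> 2
```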
Now, if we want to set this up, another thing I added to the repo is the ability to use LangSmith as your evaluator. This has a lot of nice properties: it lets you log all of your runs, it orchestrates the evaluation for you, and it's really good for auditing, which we'll see in a minute. I'm going to go ahead and create a LangSmith dataset to show how this works. Here's a notebook where I've set a few environment variables: the LangSmith API key, the LangChain endpoint, and tracing v2. You set these as part of your LangSmith setup, and once you've done that you're all set to go.

If you then go over to your LangSmith project page, it looks something like this: on the left you have projects, annotation queues, deployments, and datasets and testing. We're going to go to datasets and testing and create a new dataset that contains our question and our answer. We'll call it "multi needle test", create it as a key-value dataset, and hit create. Now we have a new dataset; it's empty, with no tests and no examples. We click "add example" and copy over our question and our answer. Again, the question is "What are the secret ingredients needed to build the perfect pizza?" and the answer contains those secret ingredients. We submit that, and now we have a dataset with one example question-answer pair and no tests yet. That's step one.

Now all we need to do (I've already done this) is go to Greg's repo, clone it, follow the commands to set up a virtual environment, and pip install a few things, and you're set to go. So all we've done is: set up LangSmith, set the environment variables, clone the repo, and create a dataset. We're ready to go.
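For reference, here is a small sketch of the same setup done in code rather than through the UI. The environment variable names are the standard LangSmith/LangChain tracing variables, the dataset is created with the langsmith Python client, and the dataset name simply mirrors the one used in this walkthrough; treat the whole block as an illustrative alternative to the UI steps, not the video's own code.

```python
# Minimal sketch of the LangSmith setup done programmatically (the video uses the UI).
# Replace the placeholder keys with your own before running.
import os
from langsmith import Client

# Standard LangSmith tracing environment variables.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"

client = Client()

# Create the dataset with a single question/answer example, mirroring the UI steps above.
dataset = client.create_dataset(dataset_name="multi needle test")
client.create_example(
    inputs={"question": "What are the secret ingredients needed to build the perfect pizza?"},
    outputs={"answer": "The secret ingredients are figs, prosciutto, and goat cheese."},
    dataset_id=dataset.id,
)
```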
If we go back, this is the command we can use to run our evaluation, and there are a few pieces. We use LangSmith as our evaluator. We set some number of context length intervals to test; in this case I'm going to run three tests, and we set the minimum and maximum context lengths, so I'm going from 1,000 tokens all the way up to 120,000 tokens with three intervals, which gives us basically one data point in the middle. I set the document depth percent min, which is the initial point at which the first needle is inserted; the other needles are then placed at equal spacing in the remaining context. We use OpenAI as the provider, set the model I want to test, flag multi-needle evaluation as true, and point to the dataset we just created by specifying the multi-needle test dataset as the eval set. The final thing is to pass in the three needles I want to inject into the context, and you can see that maps exactly to what we had before: the question and answer are in LangSmith, and the needles we just pass in. That's really it. We can take this command and head over to the terminal.
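Here is a sketch of roughly what that invocation might look like, expressed as a Python subprocess call. The script name and flag names are reconstructed from the description above and are assumptions; check the repository's README for the exact command-line interface before running anything.

```python
# Illustrative invocation of the multi-needle eval script (script and flag names are
# assumptions reconstructed from the walkthrough; consult Greg's repo for the real CLI).
import json
import subprocess

needles = [
    "Figs are one of the secret ingredients needed to build the perfect pizza.",
    "Prosciutto is one of the secret ingredients needed to build the perfect pizza.",
    "Goat cheese is one of the secret ingredients needed to build the perfect pizza.",
]

subprocess.run(
    [
        "python", "main.py",
        "--evaluator", "langsmith",              # log and score runs in LangSmith
        "--context_lengths_num_intervals", "3",  # three context lengths between min and max
        "--context_lengths_min", "1000",
        "--context_lengths_max", "120000",
        "--document_depth_percent_min", "50",    # where the first needle is inserted
        "--provider", "openai",
        "--model_name", "gpt-4-0125-preview",    # assumed GPT-4 model identifier
        "--multi_needle", "True",
        "--eval_set", "multi needle test",       # the LangSmith dataset created earlier
        "--needles", json.dumps(needles),
    ],
    check=True,
)
```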
Here I'm in my fork of Greg's repo, and I'm just going to go ahead and run it. We see it kick off, and it has some nice logging: it shows the experiment, what we're testing, the needles we're injecting, where we're inserting them, and that the first experiment is at 1,000 tokens. It just rolls through the parameters and the experiments as we laid them out. If we go over to LangSmith, we start to see the experiments, or tests, roll in, which is pretty nice.

We can select different settings to look at; for example, if we want to look at the needles retrieved, we can see those scores here. The final run is still going, which is fine, but it's all here for us; let's refresh. Okay, now we have them all. This is pretty cool: you can see how much it cost, so be careful here; at 120,000 tokens it cost about a dollar, so it's expensive, but smaller contexts are very cheap, since you're priced per token. You can see the latencies, the p99 and p50, the creation time, and the runs. What's cool is that everything about the experiment is logged here: the needles retrieved, the context length, the first needle depth percentage, the insertion percentages, the model name, the needles, and the number of needles in total.
It's all there for you. Now I'll show you something that's pretty cool: you can click on one of these and it actually opens up the run. Here you can see the input, the reference output (the correct answer), and what the LLM actually said. In this case it says the secret ingredients needed to build the perfect pizza include prosciutto, goat cheese, and figs, so it gets it right, and you can see it scores as three, which is exactly right.

Let's look at another one, the 60,000-token experiment, the one in the middle. In this case the answer is that the secret ingredients needed to build the perfect pizza are goat cheese and prosciutto, so it's missing figs, which is interesting, and we can see it correctly scores it as two. We can go even further and open the run itself. Over here is the trace, basically everything that happened, and we can look at our prompt; this is the entire prompt that went into the LLM, and here is the answer. What I like about this is that if we want to verify that all the needles were actually present, we can just search for "secret ingredient". You can see "figs are one of the secret ingredients", so it's definitely in the context, even though it's not in the answer. We can keep searching: prosciutto, goat cheese. All three secret ingredients are in the context, and two are in the generation. It's a nice sanity check that the needles are actually there and are where you expect them to be. I find that very useful for auditing and making sure the needles are placed correctly, and for convincing yourself that everything is working as expected.
What's pretty cool, then, is that once I've done this I can do a bit more. I'm going to copy over some code (this will all be made public) and paste it into my notebook. What it does is grab my datasets: I just pass in my eval set name, and it pulls in the data from the runs we just did. You can see it loads this as a pandas data frame with all the run information we want, including the needles, and what's really nice is that it also has the run URL. This is a public link that can be shared with anyone, so anyone can go and audit the run and convince themselves that it was doing the right thing.
thing so with all this we can actually
do quite a lot um and we've actually
done uh kind of a more involved
study um that I want to talk about a
little bit right now so with this
Tooling in place we're able to ask some
interesting questions um we focus on gp4
to start and we kind of just talk
through usage this is a blog post that
I'll go out tomorrow along with this
video um we talk through usage so I
don't want to cover that too much we
talk through workflow but here's like
Crux of the analysis that we did so we
set up three eval sets uh for one three
and 10
needles um and what we did is we ran a a
bunch of Trials we burned a fair number
of tokens on this analysis but it was
worth it because it was it's it's pretty
interesting what we did is test the
ability for gbd4 to retrieve needles uh
with respect to like the number of
needles and also the context length so
you can see here is what we're doing is
we're asking gbd4 to retrieve either one
three or 10 needles again these are
Peach ingredients in a single turn and
what you can see is with one
needle um whether it's a th000 tokens or
120,000 tokens retrieval is quite good
so if you place a single needle not a
problem now I should note that this
needle is placed in the middle of the
context which may be relevant uh but at
least for this study the single needle
case is right in the middle and it's
able to get that every time regardless
of the context but here's where things
get a bit
interesting if we bump up the number of
needles so for example look at three
needles or 10 needles you can see the
performance start to degrade quite a lot
when we bump up the context so
retrieving three needles out of 120,000
tokens um you see performance drop to
like you know around 80% of the needles
are retrieved and it drops to more like
60% if you go to 10 needles so there's
an effect that happens as you add more
needles in the long contract regime you
miss more and more of them that's kind
of interesting so because we have all
the logging for every insertion point of
each needle we can also ask like where
are these failures actually occurring
This shows it right here, and it's an interesting result. Here is an example of retrieving ten needles at different context sizes. On the x-axis you can see the context length going from 1,000 tokens up to 120,000 tokens, and on the y-axis you can see each of the needles: the start of the document is at the top, so needles one, two, down to ten are inserted into the document at different locations, and each cell is one experiment. At 1,000 tokens you ask the LLM to retrieve all ten needles and it gets them all; the green means 100% retrieval. As you increase the context window, you start to see a real degradation in performance.

That shouldn't be surprising; we already know from the earlier plot that retrieving ten needles from 120,000 tokens is a lot worse than from 1,000 tokens, where you get them every time, versus about 60% at 120,000. But what's interesting is that this heat map tells you where they're failing, and there's a real pattern: the degradation in performance is due to needles placed toward the top of the document. It appears that needles placed earlier in the context have a lower chance of being remembered or retrieved. This is a result Greg also saw in the single-needle case, and it appears to carry over to the multi-needle case, where it actually starts at somewhat smaller context sizes.

You can think about it like this: if you have a document with three different facts you want to retrieve to answer a question, you're more likely to get the facts from the latter half of the document than the first. The model can kind of forget the first part of the document, or the fact in the first part of the document, which matters because valuable information is often present in the earlier parts of a document, and we may or may not be able to retrieve it. The likelihood of retrieval drops a lot as you move toward the earlier part of the document. Anyway, it's an interesting result that follows what Greg reported for the single-needle case as well.
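If you want to build this kind of heat map yourself from the logged results, a hedged sketch is below. The column names ("needle_index", "context_length", "retrieved") are assumptions about how the per-needle metadata might be tabulated, not the repo's actual schema.

```python
# Sketch of turning logged per-needle results into a retrieval heat map like the one
# described above. Column names are assumptions, not the repo's actual schema.
import matplotlib.pyplot as plt
import pandas as pd

def plot_retrieval_heatmap(df: pd.DataFrame) -> None:
    # Rows: needle position in the document (1 = start); columns: context length.
    grid = df.pivot_table(index="needle_index", columns="context_length",
                          values="retrieved", aggfunc="mean")
    plt.imshow(grid, aspect="auto", cmap="RdYlGn", vmin=0, vmax=1)
    plt.colorbar(label="fraction retrieved")
    plt.xticks(range(len(grid.columns)), grid.columns)
    plt.yticks(range(len(grid.index)), grid.index)
    plt.xlabel("context length (tokens)")
    plt.ylabel("needle (1 = start of document)")
    plt.title("Needle retrieval by position and context length")
    plt.show()
```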
A final thing we show is that often you don't just want to retrieve, you also want to do some reasoning. So we built a second set of eval challenges that ask for the first letter of every ingredient: you have to retrieve and also reason in order to return the first letter of every secret ingredient. What we found, where green is reasoning and red is retrieval, all done at 120,000 tokens, is that as you bump up the number of needles, both get worse, as you would expect, and reasoning lags retrieval a bit, which is also what you would expect: retrieval sets an upper bound on your ability to reason. So it's important to recognize, one, that retrieval is not guaranteed, and two, that reasoning may degrade a little relative to retrieval. Those are the two points that are really important to recognize.
Maybe I'll just underscore some of the main observations here, especially as we think more about long context and about replacing RAG in certain use cases: it's very important to understand the limitations of long-context retrieval, and of multi-needle retrieval from long context in particular.

First, there are no retrieval guarantees: multiple facts are not guaranteed to be retrieved, especially as the number of needles and the context size increase; we saw that pretty clearly. Second, there can be distinct patterns of retrieval failure: GPT-4 tends to fail to retrieve needles toward the start of the document as the context length increases. Third, prompting may definitely matter here. I don't presume to say we have the ideal prompt; I was using the prompt that Greg already had in the repo, and it absolutely may be the case that improved prompting is necessary; there's been some evidence that you need that for Claude, so that's very valid to consider. And finally, retrieval and reasoning are both important tasks, and retrieval sets a bound on your ability to reason. If you have a challenge that requires reasoning on top of retrieval, you're stacking two problems on top of one another, and your ability to reason may be governed or limited by your ability to retrieve the facts. That's pretty intuitive, but it's a good thing to underscore: reasoning and retrieval are independent problems, and retrieval is a precondition to reasoning well.
Those are the main things I want to leave you with. I'll also add a quick note: if we go back to our dataset, we can see how much this actually cost, since I know people worry about that. The three tests we did come to around $2, and again it's really the long-context tests that are costly; LangSmith shows you the cost, so you can track it pretty carefully. The nice thing is that for a well-designed study you could spend maybe $10 and look at quite a number of trials, especially if you care more about the middle of the context length range. You can do quite a number of tests within a reasonable budget, so you don't necessarily have to break the bank to do some interesting work here. The things we did were pretty costly, but that was mostly because we generated a large number of replicates just to validate the results. If you're doing it for yourself, you could do a single pass, and that's actually not that many measurements; here it would be something like six measurements if you don't do any replicates, so it's not too costly to produce some interesting results using this approach.

Everything is open source, and all the data is open source as well; it's all checked into Greg's repo, and you can find it in the viz section along with the multi-needle datasets. I encourage you to play with this. Hopefully it's useful. I think long-context LLMs are super promising and interesting, and analyses like this will hopefully make them at least a little more understandable and build some intuition as to whether or not you can use them to actually replace RAG. Thanks very much.