Is RAG Really Dead? Testing Multi-Fact Retrieval & Reasoning in GPT-4 128k
Summary
TLDR: In this video, Lance from LangChain walks through an analysis titled "multi-needle in a haystack." Context lengths for recent LLMs have been increasing, with Gemini 1.5 and Claude 3 reporting context lengths of up to one million tokens. This has sparked a debate about whether large amounts of context can be fed directly into LLMs without an external retrieval system. Inspired by Greg Kamradt's "needle in a haystack" analysis, Lance added a new analysis to Greg's open-source repository that injects multiple "needles" into the context and evaluates retrieval performance. The analysis offers insight into the limits and possibilities of LLM retrieval as a function of context length and the number of needles.
Takeaways
- 📈 Context lengths for LLMs (language models) are increasing; Gemini 1.5 and Claude 3 have recently reported context lengths of up to one million tokens.
- 🔍 This raises the question of whether RAG (an external retrieval system) is still necessary, or whether it can be replaced by feeding large amounts of context directly into the LLM.
- 📊 Greg Kamradt's "needle in a haystack" analysis probes how well LLMs can retrieve specific facts from their context.
- 📍 Fact placement within the document and overall context length are shown to affect an LLM's fact-retrieval performance.
- 🤖 "Multi-needle" retrieval, finding multiple facts (needles) in one context, is emphasized; Google has reported 100-needle retrieval.
- 🛠️ Multi-needle retrieval and evaluation were added to Greg's open-source repository, and the experiment workflow is explained concisely.
- 📝 The experimental setup uses LangSmith for evaluation, which provides tooling to audit the runs.
- 🔎 A detailed analysis with GPT-4 observes how retrieval performance changes with context length and the number of needles.
- 📉 Facts placed toward the beginning of the document tend to be retrieved less reliably in long contexts.
- 💡 The analysis offers insight into multi-fact retrieval in long contexts and the relationship between retrieval and reasoning, key information for understanding the limits and potential of LLMs.
Q & A
What is the purpose of the "multi-needle in a haystack" analysis?
-Its purpose is to evaluate how well LLMs can retrieve specific facts from their context, as a function of conditions such as where the facts are placed in the document and how long the context is.
What is the significance of the references to Gemini 1.5 and Claude 3?
-These models are cited as supporting context lengths of up to one million tokens, which calls into question the need for a traditional external retrieval system (RAG).
What is multi-needle retrieval?
-Multi-needle retrieval is an analysis of whether an LLM can retrieve several different facts (the "needles") from a context (the "haystack") in a single turn.
What is LangSmith's role?
-LangSmith runs the analysis, logs the results, and orchestrates the evaluation, making the analysis process easy to audit.
Why do LLMs struggle to retrieve facts placed near the beginning of a document?
-The results show that in long contexts LLMs are worse at retrieving information located at the start of the document, and this tendency strengthens as more information has to be handled.
What did the study find about replacing RAG with long-context LLMs?
-In settings with long contexts and many "needles," LLMs cannot reliably retrieve every fact, which suggests that completely replacing RAG may be difficult.
What were the "secret pizza ingredients" used in the analysis?
-The secret pizza ingredients used in the analysis were figs, prosciutto, and goat cheese.
What is needed to create the evaluation set?
-Creating an evaluation set requires a dataset containing the question and its answer, which is managed with LangSmith.
How does needle placement affect the results?
-Needle placement strongly affects retrieval success, especially when needles are placed near the beginning of the document. Facts in the latter half of the document are retrieved more reliably.
What does the analysis say about the cost of multi-needle retrieval?
-Cost depends on context length; long-context runs are more expensive, but a well-designed study can run many trials within a reasonable budget.
Outlines
🔍 Probing LLM memory and retrieval in long contexts
This section introduces an analysis by Lance from LangChain called multi-needle in a haystack. The focus is on LLM retrieval in long contexts, in particular whether context lengths of up to a million tokens can fully replace an external retrieval system. Lance cites Greg Kamradt's analysis of GPT-4 and Claude, which tested LLMs' ability to retrieve a specific fact at different context lengths and found that as the context grows, LLMs do poorly at retrieving facts placed near the beginning of the document. Lance also mentions Google's reported 100-needle retrieval, demonstrating the ability to retrieve 100 unique needles in a single turn.
🧩 Implementing multi-needle retrieval and evaluation
This part explains how multi-needle retrieval and evaluation were implemented. Lance describes his changes to Greg's open-source repository, which allow multiple needles to be injected into one context and the LLM's performance to be evaluated. He illustrates the workflow with a toy example, finding the secret ingredients of the perfect pizza, which involves injecting the question, answer, and needles into the context, asking the LLM to answer given that context, and scoring the result. Lance also introduces LangSmith as the evaluation tool; it logs every run and orchestrates the evaluation, which is very useful for auditing.
📊 Results: retrieval and cost in long contexts
This section presents the experimental results gathered with LangSmith. Lance tested GPT-4's ability to retrieve multiple needles at different context lengths. The results show that retrieval degrades as context length increases, especially when more needles must be retrieved. A pattern also emerges: needles placed earlier in the document have lower retrieval success. Lance also discusses cost, noting that long-context runs are relatively expensive, but a carefully designed study can be completed on a reasonable budget.
🤔 Deeper insight into long-context LLM retrieval
This part examines the limits of long-context LLM retrieval. The analysis finds that as the number of needles and the context length grow, retrieval is not guaranteed, and success drops in particular when needles sit near the start of the document. Lance also notes the importance of prompting and stresses that retrieval and reasoning are separate tasks, with retrieval potentially limiting reasoning ability. He closes by summarizing the main observations and encouraging people to use this approach to better understand the potential and limits of long-context LLMs.
💡 Conclusion: future applications of long-context retrieval
In this final section, Lance summarizes the key points and possible applications of long-context LLM retrieval. He stresses its limitations, particularly for multi-needle retrieval over long contexts: retrieval is not always guaranteed, and there can be distinct failure modes. He also reminds viewers that retrieval and reasoning are different problems and that retrieval can bound reasoning ability. Finally, he notes that the analysis is open source and encourages people to use the tools and datasets for their own research, with links to the relevant resources.
Keywords
💡Context length
💡Needle in a haystack
💡RAG
💡Multi-needle retrieval
💡Language model performance
💡LangChain
💡LangSmith
💡Open source
💡Fact placement
💡Cost
Highlights
Lance from LangChain introduces a new analysis called multi-needle in a haystack, aiming to understand large language models' (LLMs) context retrieval capabilities.
Context lengths for LLMs have been increasing significantly, with Gemini 1.5 and Claude 3 reporting up to a million token context lengths, raising questions about the necessity of external retrieval systems.
Greg Kamradt's analysis, needle in a haystack, is discussed, which tests LLMs' ability to retrieve specific facts from large contexts and how fact placement affects retrieval.
Lance adds the ability for multi-needle retrieval and evaluation to Greg's open-source repository, aiming to test LLMs' capability in retrieving multiple facts from extended contexts.
A toy example using pizza ingredients as needles demonstrates how multi-needle retrieval is set up and evaluated.
The LangSmith tool is used for orchestrating and auditing the evaluation, emphasizing the importance of precise data logging and analysis.
The evaluation process includes setting up data sets in LangSmith, cloning Greg's repo, and running tests with various context lengths and needle placements.
Early results show that LLMs' performance in retrieving needles from long contexts decreases as the number of needles increases.
A detailed breakdown of the evaluation's cost is provided, showing that even though tests with longer contexts are expensive, meaningful research can still be conducted within a reasonable budget.
Analysis reveals that needles placed earlier in the context are harder for LLMs to retrieve, a significant finding that impacts how information should be structured for LLMs.
Additional experiments test LLMs' ability to not only retrieve information but also to perform reasoning tasks based on the retrieved facts.
The importance of prompt design in improving LLMs' retrieval and reasoning capabilities is highlighted, suggesting that there's room for optimization.
The analysis concludes with observations on the limitations of long context retrieval, especially for multi-needle retrieval, underscoring the complexity of completely replacing external retrieval systems.
Lance underscores that the entire analysis, including tooling and data, is open source, encouraging further experimentation and exploration by the community.
The talk concludes with optimism about the potential of long context LLMs and the importance of continued analysis to understand their capabilities and limitations fully.
Transcripts
hi this is Lance from LangChain I want
to talk about a pretty fun analysis that
I've been working on for the last few
days um called multi-needle in a
haystack um and we're going to talk through
these graphs and what they mean and some
of the major insights that came out of
this analysis uh maybe but first I want
to like kind of set the
stage so context lengths for LLMs have
been increasing most notably Gemini 1.5
and Claude 3 have recently reported up to a
million token context lengths and this
has provoked a lot of questions like if
you have a million tokens which is
hundreds or maybe thousands of pages can
you replace rag altogether you know why
would you need an external retrieval
system if you can plumb huge amounts of
context directly into these llms so it's
a really good question and a really
interesting debate to help kind of
address this Greg Kamradt recently put
out an analysis called needle in a
haystack which attempts to answer the
question how well can these LLMs retrieve
specific facts from their context with
respect to things like how long the
context is or where the fact is placed
within the context so he did a pretty
influential analysis on GPT-4 and also
Claude um which basically tested along
the X different context lengths so
going from 1K all the way up to 120k in
the case of GPT-4 and on the Y being
different document placements or fact
placements within the document so either
put it at the start of the doc or put at
the end so he basically injects this
needle into the context at different
places and varies the uh context length
and each time ask the question that's
that you need the fact or the needle to
answer and basically scores it you know
can the llm get it right or
wrong and what he found is that the
LLM at least in this case GPT-4 fails to
retrieve facts towards the start of
documents in the regime of longer
context so that was kind of punchline
pretty interesting it's a nice way to
characterize performance of retrieval
with respect to these two important
parameters but you know for rag you
really want to retrieve multiple facts
so this was testing single fact
retrieval but typically for rag systems
you're chunking a document you're
retrieving some number of chunks maybe
you know three to five and you can do
this reasoning over disparate parts of a
document using similarity search and
chunk
retrieval so kind of to map the idea of
rag onto this approach you really would
need to be able to retrieve various
facts from a context um so it's maybe
like three needles five needles 10
needles and Google recently reported 100
needle retrieval so what they're showing
here is the ability to retrieve 100
unique needles in a single turn uh with
Gemini 1.5 and they test a large number
of points here they vary the context
length you can see on the X and they show
the recall the number of needles that
they return on the
Y now this kind of analysis is really
interesting because if we're really
talking about the ability to kind of
remove rag from our workflows and rely
strictly on context stuffing we
really need to know like what is the
retrieval recall with respect
to the context length or number of
needles does this really work well what
are the pitfalls right so I recently
added the ability to do multi-needle
retrieval and
evaluation uh to Greg's open source repo
so Greg open-sourced this whole analysis
and what I did is I went ahead and added
the ability to inject multiple needles
in a context and characterize and
evaluate the performance so the flow is
laid out here it's pretty simple and all
you need is actually three things so
you need um let me just move this over so
you need a
question you need to know your needles
you need to have an answer so that's
kind of the way this
structured so a fun toy question which
is kind of derived from an interesting
Claude 3 needle in a haystack analysis
is related to pizza ingredients so they
reported some funny results with Claude
basically trying to find a needle
related to pizza ingredients in a sea
of other context and Claude 3 kind of
recognized it was being tested it was
kind of a funny tweet that went around
uh but it's actually kind of a fun
challenge we can actually take the
question which was what are the secret
ingredients needed to build the perfect
pizza and the answer the secret
ingredients are figs prosciutto and goat
cheese and we can just parse that out
into three separate needles so our first
needle is figs are the secret
ingredients the second needle is prosciutto
the third needle is goat cheese now the
way this analysis works is we take those
needles and we basically partition them
into the context uh at different
locations so we basically pick placement
of the first needle and the other two uh
are then allocated accordingly into
like kind of rough equal
spacing uh depending on how much context
is left after you place the first one so
that's kind of the way it works um you
then pass that context to an llm along
with a question have the llm answer the
question given the context um and then
we
evaluate um in this toy example the
answer only contains figs and evaluation
returns a score of one for figs and also
tells us which needle it retrieved
spatially um so that's the overall flow
and that's really what goes on when you
run this analysis based on the stuff
that I added to Greg's
repo um
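To make the placement step concrete, here is a minimal sketch of that spacing logic as described, assuming token lists as inputs; it is not the repo's actual implementation:

```python
def insert_needles(haystack_tokens, needle_token_lists, depth_percent_min):
    """Place the first needle at depth_percent_min of the context, then spread
    the remaining needles at roughly equal intervals through the context that
    is left. A simplified sketch of the behavior described above."""
    context = list(haystack_tokens)
    first = int(len(context) * depth_percent_min / 100)
    step = max(1, (len(context) - first) // len(needle_token_lists))
    positions = [first + i * step for i in range(len(needle_token_lists))]
    # Insert from the back so earlier insertion points stay valid.
    for pos, needle in sorted(zip(positions, needle_token_lists), reverse=True):
        context[pos:pos] = list(needle)
    return context

# Toy usage: three short "needles" dropped into a 100-token filler context.
filler = ["lorem"] * 100
needles = [["figs"], ["prosciutto"], ["goat", "cheese"]]
print(len(insert_needles(filler, needles, depth_percent_min=10)))  # 104 tokens
```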
now um if we want to set this up another
thing I added to the repo is the ability
to use LangSmith as your evaluator and
this has a lot of nice properties it
allows you to log all of your runs it
orchestrates the evaluation for you and
it's really good for auditing which we
can see here in a minute so I'm going to
go ahead and create uh a LangSmith
data set to show how this works um and
all you need is so here's a notebook
I've just set uh a few different secrets
um or uh environment variables um for
like the LangSmith API key the LangChain
endpoint and tracing v2 so basically set
these and these are done in your LangSmith
setup and once you've done this you're
all set to go
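As a rough sketch of that notebook setup (these are the standard LangSmith tracing variables; the key values are placeholders you supply yourself):

```python
import os

# LangSmith / LangChain tracing configuration (values are placeholders).
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"

# The model under test is called through OpenAI, so that key is needed too.
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
```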
so if you then go over to your LangSmith
uh kind of overall project page it'll
look something like this where you can
see over here on the left you have
projects annotation queues deployments
datasets and testing um and what we're
going to do is we're going to go to
datasets and testing and what we're going to do here
is we're going to say create a new data
set I'll move this down
here and we're going to create a new
data set that contains our question and
our answer and we're going to call this
multi needle
test so here we go we'll call this
multi-needle-test
we'll create a key value data set and
just say create so now you can see we've
created a new data set it's empty um
there no tests and no examples and what
we're going to do is we're going to say
add example here and what we're going to
do is just copy over our
question there we go we'll copy over
that
answer so again our question is what are
the secret ingredients needed to build the
perfect pizza our answer contains those
secret ingredients we submit that so now
we have an example so this is basically
a data set with one example question
answer pair no tests yet so that's kind
of step one
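If you'd rather create the same data set programmatically, a minimal sketch using the LangSmith Python client looks roughly like this; the input/output key names are assumptions, so match whatever the eval harness expects:

```python
from langsmith import Client

client = Client()  # picks up LANGCHAIN_API_KEY / endpoint from the environment

dataset = client.create_dataset(
    dataset_name="multi-needle-test",
    description="Question/answer pair for the multi-needle retrieval eval",
)
client.create_example(
    inputs={"question": "What are the secret ingredients needed to build the perfect pizza?"},
    outputs={"answer": "The secret ingredients are figs, prosciutto, and goat cheese."},
    dataset_id=dataset.id,
)
```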
now all we need to do simply is I've
already done this so um if we go up here
so this is Greg's
repo um you need to clone this you need
to just follow these commands to clone
setup create a virtual environment pip
install a few things and then you're set
to go so basically all we've done is
set up LangSmith set these environment
variables clone the repo that's it
create a data set we're ready to
go so if we go
back this is the command that we can use
to run our evaluation and there are a few
pieces so we use LangSmith as our
evaluator we set some number of context
length intervals to test so in this case
I'm going to run three tests and we can
set our context length minimum and our
maximum so I'm going to say I want to
go from 1,000 tokens all the way up to
120,000
tokens um and you know three intervals
so we'll do basically one data point in
the middle um I'm going to set this
document depth percent min which is the
initial point at which I insert the
first needle and then the other two will
be set accordingly in equal spacing in
the remaining context that's kind of all
that you need there we'll use OpenAI
I'll set the model I want to test I'm going
to basically flag multi-needle
evaluation to be true and here's where
I'm basically going to point to this
data set that we just created this
multi-needle test data set I specified
here in eval set and the final thing is
I just say here's the three needles I
want to inject into the context and you
can see that maps exactly to what we had
here we had our question our answer
those are in LangSmith and our needles
we just pass in so that's really it we
can take this command
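For reference, the parameters being described map to something like the following sketch; the actual entry point and flag names are whatever the repo's README documents, and the model identifier and depth value here are assumptions:

```python
# Sketch of the run configuration described above; not the repo's exact CLI.
run_config = {
    "evaluator": "langsmith",                  # use LangSmith to orchestrate and score the eval
    "context_lengths_num_intervals": 3,        # three context lengths to test
    "context_lengths_min": 1000,               # from 1,000 tokens...
    "context_lengths_max": 120000,             # ...up to 120,000 tokens
    "document_depth_percent_min": 5,           # depth of the first needle (assumed value)
    "provider": "openai",
    "model_name": "gpt-4",                     # the model under test (placeholder name)
    "multi_needle": True,
    "eval_set": "multi-needle-test",           # the LangSmith dataset created earlier
    "needles": [
        "Figs are one of the secret ingredients needed to build the perfect pizza.",
        "Prosciutto is one of the secret ingredients needed to build the perfect pizza.",
        "Goat cheese is one of the secret ingredients needed to build the perfect pizza.",
    ],
}
```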
and I can go over to, so here we go,
I'm in my fork of Greg's repo right
now um and I'm just going to go ahead
and
run so we should see this go ahead and
kick off it has some nice logging so it
shows like okay um here's like our
experiment you know here's like what
we're testing um here's the needles that
we're injecting here's where we
inserting the needles um and this is
like our first experiment with a thousand
tokens um and it's just rolling through
these parameters so it rolls through
these experiments as we've just set laid
out and what we can see is if we go over
to LangSmith now we're going to start
to see experiments or tests roll in
that's pretty
nice and what we can do is we can kind
of Select here different settings we
want to look at so if you want to look
at needles retrieved for example we
can see those scores here now this final
one's still running which is fine um but
it's all here for us and let's try to
refresh that okay so now we have them
all so this is pretty cool you can see
how much it cost so again be careful
here at
120,000 tokens it cost about a dollar so
it's expensive but smaller contexts very
cheap right so you're priced per token
um you can see the latencies here the
P99 you see the p50 latency uh creation
time runs and what's kind of cool is you
can see everything about the experiment
is logged here the the needles retrieved
the context length the first needle
depth percentage the insertion
percentages model name the needles
number of needles totals so it's all
there for you now I'm going to show you
something that's pretty cool you can
click on one of these and actually opens
up the run and here you can see here was
like our input here was the reference or
here was the input here's like the
reference output this is like the
correct answer and here's what the llm
actually said so in this case it looks
like the secret ingredients needed to build a perfect
pizza include prosciutto goat cheese and figs so it
gets it right and you can see it scores
as three which is exactly right so
that's pretty cool let's look at another
one um so we can look at our this is our
U 60,000 token experiment this so it's
the one in the middle and in this case
you can see the secret ingredients needed
to build the perfect pizza are goat
cheese and prosciutto so it's missing figs so
that's kind of interesting and we can
see it correctly scores it as two now we
can even go further we can actually open
this run and you can see over here this
is just the trace this is kind of
everything that happened we can look at
our prompt um and this is like the whole
everything in our prompt now what's
pretty nice is we can actually see right
here this is the entire prompt that went
into the
llm um and here is the answer now what I
like about this is if we want to verify
that all the needles were actually
present so we can actually just search
for like secret ingredient right you can
see figs are one of the secret ingredients so
it's in there it's definitely in the
context it's not in the answer um we can
just keep searching so prosciutto goat cheese we can
see all three secret ingredients are in
the context and two are in the
generation so it's a nice sanity check
to say okay the needles are actually
there they're where you expect them to
be um I find that very useful for
auditing and making sure that like um
you know the needles are placed correctly so
it's really nice to be able to do that
uh to kind of convince yourself that
everything's working as
expected um so what's pretty cool then
is when I've done this um I actually can
do a bit more so I'm just going to copy over
some code here and this will all be made
public um so I'm just going to paste this into
my notebook and what this is going
to do is basically just going to
grab my data sets um I just pass in my
eval set name and it's going to suck in
the data that we just ran and I think
that's probably done we can have a look
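The snippet being pasted isn't shown on screen, but a rough equivalent using the LangSmith client might look like this; the project name and the exact fields pulled into the data frame are assumptions:

```python
import pandas as pd
from langsmith import Client

client = Client()

# Pull the runs logged for this eval (the project name here is a placeholder;
# use whatever test project LangSmith created for your eval set).
runs = list(client.list_runs(project_name="multi-needle-test"))

# Flatten the pieces we care about into a pandas data frame.
df = pd.DataFrame(
    [
        {
            "run_url": getattr(run, "url", None),  # public link for auditing the run
            "inputs": run.inputs,
            "outputs": run.outputs,
            "extra": run.extra,  # experiment metadata (needles, context length, ...)
        }
        for run in runs
    ]
)
print(df.head())
```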
yep so you can see nice it loads this as
a pandas data frame it has all this uh
you know all this run information that
we want has the needles um and what's
really nice is actually it has also the
Run URL so this is now a public link
that can be
shared um and you can share those with
anyone so anyone can go ahead and audit
this run and convince thems that it was
doing the right
thing so with all this we can actually
do quite a lot um and we've actually
done uh kind of a more involved
study um that I want to talk about a
little bit right now so with this
Tooling in place we're able to ask some
interesting questions um we focus on GPT-4
to start and we kind of just talk
through usage this is a blog post that
that'll go out tomorrow along with this
video um we talk through usage so I
don't want to cover that too much we
talk through workflow but here's like
Crux of the analysis that we did so we
set up three eval sets uh for one three
and 10
needles um and what we did is we ran a
bunch of trials we burned a fair number
of tokens on this analysis but it was
worth it because it's pretty
interesting what we did is test the
ability for GPT-4 to retrieve needles uh
with respect to like the number of
needles and also the context length so
you can see here what we're doing is
we're asking GPT-4 to retrieve either one
three or 10 needles again these are
pizza ingredients in a single turn and
what you can see is with one
needle um whether it's a thousand tokens or
120,000 tokens retrieval is quite good
so if you place a single needle not a
problem now I should note that this
needle is placed in the middle of the
context which may be relevant uh but at
least for this study the single needle
case is right in the middle and it's
able to get that every time regardless
of the context but here's where things
get a bit
interesting if we bump up the number of
needles so for example look at three
needles or 10 needles you can see the
performance start to degrade quite a lot
when we bump up the context so
retrieving three needles out of 120,000
tokens um you see performance drop to
like you know around 80% of the needles
are retrieved and it drops to more like
60% if you go to 10 needles so there's
an effect that happens as you add more
needles in the long context regime you
miss more and more of them that's kind
of interesting so because we have all
the logging for every insertion point of
each needle we can also ask like where
are these failures actually occurring
and this actually shows it right here
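As a rough illustration of how one might build that "where did it fail" view from the logged runs, here is a self-contained sketch; the column names and the toy values are illustrative assumptions, not the repo's schema or real results:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy per-needle log: one row per (context_length, needle_index) with whether
# that needle was retrieved. Values are illustrative, not actual results.
per_needle_df = pd.DataFrame(
    [{"context_length": 1000, "needle_index": i, "retrieved": 1} for i in range(1, 11)]
    + [{"context_length": 120000, "needle_index": i, "retrieved": int(i > 4)} for i in range(1, 11)]
)

# Rows: needle position (1 = start of document); columns: context length.
pivot = per_needle_df.pivot_table(
    index="needle_index", columns="context_length", values="retrieved", aggfunc="mean"
)

plt.imshow(pivot, cmap="RdYlGn", vmin=0, vmax=1, aspect="auto")
plt.xticks(range(len(pivot.columns)), pivot.columns)
plt.yticks(range(len(pivot.index)), pivot.index)
plt.xlabel("Context length (tokens)")
plt.ylabel("Needle position (1 = start of document)")
plt.colorbar(label="Retrieval rate")
plt.show()
```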
so what's kind of an interesting
result is what we found here is an
example of retrieving 10
needles um at different context sizes so
on the X here you can see the context
length going from a thousand tokens up to
120,000 tokens and on the Y here you can
see each of our needles so this is a
star of the document up at the top so
needle one 2 down to 10 are inserted in
our document at different locations and
each one of these is like one experiment
so at a thousand tokens you ask the LLM to
retrieve all 10 needles and it gets them
all you can see this green means 100%
retrieval and what happens is as you
increase the context window you start to
see a real degradation in
performance now that shouldn't be
surprising we already know that from up
here that's the case right getting 10
needles from
120,000 is a lot worse than getting
them from a thousand, at a thousand you get them
every time at 120,000 it's 60% but what's
interesting is this heat map tells you
where they're failing and there's a real
pattern there the degradation in
performance is actually due to needles
placed towards the top of the
document um so that's an interesting
result it appears that
um needles placed earlier in the context
uh have a lower chance of being
remembered or retrieved um now this is a
result that Greg also saw in the single
needle case and it appears to carry over
to the multi- needle case and it
actually starts a bit earlier in terms
of context sizes so you can think about
it like this if you have a document and
you have like three different facts you
want to retrieve to answer a question
you're more likely to get the facts
retrieved from the latter half of the
document than the first so it can kind
of like forget about the first part of
the document or the fact in the first
part of the document which is very
important because of course valuable
information is present in the earlier
stages of the document uh but we may or
may not be able to retrieve it the
likelihood of retrieval drops a lot as you
move into this um kind of earlier part
of the document so so anyway that's an
interesting result it kind of follows
what Greg
reported um for the single needle
case as well um now a final thing we
show is that oftentimes you don't just
want to retrieve you also want to do
some reasoning um so we built a second
set of eval challenges that ask for
the first letter of every ingredient so
you have to retrieve and also reason to
return the first letter of every secret
ingredient
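For concreteness, the second eval set's question/answer pair would look something like the sketch below; the exact phrasing is an assumption, the point being that it stacks a small reasoning step on top of retrieval:

```python
# Hypothetical example pair for the retrieval-plus-reasoning eval set.
reasoning_example = {
    "question": "What is the first letter of each secret ingredient needed to build the perfect pizza?",
    "answer": "F (figs), P (prosciutto), G (goat cheese)",
}
```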
and what we found here is basically that green is
reasoning red is retrieval and this is all done at
120,000 tokens as you bump up the number
of needles you can basically see that
both get worse as you would expect and
reasoning lags retrieval a bit which is
also what you kind of would expect
retrieval kind of sets an upper bound on
your ability to reason so again it's
important to recognize that one
retrieval is not guaranteed and two
reasoning May degrade a little bit
relative to retrieval so those are kind
of the two points that are really
important to recognize um
and maybe I'll just like kind of
underscore some of the main observations
here and you know especially as we think
about long context more we think about
replacing rag in certain use cases it's
very important to understand the
limitations of long context retrieval um
and like you know multi- needle
retrieval from long context in
particular there's no retrieval
guarantees so multiple facts are not
guaranteed to be retrieved especially as
number of needles and the context size
increases we saw that pretty
clearly there can be different patterns
of retrieval failure so GPT-4 tends to
fail in retrieval of needles towards the start
of the document as the context length
increases that's point two point three
is prompting may definitely matter here
so I don't I don't presume to say we
have the ideal prompt I was using kind
of the prompt that Greg already had
in the repo it absolutely may be the
case that improved prompting is
necessary um there's been some
evidence that you need that for Claude so
that's like very valid to consider um
and also that retrieval and reasoning
are both uh very important tasks and
retrieval might kind of set a bound on
your ability to reason and so if you
have a challenge that requires reasoning
on top of retrieval um you're kind of
like stacking two problems on one
another and your ability to reason may
or may not be kind of governed or
limited by your ability to retrieve the
facts which is pretty intuitive but you
know it's just a good thing to
underscore that like reasoning and
retrieval are independent problems um
and retrieval is like a precondition to
reason well um those are the main things
I want to leave you with um I also will
have a quick note here that if we go
back to our data set um which we can see
here like how much did this actually
cost I know people are kind of worried
about that so just the three tests that
we did are around um you know maybe
around $2 so again it's really
those long context tests that are quite
costly LangSmith shows you the cost here so
you kind of you can actually track it
pretty carefully but the nice thing is
for like a kind of well-designed study
you could spend maybe $10 and you could
actually look at you know quite a number
of Trials especially if you care more
about like the middle context length
regime uh you can you can do quite a
number of tests within a reasonable
budget you know so I just I just want to
throw that out there that you don't
necessarily have to totally break the
bank to do some interesting work here um
the things we did were pretty costly but
that was mostly because we generated a
large number of replicates just to validate
the results but if you're doing it for
yourself you could do it just a single
pass and this is actually not that many
measurements you know so here it would
be like 1 2 3 4 you know six measurements
if you don't do any replicates so you
know it's not too costly to
produce some early results using this
approach uh everything's open source all
the data is open source as well um it's
all checked into Greg's repo you can
find it um in the viz
section and uh yep the multi-needle data sets it'll
all be there so yeah I encourage you to
play with this hopefully it's useful and
I think long context LLMs are super
promising and interesting and analysis
like this hopefully will make them at
least a little bit more understandable
and build some intuition as to whether
or not you can use them to actually
replace rag or not so thanks very
much