Is RAG Really Dead? Testing Multi Fact Retrieval & Reasoning in GPT4-128k

LangChain
13 Mar 2024 · 23:18

Summary

TLDR: In this video, Lance from LangChain walks through an analysis titled "Multi-Needle in a Haystack." Context lengths for recent LLMs have been growing rapidly, with models such as Gemini 1.5 and Claude 3 reporting context windows of up to one million tokens. This has sparked a debate over whether large amounts of context can simply be fed directly into an LLM, with no external retrieval system at all. Inspired by Greg Kamradt's "Needle in a Haystack" analysis, Lance added a new analysis to Greg's open-source repository that injects multiple "needles" into the context and evaluates retrieval performance. Through this analysis, the video offers insight into the limits and potential of LLM retrieval as a function of context length and the number of needles.

Takeaways

  • 📈 Context lengths for LLMs (large language models) keep increasing; Gemini 1.5 and Claude 3 have recently reported context lengths of up to one million tokens.
  • 🔍 This raises the question of whether RAG (an external retrieval system) is still necessary, or whether it can be replaced by feeding large amounts of context directly into the LLM.
  • 📊 Greg Kamradt's "Needle in a Haystack" analysis probes how well an LLM can retrieve a specific fact from its context.
  • 📍 Both where a fact is placed in the document and how long the context is have been shown to affect an LLM's fact-retrieval performance.
  • 🤖 The importance of "multi-needle" retrieval, finding several facts (needles) in one context, is emphasized; Google has reported 100-needle retrieval.
  • 🛠️ Multi-needle retrieval and evaluation were added to Greg's open-source repository, and the experimental workflow is explained concisely.
  • 📝 The experimental setup uses LangSmith for evaluation, which also provides tooling for auditing the runs.
  • 🔎 A detailed analysis with GPT-4 observes how retrieval performance changes with context length and the number of needles.
  • 📉 Facts placed toward the beginning of the document tend to be harder to retrieve in long contexts.
  • 💡 The video offers insight into retrieving multiple facts from long contexts and into the relationship between retrieval and reasoning, key information for understanding the limits and potential of LLMs.

Q & A

  • What is the purpose of the "multi-needle in a haystack" analysis?

    -To evaluate how well LLMs can retrieve specific facts from their context, i.e., how retrieval performance depends on conditions such as where the facts are placed in the document and how long the context is.

  • What is the significance of the references to Gemini 1.5 and Claude 3?

    -These models support context lengths of up to one million tokens, which calls into question whether a conventional external retrieval system (RAG) is still needed.

  • What is multi-needle retrieval?

    -Multi-needle retrieval is an analysis that evaluates whether an LLM can retrieve several distinct facts ("needles") from a context (the "haystack") in a single turn.

  • What role does LangSmith play?

    -LangSmith runs the analysis, logs the results, and orchestrates the evaluation, making the whole process easy to audit.

  • Why do LLMs struggle to retrieve facts from the beginning of a document?

    -The analysis shows that in long contexts LLMs are worse at retrieving information placed near the start of the document, and this tendency grows stronger as more information has to be handled.

  • What did the study find about replacing RAG with long-context LLMs?

    -With long contexts and many needles, LLMs do not reliably retrieve every fact, which suggests that fully replacing RAG may be difficult.

  • What were the "secret pizza ingredients" used in the analysis?

    -The secret pizza ingredients used in the analysis were figs, prosciutto, and goat cheese.

  • What is needed to create an evaluation set?

    -You need a dataset containing the question and its answer, managed in LangSmith (a minimal code sketch follows this Q&A list).

  • How does needle placement affect the results?

    -Needle placement has a large effect on retrieval success, especially when a needle sits near the beginning of the document; facts in the latter half of the document are retrieved more reliably.

  • What does the video say about the cost of multi-needle analysis?

    -The cost depends on the context length; very long contexts are expensive, but a well-designed study can run a good number of trials within a reasonable budget.
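
As referenced in the answer about evaluation sets above, here is a minimal sketch of creating such a dataset with the LangSmith Python client. It assumes the `langsmith` package is installed and `LANGCHAIN_API_KEY` is set; the dataset name is only an example, not necessarily the one used in the video.

```python
from langsmith import Client

client = Client()

# Create an evaluation dataset with one question/answer example.
dataset = client.create_dataset("multi-needle-testing")  # example dataset name
client.create_example(
    inputs={"question": "What are the secret ingredients needed to build the perfect pizza?"},
    outputs={"answer": "The secret ingredients are figs, prosciutto, and goat cheese."},
    dataset_id=dataset.id,
)
```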

Outlines

00:00

🔍 Exploring LLM memory and retrieval in long contexts

This section introduces an analysis by Lance from LangChain called multi-needle in a haystack. The focus is on LLM retrieval over long contexts, in particular whether context lengths of up to a million tokens can fully replace an external retrieval system. Lance cites Greg Kamradt's analysis of GPT-4 and Claude, which tested how well an LLM retrieves a specific fact at different context lengths. That analysis found that as the context grows, LLMs do poorly at retrieving facts placed near the start of the document. Lance also mentions Google's 100-needle result, which demonstrates retrieving 100 unique needles in a single turn.

05:01

🧩 Implementing multi-needle retrieval and evaluation

This section explains how multi-needle retrieval and evaluation were implemented. Lance describes the changes he made to Greg's open-source repository so that multiple needles can be injected into a single context and the LLM's performance evaluated. He walks through a playful example, finding the secret ingredients of the perfect pizza, which involves injecting the question, answer, and needles into the context, having the LLM answer the question given that context, and scoring the result. Lance also introduces LangSmith as the evaluation tool: it logs every run and orchestrates the evaluation, which is very useful for auditing.

10:02

📊 Results: retrieval and cost in long contexts

This section presents the experimental results gathered with LangSmith. Lance tested GPT-4's ability to retrieve multiple needles at different context lengths. The results show that retrieval performance degrades as the context length grows, especially when more needles must be retrieved. A pattern also emerges: needles placed toward the front of the document are retrieved less reliably. Lance also discusses the cost of this kind of analysis, noting that long-context tests are relatively expensive but that a carefully designed study can be done on a reasonable budget.

15:03

🤔 Deeper insights into long-context retrieval

This section examines the limitations of long-context retrieval. The analysis shows that as the number of needles and the context length increase, retrieval is no longer guaranteed; success rates drop especially when needles sit near the start of the document. Lance also notes the importance of prompting and stresses that retrieval and reasoning are separate tasks, with retrieval potentially capping reasoning ability. He closes by summarizing the main observations and encouraging people to use this method to better understand the potential and limits of long-context LLMs.

20:05

💡 Conclusion: the future of long-context retrieval

In this final section, Lance sums up the key points and possible applications of long-context LLM retrieval. He emphasizes its limitations, particularly for multi-needle retrieval over long contexts: retrieval is not always guaranteed, and there are distinct failure patterns. He also reminds viewers that retrieval and reasoning are different problems and that retrieval can limit reasoning. Finally, he points out that the analysis is fully open source and encourages people to use the tools and datasets for their own studies, with links to the relevant resources.


Keywords

💡Context length

Context length is the number of tokens (the basic units of language) a language model can attend to at once. The video notes that recent language models can handle context lengths of up to one million tokens, which corresponds to hundreds or even thousands of pages, and this leads to the debate over whether large amounts of context can be processed directly without an external retrieval system.

💡Needle in a haystack

"Needle in a haystack" is an analysis technique for evaluating the ability to find a specific fact within a large amount of information. The video covers Greg Kamradt's analysis, which studies how well a language model can retrieve a specific fact depending on the context length and where the fact is placed in the document.

💡RAG

RAG stands for Retrieval-Augmented Generation, the process in which a language model retrieves information from external sources before generating a response. The video discusses whether very long context lengths could replace RAG systems.

💡Multi-needle retrieval

Multi-needle retrieval evaluates whether a language model can retrieve several facts ("needles") from a context (the "haystack") at once. The video introduces a new analysis that tests and scores a model's retrieval performance over multiple facts.

💡Language model performance

Language model performance refers to how well a model handles a given task. The video evaluates how accurately a language model can retrieve information under different context lengths and fact placements.

💡Lang chain

LangChain is the project and organization mentioned in the video. In this context, it refers to the work of adding multi-needle retrieval and evaluation to Greg Kamradt's open-source repository.

💡Lang Smith

LangSmith is the evaluation tool mentioned in the video. It logs language model runs, automates evaluation, and tracks results in a way that is convenient for auditing.

💡Open source

Open source refers to software whose source code is public and can be viewed, modified, and distributed by anyone. The video notes that Greg Kamradt's analysis is open source, and the LangChain work built on top of it is public as well.

💡Fact placement

Fact placement refers to where facts are positioned within the context. The video tests how well a language model retrieves facts (needles) placed at different positions in a document.

💡Cost

The video also touches on the cost of running these analyses with language models, in particular the cost of testing different context lengths, pointing out that tests with long contexts can become expensive.

Highlights

Lance from LangChain introduces a new analysis called Multi-Needle in a Haystack, aiming to understand large language models' (LLMs') context retrieval capabilities.

Context lengths for LLMs have been increasing significantly, with Gemini 1.5 and Claude 3 reporting up to a million-token context lengths, raising questions about the necessity of external retrieval systems.

Greg Kamradt's analysis, Needle in a Haystack, is discussed, which tests LLMs' ability to retrieve specific facts from large contexts and how fact placement affects retrieval.

Lance adds the ability for multi-needle retrieval and evaluation to Greg's open-source repository, aiming to test LLMs' capability in retrieving multiple facts from extended contexts.

A toy example using pizza ingredients as needles demonstrates how multi-needle retrieval is set up and evaluated.

LangSmith is used for orchestrating and auditing the evaluation, emphasizing the importance of precise data logging and analysis.

The evaluation process includes setting up data sets in Lang Smith, cloning Greg's repo, and running tests with various context lengths and needle placements.

Early results show that LLMs' performance in retrieving needles from long contexts decreases as the number of needles increases.

A detailed breakdown of the evaluation's cost is provided, showing that even though tests with longer contexts are expensive, meaningful research can still be conducted within a reasonable budget.

Analysis reveals that needles placed earlier in the context are harder for LLMs to retrieve, a significant finding that impacts how information should be structured for LLMs.

Additional experiments test LLMs' ability to not only retrieve information but also to perform reasoning tasks based on the retrieved facts.

The importance of prompt design in improving LLMs' retrieval and reasoning capabilities is highlighted, suggesting that there's room for optimization.

The analysis concludes with observations on the limitations of long context retrieval, especially for multi-needle retrieval, underscoring the complexity of completely replacing external retrieval systems.

Lance underscores that the entire analysis, including tooling and data, is open source, encouraging further experimentation and exploration by the community.

The talk concludes with optimism about the potential of long context LLMs and the importance of continued analysis to understand their capabilities and limitations fully.

Transcripts

00:01

Hi, this is Lance from LangChain. I want to talk about a pretty fun analysis I've been working on for the last few days called multi-needle in a haystack. We're going to talk through these graphs, what they mean, and some of the major insights that came out of this analysis. But first I want to set the stage.

Context lengths for LLMs have been increasing. Most notably, Gemini 1.5 and Claude 3 have recently reported up to a million-token context lengths, and this has provoked a lot of questions: if you have a million tokens, which is hundreds or maybe thousands of pages, can you replace RAG altogether? Why would you need an external retrieval system if you can plumb huge amounts of context directly into these LLMs? It's a really good question and a really interesting debate.

To help address this, Greg Kamradt recently put out an analysis called Needle in a Haystack, which attempts to answer the question: how well can these LLMs retrieve specific facts from their context, with respect to things like how long the context is or where the fact is placed within it? He did a pretty influential analysis on GPT-4 and also Claude, which tested, along the x-axis, different context lengths (going from 1k all the way up to 120k in the case of GPT-4) and, on the y-axis, different placements of the fact within the document, either at the start or at the end. He injects this needle into the context at different places, varies the context length, and each time asks a question that you need the fact (the needle) to answer, then scores whether the LLM gets it right or wrong. What he found is that the LLM, at least in this case GPT-4, fails to retrieve facts towards the start of documents in the regime of longer context. That was the punchline, and it's a nice way to characterize retrieval performance with respect to these two important parameters.

But for RAG you really want to retrieve multiple facts. This was testing single-fact retrieval, whereas typically for RAG systems you're chunking a document, retrieving some number of chunks (maybe three to five), and reasoning over disparate parts of a document using similarity search and chunk retrieval.

02:32

So to map the idea of RAG onto this approach, you really need to be able to retrieve various facts from a context, maybe three needles, five needles, ten needles. Google recently reported 100-needle retrieval: what they show is the ability to retrieve 100 unique needles in a single turn with Gemini 1.5. They test a large number of points, varying the context length on the x-axis and showing the recall, the number of needles returned, on the y-axis. This kind of analysis is really interesting because if we're really talking about removing RAG from our workflows and relying strictly on context stuffing, we need to know what the retrieval recall is with respect to the context length or the number of needles. Does this really work well? What are the pitfalls?

So I recently added the ability to do multi-needle retrieval and evaluation to Greg's open-source repo. Greg open-sourced this whole analysis, and what I did is add the ability to inject multiple needles into a context and then characterize and evaluate the performance. The flow is laid out here and it's pretty simple; all you actually need is three things: a question, your needles, and an answer.

A fun toy question, derived from an interesting Claude 3 needle-in-a-haystack analysis, is related to pizza ingredients. They reported some funny results with Claude trying to find a needle about pizza ingredients in a sea of other context, and Claude 3 seemed to recognize it was being tested; it was a funny tweet that went around. But it's actually a fun challenge. We can take the question, "What are the secret ingredients needed to build the perfect pizza?", and the answer, "The secret ingredients are figs, prosciutto, and goat cheese," and parse that out into three separate needles: the first needle is that figs are a secret ingredient, the second is prosciutto, and the third is goat cheese. Now, the way this analysis works is that we take those needles and partition them into the context at different locations: we pick the placement of the first needle, and the other two are then allocated into roughly equal spacing depending on how much context is left after you place the first one. That's kind of the way it works.
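
As a rough illustration of that placement logic, here is a minimal sketch: the first needle goes at a chosen depth and the remaining needles are spread evenly through the rest of the context. This is not the repo's actual implementation; the function and variable names are made up for illustration.

```python
def needle_insertion_points(context_len: int, num_needles: int, first_depth_pct: float) -> list[int]:
    """Sketch of the placement idea described above: put the first needle at
    `first_depth_pct` percent of the context, then space the remaining needles
    evenly through whatever context is left after that point."""
    first = int(context_len * first_depth_pct / 100)
    if num_needles == 1:
        return [first]
    remaining = context_len - first
    step = remaining // num_needles  # rough equal spacing over the remainder
    return [first + i * step for i in range(num_needles)]

# e.g. three pizza needles in a 120k-token context, first needle at 5% depth
print(needle_insertion_points(120_000, 3, 5))  # [6000, 44000, 82000]
```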

05:09

You then pass that context to an LLM along with the question, have the LLM answer the question given the context, and then we evaluate. In this toy example the answer only contains figs, so the evaluation returns a score of one (for figs) and also tells us which needle was retrieved and where it sat in the context. That's the overall flow, and that's really what goes on when you run this analysis using the pieces I added to Greg's repo.

play05:42

thing I add to the repo is the ability

play05:44

to use Langs Smith as your evaluator and

play05:47

this is a lot of nice properties it

play05:49

allows you to log all of your runs it

play05:51

orchestrates the evaluation for you and

play05:52

it's really good for auditing which we

play05:54

can see here in a minute so I'm going to

play05:57

go ahead and and create uh a Lang Smith

play06:00

data set to show how this works um and

play06:03

all you need is so here's a notebook

play06:06

I've just set uh a few different Secrets

play06:09

um or uh environment variables um for

play06:12

like lenss withth API key length chain

play06:14

endpoint and tracing B2 so basically set

play06:17

these and these are done in your lsmith

play06:19

setup and once you've done this you're
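
For reference, a minimal sketch of that environment setup in Python. The variable names below are the standard LangSmith environment variables at the time of the video; the API key value is a placeholder.

```python
import os

# Enable LangSmith tracing and point the client at the LangSmith endpoint.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"  # placeholder
```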

06:21

If you then go over to your LangSmith project page, it'll look something like this: on the left you have Projects, Annotation Queues, Deployments, and Datasets & Testing. We're going to go to Datasets & Testing and create a new dataset that contains our question and our answer. We'll call it "multi-needle testing", create it as a key-value dataset, and hit create. Now you can see we've created a new dataset; it's empty, with no tests and no examples. Next we say "add example" and copy over our question and our answer. So again, our question is "What are the secret ingredients needed to build the perfect pizza?" and our answer contains those secret ingredients. We submit that, and now we have an example. This is basically a dataset with one question-answer pair and no tests yet. That's step one.

play07:37

of step

play07:38

one now all we need to do simply is I've

play07:42

already done this so um if we go up here

play07:45

so this is Greg's

play07:47

repo um you need to clone this you need

play07:50

to just follow these commands to clone

play07:52

setup create a virtual environment pip

play07:54

install a few things and then you're set

play07:55

to go so basically all we've done is

play08:00

set up lsmith set these environment

play08:02

variables clone the repo that's it

play08:06

create a data set we're ready to

play08:09

go so if we go

play08:13

back this is the command that we can use

play08:16

to run our valuation and there a few

play08:18

pieces so we use Langs Smith our

play08:21

evaluator we set some number of context

play08:24

length intervals to test so in this case

play08:26

I'm going to run three tests and we can

play08:29

set our context lengths minimum and our

play08:31

maximum so I'm going to say I'm want to

play08:33

go from a th000 tokens all the way up to

play08:36

120,000

play08:37

tokens um and you know three intervals

play08:41

so we'll do basically one data point in

play08:43

the middle um I'm going to set this

play08:47

document depth percent Min is the

play08:49

initial point at which I insert the

play08:51

first needle and then the other to will

play08:53

be set accordingly in equal spacing in

play08:55

the remaining context that's kind of all

play08:58

that you need there we'll use open AI

play09:01

I'll set the model I want to test I'm G

play09:03

to basically flag multi needle

play09:05

evaluation to be true and here's where

play09:07

I'm basically going to point to this

play09:09

data set that we just created this

play09:10

multin needle test data set I specified

play09:12

here in eval set and the final thing is

play09:15

I just say here's the three needles I

play09:16

want to inject into the context and you

play09:19

can see that Maps exactly to what we had

play09:20

here we had our question our answer

play09:23

those are in Langs Smith and our needles

play09:26

we just pass in so that's really it we
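
To make the pieces of that command concrete, here is the same configuration written out as a plain Python dictionary. This is only an illustration of the parameters Lance describes; the actual flag names in Greg's repo may differ, and the needle wording below is paraphrased from the video.

```python
# Illustrative configuration only -- parameter names are assumptions, not the repo's real CLI flags.
multi_needle_config = {
    "evaluator": "langsmith",
    "context_lengths_num_intervals": 3,   # run three tests
    "context_lengths_min": 1000,          # from 1,000 tokens ...
    "context_lengths_max": 120_000,       # ... up to 120,000 tokens
    "document_depth_percent_min": 5,      # depth of the first needle (example value)
    "provider": "openai",
    "model_name": "gpt-4",                # the model under test
    "multi_needle": True,
    "eval_set": "multi-needle-testing",   # the LangSmith dataset created above
    "needles": [
        "Figs are one of the secret ingredients needed to build the perfect pizza.",
        "Prosciutto is one of the secret ingredients needed to build the perfect pizza.",
        "Goat cheese is one of the secret ingredients needed to build the perfect pizza.",
    ],
}
```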

09:29

We can take this command, and here we go: I'm in my fork of Greg's repo right now, and I'm just going to run it. We should see this kick off. It has some nice logging, so it shows our experiment, what we're testing, the needles we're injecting, and where we're inserting them. This is our first experiment with 1,000 tokens, and it just rolls through these parameters, through the experiments we laid out.

If we go over to LangSmith, we'll start to see experiments, or tests, roll in, which is pretty nice. We can select different settings to look at; if you want to look at needles retrieved, for example, you can see those scores here. The final one's still running, which is fine, but it's all here for us. Let's refresh... okay, now we have them all. This is pretty cool: you can see how much it cost, so again, be careful here. At 120,000 tokens it cost about a dollar, so it's expensive, but smaller contexts are very cheap since you're priced per token. You can see the latencies, the p99 and p50 latency, creation time, and runs, and what's kind of cool is that everything about the experiment is logged: the needles retrieved, the context length, the first needle depth percentage, the insertion percentages, the model name, the needles, and the total number of needles. It's all there for you.

Now I'll show you something pretty cool: you can click on one of these and it actually opens up the run. Here you can see the input, the reference output (the correct answer), and what the LLM actually said. In this case it says the secret ingredients needed to build a perfect pizza include prosciutto, goat cheese, and figs, so it gets it right, and you can see it scores a three, which is exactly right. Let's look at another one: this is our 60,000-token experiment, the one in the middle, and in this case it says the secret ingredients needed to build the perfect pizza are goat cheese and prosciutto, so it's missing figs. That's interesting, and we can see it correctly scores it as two.

We can go even further and open the run itself: here is the trace, everything that happened. We can look at our prompt, and this is the entire prompt that went into the LLM, and here is the answer. What I like about this is that if we want to verify that all the needles were actually present, we can just search for "secret ingredient". You can see that figs are listed as one of the secret ingredients, so it's definitely in the context; it's just not in the answer. We can keep searching for prosciutto and goat cheese, and we can see that all three secret ingredients are in the context and two are in the generation. It's a nice sanity check that the needles are actually there and where you expect them to be. I find that very useful for auditing, for making sure the needles are placed correctly, and for convincing yourself that everything's working as expected.

13:05

What's pretty cool then is that once I've done this, I can actually do a bit more. I'm going to copy over some code (this will all be made public) and paste it into my notebook. What this does is grab my datasets: I just pass in my eval set name, and it pulls in the data from the runs we just did. You can see it loads this as a pandas DataFrame; it has all the run information we want, it has the needles, and what's really nice is that it also has the run URL. This is now a public link that can be shared with anyone, so anyone can go and audit the run and convince themselves that it was doing the right thing.
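
A minimal sketch of pulling run data into a DataFrame with the LangSmith client. This is not the exact notebook code from the video; the project name and the choice of fields are assumptions, and the run object's attributes may differ slightly across SDK versions.

```python
import pandas as pd
from langsmith import Client

client = Client()

# Collect a few fields from each run in a test project into a DataFrame.
rows = []
for run in client.list_runs(project_name="multi-needle-testing"):  # hypothetical project name
    rows.append(
        {
            "name": run.name,
            "inputs": run.inputs,
            "outputs": run.outputs,
            "url": run.url,  # shareable link for auditing the run
        }
    )

df = pd.DataFrame(rows)
print(df.head())
```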

14:05

With all this we can actually do quite a lot, and we've done a more involved study that I want to talk about a little bit. With this tooling in place we're able to ask some interesting questions. We focused on GPT-4 to start. There's a blog post that will go out tomorrow along with this video; it talks through usage and workflow, so I don't want to cover that too much, but here's the crux of the analysis we did. We set up three eval sets, for one, three, and ten needles, and we ran a bunch of trials. We burned a fair number of tokens on this analysis, but it was worth it because it's pretty interesting. We tested the ability of GPT-4 to retrieve needles with respect to both the number of needles and the context length.

What we're doing is asking GPT-4 to retrieve either one, three, or ten needles (again, these are pizza ingredients) in a single turn. What you can see is that with one needle, whether it's 1,000 tokens or 120,000 tokens, retrieval is quite good: if you place a single needle, it's not a problem. I should note that this needle is placed in the middle of the context, which may be relevant, but at least for this study the single-needle case is right in the middle, and the model gets it every time regardless of the context. Here's where things get a bit interesting: if we bump up the number of needles, for example to three or ten, you can see the performance start to degrade quite a lot as we bump up the context. Retrieving three needles out of 120,000 tokens, performance drops to around 80% of needles retrieved, and it drops to more like 60% if you go to ten needles. So there's an effect: as you add more needles in the long-context regime, you miss more and more of them. That's kind of interesting.

play16:17

miss more and more of them that's kind

play16:19

of interesting so because we have all

play16:22

the logging for every insertion point of

play16:24

each needle we can also ask like where

play16:27

are these failures actually occurring

play16:30

and this this actually shows it right

play16:33

here so what's kind of an interesting

play16:36

result is that we

play16:39

found here is an example of retrieving

play16:42

10

play16:43

needles um at different context sizes so

play16:47

on the X here you can see the context

play16:48

length going from a th000 tokens up to

play16:51

120,000 tokens and on the Y here you can

play16:55

see each of our needles so this is a

play16:58

star of the document up at the top so

play17:00

needle one 2 down to 10 are inserted in

play17:03

our document at different locations and

play17:06

each one of these is like one experiment

play17:09

so at a th000 tokens you ask the LM to

play17:13

retrieve all 10 needles and it gets them

play17:15

all you can see this green means 100%

play17:18

retrieval and what happens is as you

play17:21

increase the context window you start to

play17:23

see a real degradation in

play17:26

performance now that shouldn't be

play17:28

surprising we already know that from up

play17:30

here that's the case right getting 10

play17:33

needles from

play17:35

120,000 is a lot worse than getting than

play17:38

getting from a th a th000 you get them

play17:40

every time 120,000 it's 60% but what's

play17:43

interesting is this heat map tells you

play17:46

where they're failing and there's a real

play17:47

pattern there the degradation and

play17:50

performance is actually due to Needles

play17:51

Place towards the the top of the

play17:54

document um so that's an interesting

play17:57

result it appears that

play17:59

um needles placed earlier in the context

play18:03

uh have a lower chance of being

play18:06

remembered or retrieved um now this is a

play18:10

result that Greg also saw in the single

play18:11

needle case and it appears to carry over

play18:14

to the multi- needle case and it

play18:16

actually starts a bit earlier in terms

play18:18

of contact sizes so you can think about

play18:20

it like this if you have a document and

play18:23

you have like three different facts you

play18:25

want to retrieve to answer a question

play18:28

you're more likely to get the facts

play18:30

retrieved from the latter half of the

play18:31

document than the first so it can kind

play18:33

of like forget about the first part of

play18:36

the document or the fact in the first

play18:37

part of the document which is very

play18:39

important because of course valuable

play18:41

information is present in the earlier

play18:42

stages of the document uh but we may or

play18:45

may not be able to retrieve it the

play18:47

likely of retrieval drops a lot as you

play18:49

move into this um kind of earlier part

play18:52

of the document so so anyway that's an

play18:54

interesting result it kind of follows

play18:56

what Greg

play18:57

reported um for the the single needle

19:00

Now, a final thing we show is that often you don't just want to retrieve, you also want to do some reasoning. So we built a second set of eval challenges that ask for the first letter of every ingredient: you have to retrieve and also reason in order to return the first letter of every secret ingredient. What we found, with green being reasoning and red being retrieval, all done at 120,000 tokens, is that as you bump up the number of needles both get worse, as you would expect, and reasoning lags retrieval a bit, which is also what you'd expect: retrieval sets an upper bound on your ability to reason. So it's important to recognize, one, that retrieval is not guaranteed, and two, that reasoning may degrade a little relative to retrieval. Those are the two points that are really important to recognize.
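
To make the retrieval-versus-reasoning distinction concrete, here is a minimal sketch of how the two scores could be computed for the first-letter task. This is illustrative only, not the eval code used in the study.

```python
import re

def retrieval_score(answer: str, ingredients: list[str]) -> float:
    """Fraction of secret ingredients mentioned verbatim in the answer."""
    hits = sum(1 for ing in ingredients if ing.lower() in answer.lower())
    return hits / len(ingredients)

def reasoning_score(answer: str, ingredients: list[str]) -> float:
    """Fraction of correct first letters that appear as standalone letters in the answer."""
    wanted = {ing[0].lower() for ing in ingredients}
    standalone = set(re.findall(r"\b([a-z])\b", answer.lower()))
    return len(wanted & standalone) / len(wanted)

ingredients = ["figs", "prosciutto", "goat cheese"]
answer = "The ingredients are figs and prosciutto, so the first letters are f and p."
print(retrieval_score(answer, ingredients))  # ~0.67: goat cheese was not retrieved
print(reasoning_score(answer, ingredients))  # ~0.67: the letter g is missing
```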

19:59

Maybe I'll just underscore some of the main observations here, because as we think more about long context and about replacing RAG in certain use cases, it's very important to understand the limitations of long-context retrieval, and multi-needle retrieval from long context in particular. One: there are no retrieval guarantees. Multiple facts are not guaranteed to be retrieved, especially as the number of needles and the context size increase; we saw that pretty clearly. Two: there can be different patterns of retrieval failure; GPT-4 tends to fail to retrieve needles placed towards the start of the document as the context length increases. Three: prompting may definitely matter here. I don't presume to say we have the ideal prompt; I was using the prompt Greg already had in the repo, and it may absolutely be the case that improved prompting is necessary; there has been some evidence that you need that for Claude, so that's very valid to consider. And finally, retrieval and reasoning are both important tasks, and retrieval may set a bound on your ability to reason. If you have a challenge that requires reasoning on top of retrieval, you're stacking two problems on one another, and your ability to reason may be governed or limited by your ability to retrieve the facts. That's pretty intuitive, but it's a good thing to underscore: reasoning and retrieval are independent problems, and retrieval is a precondition for reasoning well. Those are the main things I want to leave you with.

I'll also add a quick note. If we go back to our dataset, we can see how much this actually cost; I know people are worried about that. Just the three tests that we did come to around $2, and again, it's really the long-context tests that are quite costly. LangSmith shows you the cost, so you can track it pretty carefully. The nice thing is that for a well-designed study you could spend maybe $10 and look at quite a number of trials, especially if you care more about the middle of the context-length regime; you can do quite a number of tests within a reasonable budget. So you don't necessarily have to break the bank to do some interesting work here. The things we did were pretty costly, but that was mostly because we generated a large number of replicates just to validate the results. If you're doing it for yourself, you could do a single pass, and that's actually not that many measurements; here it would be something like six measurements if you don't do any replicates, so it's not too costly to produce some preliminary results using this approach.

Everything's open source, and all the data is open source as well; it's all checked into Greg's repo, and you can find it in the viz section along with the multi-needle datasets. So I encourage you to play with this. Hopefully it's useful. I think long-context LLMs are super promising and interesting, and analysis like this will hopefully make them at least a little more understandable and build some intuition as to whether or not you can use them to actually replace RAG. Thanks very much.


Related tags
Multi-needle · LLM analysis · retrieval ability · long context · GPT-4 · Gemini 1.5 · Claude 3 · data extraction · limits of retrieval · theoretical validation · open source