Is RAG Really Dead? Testing Multi Fact Retrieval & Reasoning in GPT4-128k

LangChain
13 Mar 2024 · 23:18

Summary

TLDR: Lance from LangChain shares an analysis he has been working on recently called multi-needle in a haystack, which explores how well large language models (LLMs) can retrieve specific facts from long contexts. He notes that Gemini 1.5 and Claude 3 have reported context lengths of up to a million tokens, which has sparked debate over whether LLMs can fully replace external retrieval systems (RAG). Greg Kamradt's needle-in-a-haystack analysis tested how well GPT-4 and Claude retrieve a fact across different context lengths and placements within a document. Lance extended this work with multi-needle retrieval and evaluation, and his experiments show that as context length grows, LLMs become worse at retrieving information placed toward the start of the document. He also highlights LangSmith as an evaluation tool, including its logging of every run and its usefulness for auditing. Finally, Lance discusses the limitations of using LLMs for combined retrieval and reasoning, and stresses that retrieval and reasoning are separate problems.

Takeaways

  • 📈 **Multi-needle retrieval analysis**: Lance discusses an analysis of large language models (LLMs) called multi-needle retrieval, aimed at understanding how well models can pull specific pieces of information out of long contexts.
  • 🔍 **Effect of context length**: As context lengths grow, such as the million-token contexts reported for Gemini 1.5 and Claude 3, LLM performance degrades when retrieving facts placed near the start of the document.
  • 📚 **Importance of document position**: In long contexts, LLMs are more likely to retrieve information from the latter half of a document than from the first half.
  • 💡 **Challenges of multi-needle retrieval**: When multiple facts (needles) must be retrieved from the context, as in Google's reported 100-needle retrieval, performance drops as context length and the number of needles increase.
  • 📊 **Experiment design**: Lance injects multiple "needles" (key facts) at different positions and varies the context length to evaluate retrieval performance under different conditions (see the sketch after this list).
  • 🛠️ **Tools and resources**: Lance explains how to use Greg's open-source repo and LangSmith to set up experiments, run tests, and record results.
  • 📝 **Auditing and verification**: With LangSmith, every step of an experiment is logged in detail, including the context, question, answer, and the LLM's generation, which makes auditing and verification straightforward.
  • 📉 **Degradation pattern**: As context length increases, retrieval performance drops, and the failures concentrate on needles placed near the start of the document, a clear pattern.
  • 🧠 **Reasoning depends on retrieval**: Reasoning tasks build on successful retrieval; if retrieval is poor, reasoning performance suffers too.
  • 💰 **Cost considerations**: Long-context tests are expensive, but with a well-designed experiment, meaningful studies can be run on a reasonable budget.
  • 📝 **Data sharing**: All data and tooling are open source and available in Greg's repo, so other researchers can reproduce and verify the results.
  • ⚠️ **LLM limitations**: Lance stresses that before replacing traditional retrieval systems (RAG) with an LLM, the limitations and challenges of long-context retrieval need to be understood.
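
The placement logic described above can be sketched in a few lines of Python. This is a minimal illustration of the spacing rule described in the video (first needle at a chosen depth, the rest spread evenly through the remaining context), not the repo's actual implementation.

```python
def needle_depth_percents(num_needles: int, depth_percent_min: float) -> list[float]:
    """Place the first needle at depth_percent_min, then space the rest
    evenly through the remaining portion of the context (0-100 scale).
    Illustrative sketch only; the open-source repo may differ in detail."""
    if num_needles == 1:
        return [depth_percent_min]
    remaining = 100.0 - depth_percent_min
    step = remaining / num_needles
    return [depth_percent_min + i * step for i in range(num_needles)]

# Example: 3 needles with the first inserted at 5% document depth
print(needle_depth_percents(3, 5.0))  # roughly [5.0, 36.7, 68.3]
```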

Q & A

  • What is the analysis project Lance is discussing called?

    - The analysis Lance is discussing is called "multi-needle in a haystack".

  • Why did Gemini 1.5 and Claude 3 reporting context lengths of up to a million tokens raise so many questions?

    - Because a million-token context can hold hundreds or even thousands of pages, it raises the question of whether large language models (LLMs) can completely replace traditional retrieval systems such as RAG.

  • What question does Greg Kamradt's "needle in a haystack" analysis try to answer?

    - It tries to answer how well LLMs can retrieve a specific fact from their context as the context length and the position of the fact within the document vary.

  • In multi-needle retrieval and evaluation, what are the "needles"?

    - The "needles" are the specific pieces of information or facts that must be retrieved from the context.

  • What capability did Google's recently reported 100-needle retrieval demonstrate?

    - Google showed the ability to retrieve 100 unique needles, i.e., 100 distinct facts or pieces of information, from the context in a single turn.

  • What functionality did Lance add to Greg's open-source repo?

    - Lance added multi-needle retrieval and evaluation, allowing multiple needles to be injected into the context and the retrieval performance to be evaluated.

  • What advantages does LangSmith offer as an evaluation tool?

    - LangSmith logs all runs, orchestrates the evaluation for you, and is very well suited to auditing.

  • In Lance's analysis, why are needles placed at the start of the document harder to retrieve?

    - Needles near the start of the document may be harder to retrieve because, in long contexts, the model's ability to recall or retrieve information from the early part of the document is weaker.

  • Why does Lance think multi-needle retrieval cannot guarantee that all facts will be retrieved, even from long contexts?

    - Because retrieval performance drops as context length and the number of needles grow, and the model may "forget" or fail to retrieve information placed near the start of the document.

  • What method did Lance use to verify that the needles were placed correctly in the context?

    - Lance used LangSmith's detailed trace logging and searched for specific keywords (such as "secret ingredient") to confirm that all the needles were actually present in the context and placed where expected.

  • In Lance's study, why can reasoning performance be limited by retrieval performance?

    - Because in scenarios that require reasoning, all the relevant facts must first be retrieved correctly before any effective reasoning can happen. If retrieval misses some of the necessary information, reasoning performance inevitably suffers.

  • Roughly how much did the analysis cost, and how does Lance suggest testing within a budget?

    - The tests shown cost around $2, with most of the cost coming from the long-context runs. Lance suggests that, for personal research, running a single pass without replicates keeps many tests within a reasonable budget, especially if you focus on the mid-range context lengths.

Outlines

00:00

😀 Introducing the multi-needle retrieval analysis

Lance from LangChain introduces an analysis he has been working on recently called multi-needle in a haystack, which explores how well large language models (LLMs) can retrieve specific facts from large amounts of text. As context lengths grow, with Gemini 1.5 and Claude 3 handling up to a million tokens, people have started to question whether external retrieval systems are still needed. Greg Kamradt's needle-in-a-haystack analysis tries to answer this by testing how retrieval performance varies with context length and with where the fact is placed in the document. Lance extends that work by adding multi-needle retrieval and evaluation, to better mirror the multi-fact retrieval that real RAG applications need.

05:01

📈 Implementing and evaluating multi-needle retrieval

Lance walks through how to implement and evaluate multi-needle retrieval using LangSmith as the evaluator. He shows how to set up a LangSmith project, create a dataset, and add a question-answer pair to it. He then explains how to use Greg's open-source repo to run the experiments, including setting the context lengths, the document depth percentage, and the number of needles to inject. The results are logged in LangSmith, including the needles retrieved, context length, insertion percentages, and model name, which makes auditing and verification easy.
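
A programmatic equivalent of the dataset setup described above might look like the following sketch, using the langsmith Python client. The dataset name and the question/answer strings follow the video; treat the exact calls as an approximation of the UI workflow shown, not a transcription of Lance's code.

```python
from langsmith import Client

client = Client()  # assumes LANGCHAIN_API_KEY is set in the environment

# Create a key-value dataset to hold the question/answer pair
dataset = client.create_dataset(
    dataset_name="multi-needle-test",
    description="Multi-needle retrieval test: secret pizza ingredients",
)

# Add the single question/answer example used in the video
client.create_example(
    inputs={"question": "What are the secret ingredients needed to build the perfect pizza?"},
    outputs={"answer": "The secret ingredients are figs, prosciutto, and goat cheese."},
    dataset_id=dataset.id,
)
```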

10:02

🔍 Analyzing the experimental results

Lance shares the results of the multi-needle retrieval experiments run with GPT-4. The study uses separate eval sets for 1, 3, and 10 needles and tests retrieval at context lengths from 1K up to 120K tokens. The results show that retrieval performance degrades as context length grows, especially as the number of needles increases. Interestingly, the model is more likely to forget or fail to retrieve information placed in the first half of the document. Lance also shows an experiment that requires retrieval plus reasoning, asking the model to return the first letter of each secret ingredient; as the number of needles increases, both retrieval and reasoning performance degrade.
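
The per-needle breakdown described here (retrieval rate by needle position and context length) can be reproduced from logged results with a small pandas aggregation. The column names below are illustrative assumptions about how the per-needle outcomes might be tabulated, not the exact schema produced by the repo.

```python
import pandas as pd

# Assumed per-needle results table: one row per (run, needle) with a 0/1 retrieved flag
df = pd.DataFrame({
    "context_length": [1000, 1000, 120000, 120000],
    "needle_index":   [1,    10,   1,      10],
    "retrieved":      [1,    1,    0,      1],
})

# Rows = needle position in the document, columns = context length,
# values = fraction of trials in which that needle was retrieved
heatmap = df.pivot_table(
    index="needle_index",
    columns="context_length",
    values="retrieved",
    aggfunc="mean",
)
print(heatmap)
```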

15:03

💡 Limitations of long-context retrieval

Lance emphasizes the limitations of long-context retrieval: with multiple needles there are no retrieval guarantees, especially as the number of needles and the context size increase. He also points out that the model tends to fail on needles placed toward the front of the document, which matters because important information often sits early in a document. He notes that prompting may matter, and that retrieval and reasoning are separate problems, with retrieval being a precondition for effective reasoning.

20:05

📚 Open-source tooling and cost considerations

Lance notes that all the tooling and data are open source and can be found in the viz section of Greg's repo. He encourages people to experiment with it, since long-context LLMs are very promising. He also addresses the cost of this kind of analysis: long-context tests are expensive, but a well-designed study can run a good number of trials within a reasonable budget.

Keywords

💡Multi-needle retrieval

Multi-needle retrieval means retrieving multiple specific pieces of information (the "needles") from a single long context. In the video, Lance discusses testing large language models (LLMs) on multi-needle retrieval to evaluate how they perform when handling large amounts of context, for example whether a model can retrieve 10 distinct facts from a 120,000-token context. This is closely tied to both the challenges and the promise of long-context understanding.

💡Context length

Context length is the amount of text the language model considers when processing a request. Lance discusses how context length affects retrieval performance, especially when testing whether a model can retrieve information from texts that run to hundreds of thousands of tokens. Context length is central to understanding how models handle and remember large amounts of information.

💡Gemini 1.5

Gemini 1.5 is a large language model developed by Google. The video mentions that it has been shown performing 100-needle retrieval, i.e., retrieving 100 unique pieces of information in a single turn. This result illustrates how far current language models have come on complex retrieval tasks.

💡Claude 3

Claude 3 is another large language model. The video mentions that it produced some interesting results on a particular retrieval test: when asked to find a needle about pizza ingredients in a haystack of unrelated context, the model appeared to recognize that it was being tested, which sparked some discussion in the community.

💡Needle in a Haystack analysis

The needle-in-a-haystack analysis is a method for evaluating how well a language model retrieves a specific fact. Lance refers to Greg Kamradt's analysis, which tests retrieval by varying the context length and the position of the fact within the document. It is a useful way to characterize model performance under different conditions.

💡LangSmith

LangSmith is a tool for evaluating and auditing language model performance. Lance shows how to use LangSmith to log and analyze the retrieval tests. It lets users create datasets, run tests, and inspect each run in detail, including cost, latency, and retrieval success.

💡Long-context LLMs

Long-context LLMs are language models that can process very large amounts of text. The video discusses their potential on retrieval tasks, particularly as a possible replacement for traditional retrieval systems such as RAG (Retrieval-Augmented Generation). Their retrieval performance is key to judging whether they are practical for real applications.

💡Retrieval failure

Retrieval failure means the language model does not find a required piece of information in the given context. The video shows that as context length and the number of facts to retrieve grow, the failure rate rises, and the model is especially likely to forget or miss information placed in the early part of the document.

💡Reasoning

Reasoning is the ability to draw conclusions on top of retrieved information. Lance discusses the extra challenge of stacking a reasoning task on retrieval, for example asking the model not just to retrieve the pizza ingredients but to return the first letter of each one. Reasoning performance is important for evaluating a model's higher-level abilities.

💡Cost

Cost here refers to the money spent running the retrieval tests. The video points out that cost grows with context length, but that with careful experiment design useful results can be obtained on a reasonable budget, which matters for researchers and developers.

💡Dataset

A dataset is a collection of data used to train or evaluate a language model. In the video, Lance creates a dataset called 'multi needle test' to hold the test question and answer, which is the foundation for the multi-needle retrieval tests. Creating and managing datasets is a key step in developing and evaluating language models.

Highlights

Lance from LangChain introduces an analysis called multi-needle in a haystack, which explores how well LLMs retrieve specific facts across different context lengths and fact placements.

Gemini 1.5 and Claude 3 recently reported context lengths of up to a million tokens, sparking debate about whether external retrieval systems can be replaced entirely.

Greg Kamradt ran an analysis called needle in a haystack, testing how well GPT-4 and Claude retrieve facts at different context lengths and document positions.

The study finds that with longer contexts, LLMs struggle to retrieve facts placed at the start of the document.

For RAG systems, you typically need to retrieve multiple facts and reason over different parts of a document.

Google reported that Gemini 1.5 can retrieve 100 unique needles in a single turn.

Lance added multi-needle retrieval and evaluation to Greg's open-source repo.

Using LangSmith as the evaluator, all runs are logged, making auditing and analysis easy.

With LangSmith you can create a dataset and run experiments, with all information logged and stored.

The results show that multi-needle retrieval performance drops as context length increases, especially for needles placed toward the front of the document.

For long-context retrieval, LLMs tend to forget or fail to retrieve facts in the earlier part of the document.

The experiments also tested reasoning on top of retrieval and found that reasoning accuracy also drops as the number of needles grows.

Retrieval is not guaranteed: as the number of needles and the context size increase, not all facts will necessarily be retrieved.

Improved prompting may be key to better retrieval.

Retrieval and reasoning are separate problems, and retrieval is a precondition for effective reasoning.

Even a well-designed study can be done on a reasonable budget; it is the long-context tests that are costly.

All data and tooling are open source and can be found in Greg's repo for further analysis and use.

Transcripts

00:01

Hi, this is Lance from LangChain. I want to talk about a pretty fun analysis I've been working on for the last few days called multi-needle in a haystack. We're going to talk through these graphs, what they mean, and some of the major insights that came out of this analysis. But first I want to set the stage.

00:19

Context lengths for LLMs have been increasing. Most notably, Gemini 1.5 and Claude 3 have recently reported up to a million-token context lengths, and this has provoked a lot of questions: if you have a million tokens, which is hundreds or maybe thousands of pages, can you replace RAG altogether? Why would you need an external retrieval system if you can plumb huge amounts of context directly into these LLMs? It's a really good question and a really interesting debate. To help address it, Greg Kamradt recently put out an analysis called needle in a haystack, which attempts to answer the question: how well can these LLMs retrieve specific facts from their context, with respect to things like how long the context is or where the fact is placed within it? He did a pretty influential analysis on GPT-4 and also Claude, which tested, along the x-axis, different context lengths, going from 1K all the way up to 120K in the case of GPT-4, and on the y-axis, different placements of the fact within the document, from the start of the doc to the end. He injects the needle into the context at different places, varies the context length, and each time asks a question that needs the fact, or needle, to answer, and scores whether the LLM gets it right or wrong. What he found is that the LLM, at least in this case GPT-4, fails to retrieve facts towards the start of documents in the longer-context regime. That was the punchline, and it's a nice way to characterize retrieval performance with respect to these two important parameters.

02:06

But for RAG you really want to retrieve multiple facts. That analysis tested single-fact retrieval, but typically for RAG systems you're chunking a document, retrieving some number of chunks, maybe three to five, and reasoning over disparate parts of a document using similarity search and chunk retrieval. So to map the idea of RAG onto this approach, you really need to be able to retrieve various facts from a context, maybe three needles, five needles, ten needles. Google recently reported 100-needle retrieval: they show the ability to retrieve 100 unique needles in a single turn with Gemini 1.5. They test a large number of points, varying the context length on the x-axis and showing the recall, the number of needles returned, on the y-axis.

03:04

This kind of analysis is really interesting, because if we're talking about removing RAG from our workflows and relying strictly on context stuffing, we really need to know what the retrieval recall is with respect to the context length or the number of needles. Does this really work well? What are the pitfalls? So I recently added the ability to do multi-needle retrieval and evaluation to Greg's open-source repo. Greg open-sourced this whole analysis, and what I did is add the ability to inject multiple needles into a context and characterize and evaluate the performance. The flow is laid out here and it's pretty simple; all you need is three things: a question, your needles, and an answer.

04:03

A fun toy question, derived from an interesting Claude 3 needle-in-a-haystack analysis, is related to pizza ingredients. They reported some funny results with Claude trying to find a needle about pizza ingredients in a sea of other context, and Claude 3 seemed to recognize it was being tested; it was a funny tweet that went around. But it's actually a fun challenge. We can take the question, "What are the secret ingredients needed to build the perfect pizza?", and the answer, "The secret ingredients are figs, prosciutto, and goat cheese", and parse that into three separate needles: the first needle is figs, the second is prosciutto, the third is goat cheese.

04:45

The way this analysis works is that we take those needles and partition them into the context at different locations. We pick the placement of the first needle, and the other two are then allocated at roughly equal spacing, depending on how much context is left after you place the first one. You then pass that context to an LLM along with the question, have the LLM answer the question given the context, and then we evaluate. In this toy example the answer only contains figs, so the evaluation returns a score of one, for figs, and also tells us which needle was retrieved spatially. That's the overall flow, and that's really what goes on when you run this analysis using the additions I made to Greg's repo.
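
The scoring step described here, checking which of the injected needles show up in the model's answer, can be sketched roughly as follows. This is a simplified keyword check for illustration; the needle strings are paraphrased from the video, and the actual evaluator in the repo may grade answers differently (for example with an LLM grader in LangSmith).

```python
def score_retrieval(answer: str, needles: list[str]) -> tuple[int, list[bool]]:
    """Count how many needles appear in the model's answer.
    Simplified substring check for illustration only."""
    # Compare on a distinctive keyword from each needle, case-insensitively
    hits = [needle.split()[0].lower() in answer.lower() for needle in needles]
    return sum(hits), hits

needles = [
    "Figs are one of the secret ingredients needed to build the perfect pizza.",
    "Prosciutto is one of the secret ingredients needed to build the perfect pizza.",
    "Goat cheese is one of the secret ingredients needed to build the perfect pizza.",
]
answer = "The secret ingredients needed to build the perfect pizza include figs."
print(score_retrieval(answer, needles))  # (1, [True, False, False])
```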

05:38

Now, if we want to set this up, another thing I added to the repo is the ability to use LangSmith as your evaluator. This has a lot of nice properties: it logs all of your runs, it orchestrates the evaluation for you, and it's really good for auditing, which we'll see in a minute. I'm going to create a LangSmith dataset to show how this works. Here's a notebook where I've set a few secrets, or environment variables, for the LangSmith API key, the LangChain endpoint, and tracing v2. Set these, which are part of your LangSmith setup, and once you've done that you're all set to go.
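
For reference, the environment variables mentioned here are the standard LangSmith tracing settings. A minimal setup might look like the sketch below; the endpoint value shown is the usual default and the key is a placeholder, neither is stated in the video.

```python
import os

# LangSmith configuration (values here are placeholders / defaults)
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
```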

06:24

If you then go over to your LangSmith project page, it'll look something like this. On the left you have projects, annotation queues, deployments, and datasets and testing. We're going to go to datasets and testing and create a new dataset that contains our question and our answer; we'll call it multi-needle testing, create it as a key-value dataset, and hit create. Now you can see we've created a new dataset. It's empty, with no tests and no examples, so we'll say "add example" and copy over our question and answer. Again, our question is "What are the secret ingredients needed to build the perfect pizza?" and our answer contains those secret ingredients. We submit that, and now we have a dataset with one example question-answer pair and no tests yet. That's step one.

07:38

Now, all we need to do, and I've already done this, is go to Greg's repo, clone it, and follow the commands to set up: create a virtual environment and pip install a few things, and you're set to go. So all we've done is set up LangSmith, set those environment variables, clone the repo, and create a dataset, and we're ready to go.

08:13

If we go back, this is the command we can use to run our evaluation, and there are a few pieces. We use LangSmith as our evaluator. We set some number of context-length intervals to test; in this case I'm going to run three tests. We set our minimum and maximum context lengths; I'm going to go from 1,000 tokens all the way up to 120,000 tokens, with three intervals, so we'll do basically one data point in the middle. I set the document depth percent min, which is the point at which I insert the first needle, and the others will then be placed with equal spacing in the remaining context. That's about all you need. We'll use OpenAI, I'll set the model I want to test, I'll flag multi-needle evaluation to be true, and I'll point to the dataset we just created, this multi-needle test dataset, specified as the eval set. The final thing is the three needles I want to inject into the context. You can see that maps exactly to what we had before: our question and answer are in LangSmith, and our needles we just pass in. That's really it. We can take this command, go over to my fork of Greg's repo, and just go ahead and run it.
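
The run configuration Lance describes can be summarized as a set of parameters. The sketch below simply groups them into a dict for clarity; the keys are descriptive labels for the settings named in the video, not the exact flag names of the command-line interface in Greg's repo, and the depth and model values are examples rather than the ones used on screen.

```python
# Parameters described in the video for the multi-needle run (labels are illustrative)
run_config = {
    "evaluator": "langsmith",             # use LangSmith to orchestrate and log the eval
    "context_lengths_num_intervals": 3,   # three context-length test points
    "context_lengths_min": 1_000,         # from 1,000 tokens ...
    "context_lengths_max": 120_000,       # ... up to 120,000 tokens
    "document_depth_percent_min": 5,      # depth of the first needle (example value);
                                          # the rest are spaced evenly after it
    "provider": "openai",
    "model_name": "gpt-4",                # the video just says GPT-4
    "multi_needle": True,
    "eval_set": "multi-needle-test",      # the LangSmith dataset created earlier
    "needles": [
        "Figs are one of the secret ingredients needed to build the perfect pizza.",
        "Prosciutto is one of the secret ingredients needed to build the perfect pizza.",
        "Goat cheese is one of the secret ingredients needed to build the perfect pizza.",
    ],
}
```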

09:49

We should see this kick off. It has some nice logging, so it shows the experiment, what we're testing, the needles we're injecting, and where we're inserting them. This is our first experiment with 1,000 tokens, and it rolls through the parameters and experiments as we laid them out. If we go over to LangSmith, we'll now start to see the experiments, or tests, roll in, which is pretty nice. We can select different settings to look at; if we want to look at needles retrieved, for example, we can see those scores here. The final one is still running, which is fine, but it's all here for us; let's refresh, and now we have them all.

10:43

This is pretty cool: you can see how much it cost. Be careful here: at 120,000 tokens it cost about a dollar, so it's expensive, but smaller contexts are very cheap, since you're priced per token. You can see the latencies, the p99 and p50, the creation time, and the runs. What's cool is that everything about the experiment is logged here: the needles retrieved, the context length, the first needle depth percentage, the insertion percentages, the model name, the needles, and the total number of needles. It's all there for you.

11:18

Now I'll show you something pretty cool. You can click on one of these and it actually opens up the run. You can see the input, the reference output, which is the correct answer, and what the LLM actually said. In this case it says the secret ingredients needed to build the perfect pizza include prosciutto, goat cheese, and figs, so it gets it right, and you can see it scores as three, which is exactly right. Let's look at another one, the 60,000-token experiment, the one in the middle. In this case it says the secret ingredients needed to build the perfect pizza are goat cheese and prosciutto, so it's missing figs, which is kind of interesting, and we can see it correctly scores it as two.

12:02

We can go even further and open this run. Over here is the trace, which is everything that happened. We can look at our prompt, and this is the entire prompt that went into the LLM, and here is the answer. What I like about this is that if we want to verify that all the needles were actually present, we can just search for "secret ingredient". You can see "figs are one of the secret ingredients", so it's definitely in the context, even though it's not in the answer. We can keep searching, prosciutto, goat cheese, and see that all three secret ingredients are in the context and two are in the generation. It's a nice sanity check that the needles are actually there and where you expect them to be. I find that very useful for auditing, for making sure the needles are placed correctly, and for convincing yourself that everything's working as expected.
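
The manual check Lance does here, searching the logged prompt for each ingredient, can also be done in code. This is a minimal sketch assuming you have the full prompt text and the model's generation as strings; it mirrors the keyword search shown in the video rather than any function from the repo.

```python
def audit_needles(prompt: str, generation: str, needles: list[str]) -> None:
    """Sanity-check that every needle was actually inserted into the context,
    and report which ones made it into the model's answer."""
    for needle in needles:
        keyword = needle.split()[0].lower()  # e.g. "figs", "prosciutto"
        in_context = keyword in prompt.lower()
        in_answer = keyword in generation.lower()
        print(f"{keyword:12s} in context: {in_context}  in generation: {in_answer}")
```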

13:09

What's pretty cool, then, is that once I've done this I can actually do a bit more. I'm just going to copy over some code, which will all be made public, and paste it into my notebook. This is basically going to grab my datasets: I pass in my eval set name and it pulls in the data from the runs we just did. It loads this as a pandas DataFrame with all the run information we want, including the needles, and what's really nice is that it also has the run URL. This is a public link that can be shared with anyone, so anyone can go and audit the run and convince themselves that it was doing the right thing.

thing so with all this we can actually

play14:08

do quite a lot um and we've actually

play14:11

done uh kind of a more involved

play14:14

study um that I want to talk about a

play14:17

little bit right now so with this

play14:20

Tooling in place we're able to ask some

play14:22

interesting questions um we focus on gp4

play14:26

to start and we kind of just talk

play14:29

through usage this is a blog post that

play14:30

I'll go out tomorrow along with this

play14:32

video um we talk through usage so I

play14:35

don't want to cover that too much we

play14:36

talk through workflow but here's like

play14:38

Crux of the analysis that we did so we

play14:41

set up three eval sets uh for one three

play14:45

and 10

play14:46

needles um and what we did is we ran a a

play14:51

bunch of Trials we burned a fair number

play14:53

of tokens on this analysis but it was

play14:55

worth it because it was it's it's pretty

play14:56

interesting what we did is test the

play14:59

ability for gbd4 to retrieve needles uh

play15:03

with respect to like the number of

play15:04

needles and also the context length so

play15:08

you can see here is what we're doing is

play15:10

we're asking gbd4 to retrieve either one

play15:12

three or 10 needles again these are

play15:14

Peach ingredients in a single turn and

play15:17

what you can see is with one

play15:20

needle um whether it's a th000 tokens or

play15:24

120,000 tokens retrieval is quite good

play15:29

so if you place a single needle not a

play15:32

problem now I should note that this

play15:34

needle is placed in the middle of the

play15:35

context which may be relevant uh but at

play15:38

least for this study the single needle

play15:40

case is right in the middle and it's

play15:41

able to get that every time regardless

play15:43

of the context but here's where things

play15:45

get a bit

play15:46

interesting if we bump up the number of

play15:49

needles so for example look at three

play15:51

needles or 10 needles you can see the

play15:53

performance start to degrade quite a lot

play15:56

when we bump up the context so

play15:59

retrieving three needles out of 120,000

play16:02

tokens um you see performance drop to

play16:04

like you know around 80% of the needles

play16:07

are retrieved and it drops to more like

play16:09

60% if you go to 10 needles so there's

play16:12

an effect that happens as you add more

play16:14

needles in the long contract regime you

play16:17

miss more and more of them that's kind

play16:19

of interesting so because we have all

play16:22

the logging for every insertion point of

play16:24

each needle we can also ask like where

play16:27

are these failures actually occurring

play16:30

and this this actually shows it right

play16:33

here so what's kind of an interesting

play16:36

result is that we

play16:39

found here is an example of retrieving

play16:42

10

play16:43

needles um at different context sizes so

play16:47

on the X here you can see the context

play16:48

length going from a th000 tokens up to

play16:51

120,000 tokens and on the Y here you can

play16:55

see each of our needles so this is a

play16:58

star of the document up at the top so

play17:00

needle one 2 down to 10 are inserted in

play17:03

our document at different locations and

play17:06

each one of these is like one experiment

play17:09

so at a th000 tokens you ask the LM to

play17:13

retrieve all 10 needles and it gets them

play17:15

all you can see this green means 100%

play17:18

retrieval and what happens is as you

play17:21

increase the context window you start to

play17:23

see a real degradation in

play17:26

performance now that shouldn't be

play17:28

surprising we already know that from up

play17:30

here that's the case right getting 10

play17:33

needles from

play17:35

120,000 is a lot worse than getting than

play17:38

getting from a th a th000 you get them

play17:40

every time 120,000 it's 60% but what's

play17:43

interesting is this heat map tells you

play17:46

where they're failing and there's a real

play17:47

pattern there the degradation and

play17:50

performance is actually due to Needles

play17:51

Place towards the the top of the

play17:54

document um so that's an interesting

play17:57

result it appears that

play17:59

um needles placed earlier in the context

play18:03

uh have a lower chance of being

play18:06

remembered or retrieved um now this is a

play18:10

result that Greg also saw in the single

play18:11

needle case and it appears to carry over

play18:14

to the multi- needle case and it

play18:16

actually starts a bit earlier in terms

play18:18

of contact sizes so you can think about

play18:20

it like this if you have a document and

play18:23

you have like three different facts you

play18:25

want to retrieve to answer a question

play18:28

you're more likely to get the facts

play18:30

retrieved from the latter half of the

play18:31

document than the first so it can kind

play18:33

of like forget about the first part of

play18:36

the document or the fact in the first

play18:37

part of the document which is very

play18:39

important because of course valuable

play18:41

information is present in the earlier

play18:42

stages of the document uh but we may or

play18:45

may not be able to retrieve it the

play18:47

likely of retrieval drops a lot as you

play18:49

move into this um kind of earlier part

play18:52

of the document so so anyway that's an

play18:54

interesting result it kind of follows

play18:56

what Greg

play18:57

reported um for the the single needle

play19:00

case as well um now a final thing we

19:04

A final thing we show is that often you don't just want to retrieve, you also want to do some reasoning. So we built a second set of eval challenges that ask for the first letter of every ingredient: you have to retrieve, and also reason, to return the first letter of every secret ingredient. What we found, with green as reasoning and red as retrieval, all done at 120,000 tokens, is that as you bump up the number of needles both get worse, as you would expect, and reasoning lags retrieval a bit, which is also what you'd expect: retrieval sets an upper bound on your ability to reason. So it's important to recognize that, one, retrieval is not guaranteed, and two, reasoning may degrade a little relative to retrieval.

19:59

Let me underscore some of the main observations here, especially as we think more about long context and about replacing RAG in certain use cases. It's very important to understand the limitations of long-context retrieval, and of multi-needle retrieval from long context in particular. First, there are no retrieval guarantees: multiple facts are not guaranteed to be retrieved, especially as the number of needles and the context size increase; we saw that pretty clearly. Second, there can be distinct patterns of retrieval failure: GPT-4 tends to fail to retrieve needles towards the start of the document as the context length increases. Third, prompting may definitely matter here. I don't presume we have the ideal prompt; I was using the prompt Greg already had in the repo, and it may well be that improved prompting is needed; there has been some evidence that you need that for Claude, so that's very valid to consider. And finally, retrieval and reasoning are both important tasks, and retrieval may set a bound on your ability to reason. If you have a challenge that requires reasoning on top of retrieval, you're stacking two problems on top of one another, and your ability to reason may be governed or limited by your ability to retrieve the facts. That's pretty intuitive, but it's a good thing to underscore: reasoning and retrieval are separate problems, and retrieval is a precondition for reasoning well. Those are the main things I want to leave you with.

have a quick note here that if we go

play21:41

back to our data set um which we can see

play21:43

here like how much did this actually

play21:45

cost I know people are kind of worried

play21:46

about that so just the three tests that

play21:48

we did are around um you know maybe

play21:51

around around $2 so again it's really

play21:54

those long context tests that are quite

play21:56

costly lsmith shows you the cost here so

play21:58

you kind of you can actually track it

play22:00

pretty carefully but the nice thing is

play22:03

for like a kind of well-designed study

play22:05

you could spend maybe $10 and you could

play22:06

actually look at you know quite a number

play22:08

of Trials especially if you care more

play22:10

about like the the middle context length

play22:13

regime uh you can you can do quite a

play22:14

number of tests within a reasonable

play22:16

budget you know so I just I just want to

play22:18

throw that out there that you don't

play22:20

necessarily have to totally break the

play22:21

bank to do some interesting work here um

play22:24

the things we did were pretty costly but

play22:26

that was mostly because we generated a

play22:27

large number of replic just to validate

play22:29

the results but if you're doing it for

play22:31

yourself you could do it just a single

play22:32

pass and this is actually not that many

play22:34

measurements you know so here it would

play22:35

be like 1 2 34 you know six measurements

play22:38

if you don't do any replicates so you

play22:40

know it's not too not too costly to to

play22:43

produce some inry results using this

play22:45

approach uh everything's open source all

play22:47

the data is open source as well um it's

play22:50

all checked into Greg's repo you can

play22:52

find it um in the viz

play22:55

section and uh yep mul data sets it'll

play22:59

all be there so yeah I encourage you to

play23:02

play with this hopefully it's useful and

play23:03

I think long Contex llms are super

play23:05

promising and interesting and Analysis

play23:08

like this hopefully will make them at

play23:09

least a little bit more understandable

play23:10

and build some intuition as to whether

play23:12

or not you can use them to actually

play23:13

replace rag or not so thanks very

play23:16

much


Related Tags
Long-context retrieval, language models, performance analysis, multi-needle retrieval, document understanding, LLM performance, information retrieval, context length, fact placement, dataset testing, intelligent systems