Building long context RAG with RAPTOR from scratch

LangChain
5 Mar 2024 · 21:30

Summary

TLDR: Lance introduces a new retrieval method called RAPTOR. Long context LLMs such as Gemini and Claude are getting a lot of attention, but Lance points out several considerations when using them, in particular cost and latency. RAPTOR offers a lighter-weight, easier-to-use retrieval strategy that still takes advantage of long context. The method clusters documents into a hierarchical document tree and consolidates information through summarization. Lance explains how the approach works and then runs through the process end to end.

Takeaways

  • 🔍 Lance from LangChain introduces a new method called RAPTOR.
  • 📚 Long context LLMs such as Gemini and Claude are getting a lot of attention.
  • 🤖 Lance built a code assistant with a long context LLM and it answered effectively without retrieval.
  • ⏱️ Using long context LLMs requires weighing latency and cost.
  • 🔄 RAPTOR provides a lightweight, easy-to-use retrieval strategy that aims to overcome those limits while still exploiting long context.
  • 📈 RAPTOR combines documents into higher-level summaries and repeats this recursively.
  • 🌳 RAPTOR clusters documents, summarizes each cluster, and eventually collapses everything into a single cluster.
  • 🔗 RAPTOR indexes the summaries together with the raw documents and retrieves over both.
  • 🔎 RAPTOR is especially useful when information must be integrated from several documents, addressing a limitation of conventional methods such as KNN.
  • 📝 Lance plans to use Anthropic's new model, Claude 3, to generate the document summaries when building RAPTOR.
  • 📚 The RAPTOR code is open source, which makes it easy to experiment with and build on.

Q & A

  • What is the new method called RAPTOR that Lance talks about?

    - RAPTOR is a method for working with long context LLMs more efficiently. It groups documents and summarizes information hierarchically, which makes it possible to handle larger document sets while still retrieving efficiently.

  • What was the LangSmith dashboard Lance used for?

    - The LangSmith dashboard was used for the evaluations. Across 20 questions, it was used to measure the 50th and 99th percentile latency of each generation and to analyze cost.

  • What considerations did Lance point out when using very long context LLMs?

    - Using very long context LLMs can increase latency and cost. There is also the question of what to do when the documents exceed the context window.

  • What is the "document-level indexing" Lance described?

    - Document-level indexing means embedding complete documents directly and retrieving them with a method such as KNN (k-nearest neighbor search). This allows simpler, more efficient retrieval without splitting documents at all.

  • What is the idea behind building a "document tree"?

    - Building a document tree helps when answering a question requires integrating information from multiple documents. By grouping documents hierarchically and summarizing the information at each level, the information needed to answer a given question can be consolidated from several documents.

  • How does the RAPTOR approach group and summarize documents?

    - RAPTOR uses a Gaussian mixture model (GMM) to model the distribution of the documents and automatically determine the optimal number of clusters. It uses UMAP (Uniform Manifold Approximation and Projection), a dimensionality reduction method, to improve the clustering process, and it analyzes the data at two scales, local and global.

  • What are the advantages of the retrieval RAPTOR provides?

    - By indexing both the raw documents and the higher-level summaries, RAPTOR gives flexibility and robustness across different types of questions. Retrieval works well both when a detailed answer is needed and when information has to be integrated from multiple documents.

  • What is Anthropic's new model Claude 3 that Lance used?

    - Claude 3 is a new model from Anthropic with very strong performance. Lance uses it to generate the summaries of individual documents when applying the RAPTOR approach.

  • How was the RAPTOR approach implemented?

    - It was implemented by repeatedly embedding, clustering, and summarizing the documents. The process builds a hierarchical structure over the documents and eventually collapses into a single cluster.

  • What is the "chroma" Lance used?

    - Chroma is the vector store Lance used. It indexes the documents and their summaries to support efficient retrieval.

  • In what situations could the RAPTOR approach be applied?

    - RAPTOR can be applied to large document collections that exceed the context window, and when detailed answers are needed. It is also effective for questions that require integrating information from multiple documents.

Outlines

00:00

🗣️ Lance on LangChain and long context LLMs

Lance introduces a new method called RAPTOR. Long context LLMs such as Gemini and Claude have been getting a lot of attention recently. Lance has used long context LLMs in projects such as a code assistant, stuffing roughly 60,000 tokens of context and generating answers without any retrieval. There are some considerations, though, such as generation time and cost, and the alternative of using a local LLM is also discussed.

05:01

🔍 RAPTOR: a lightweight, easy-to-use retrieval strategy

Lance explains RAPTOR, a lightweight, easy-to-use retrieval strategy meant to be combined with long context LLMs. RAPTOR groups documents and summarizes information hierarchically so that information can be integrated across multiple documents. The approach is more flexible than conventional retrieval such as KNN and can handle questions at different levels of abstraction.

10:02

📊 Token count distribution of the documents

Lance shows the token count distribution of the documents he uses. The pages range from under 2,000 up to about 12,000 tokens, and the approach can handle pages of any size. Lance uses Anthropic's new model to summarize these documents.

15:04

📚 Clustering and summarizing the documents

Lance explains the clustering and summarization process. A GMM (Gaussian mixture model) groups the documents automatically, and UMAP (a dimensionality reduction method) is used to improve the clustering. The process also allows a document to belong to more than one cluster, which makes the grouping more flexible.

20:07

🔎 Retrieval with RAPTOR and its applications

Lance explains how to set up retrieval with RAPTOR and where it applies. All the raw documents and summaries are added to an index, and retrieval runs against Chroma. Because the index contains both raw pages and higher-level summary pages, it has the flexibility to handle questions at different levels of abstraction.

Keywords

💡Raptor

RAPTOR is the new method introduced in this video. Combined with long context LLMs, it enables efficient retrieval. It clusters documents and consolidates information hierarchically so that a single question can draw on information from multiple documents. The video explains in detail how RAPTOR clusters documents and builds this hierarchical structure.

💡Long Context LLMs

Long context LLMs are language models that can handle very large amounts of context, and the video discusses how to make the most of that ability. In projects such as the code assistant, a long context LLM takes the question together with the relevant documents and generates the answer directly.

💡Retrieval

Retrieval is the process of finding and returning relevant information. The video proposes using RAPTOR together with long context LLMs to make retrieval more efficient.

💡GMM (Gaussian Mixture Model)

A Gaussian mixture model (GMM) is a statistical model that represents data as a weighted sum of several Gaussian (normal) distributions. In the video, a GMM is used to cluster the documents, and its ability to determine the optimal number of clusters automatically is highlighted.

💡UMAP (Uniform Manifold Approximation and Projection)

UMAP is an algorithm for dimensionality reduction and visualization that projects data into a lower-dimensional space while preserving its structure. The video explains how UMAP helps the clustering process.

💡KNN (K-Nearest Neighbors)

K-nearest neighbors (KNN) is an algorithm that uses the K closest neighbors of a data point, here for retrieving the most similar documents. The video discusses how KNN is used for document retrieval and what its limitations are.

💡Embedding

Embedding is the process of converting data (for example documents or words) into vectors in a high-dimensional space, so that meaning and relationships can be represented numerically. The video explains how documents are embedded, clustered, and used for retrieval.

💡Summarization

Summarization is the process of condensing a long text while preserving its main points. In the video, documents are clustered and summarized to build a hierarchical structure that enables more efficient retrieval and answer generation.

💡Document Tree

A document tree is a hierarchical structure over documents. In the video it is built by clustering documents and consolidating information level by level, which makes it possible to answer a question using information drawn from multiple documents.

💡Latent Variable

A latent variable is a variable that is not observed directly but influences the other variables in a model. The video touches on how latent variables relate to the clustering and summarization process.

Highlights

Lance from LangChain discusses retrieval and long context LLMs, focusing on a new method called RAPTOR.

Long context LLMs like Gemini and Claude 3 can handle up to a million tokens, raising questions about their efficiency and cost.

Lance has been using long context LLMs for projects, such as a code assistant for LangChain Expression Language (LCEL), without the need for retrieval.

Evaluations showed varying latencies and costs for generations, with P99 latency reaching up to 420 seconds and costs of roughly $1 to $1.30 per generation.

The discussion includes the trade-offs between using long context LLMs and RAG systems, considering factors like latency, cost, and context size.

Lance explores lightweight retrieval strategies compatible with long context models, aiming to address limitations while preserving context utilization.

The Raptor method involves clustering and summarizing documents recursively to create a document tree for efficient retrieval.

Raptor's approach allows for embedding full documents and building a document abstraction tree, providing a more robust retrieval strategy.

The RAPTOR paper and its open-sourced code inspired the LlamaIndex team to release a llama pack implementing it.

Raptor's clustering process uses a Gaussian Mixture Model (GMM) to determine the optimal number of clusters and UMAP for dimensionality reduction.

The method includes local and global clustering and applies thresholding to allow documents to belong to multiple clusters.

Lance demonstrates the Raptor process using Anthropic's new model for summarizing individual documents, regardless of their size.

The Raptor approach can be applied to documents of any size, making it scalable and suitable for various contexts.

The process involves iterative summarization and clustering until a single cluster is achieved or a specified number of levels are reached.

The resulting index from Raptor can be used for retrieval, combining raw documents and higher-level summaries for comprehensive information access.

Lance emphasizes the practicality of Raptor for cases where document sets exceed the context limit of LLMs, providing a solution for large corpora.

The Raptor method integrates well with long context models, offering a balance between detailed and high-level information retrieval.

Transcripts

00:01

Hi, this is Lance from LangChain. I'm going to be talking about retrieval, long context LLMs, and a new method called RAPTOR. Over the last few weeks there's been a lot of talk about whether RAG is dead with the advent of new long context LLMs like Gemini (a million tokens) and Claude 3 (now up to a million tokens). It's an interesting question. I've recently been using long context LLMs for certain projects, for example the code assistant I put out last week, which basically used a long context LLM to answer coding questions about our docs on LangChain Expression Language. I'll zoom in here so you can see it. These are around 60,000 tokens of context: we take the question, we take the docs, and we produce an answer. This is really nice: no retrieval required, just context-stuff all these docs and perform answer generation directly. So I'm a big fan of using long context LLMs in this way.

01:04

But there are some considerations I wanted to point out. I ran evaluations, and for those evaluations I looked at 20 questions, so basically 20 generations. This is the LangSmith dashboard I used for those evals, and you can see something kind of interesting. The p50 latency tells you the 50th percentile latency for each of those generations; again, remember there are 20. It's around 35 to 46 seconds depending on the trial, on the same data set and the same 20 questions, so there's some variance run to run, which is expected. The p99 goes up to around 420 seconds, which is really long in that case. Maybe more interestingly, if you look at the cost, again across 20 questions, it ranges from maybe a dollar to a dollar thirty per generation. So cost and latency are things to think about when you're using really long context LLMs, as opposed to a RAG system where you're performing retrieval of much smaller, more directed chunks relevant to your question.

02:19

The other thing that came up is that a lot of people asked whether you can swap this out and use a local LLM. My go-to local LLM is Mistral 7B v0.2, which actually has a 32,000 token context window, but my docs, at around 60,000 tokens, are still too big for that, so I couldn't just context-stuff them as I did here.

02:51

So these three considerations led me to think: I really like working with long context models, and they're absolutely going to keep coming, but are there retrieval strategies that are lightweight and easy to use with long context models, that preserve the ability to utilize a lot of context but can address some of these limitations? In particular, that last piece was important, because it's something I want to do in the near term: I need a nice, lightweight retrieval strategy that still uses long context but can operate in cases where my documents are a bit bigger than my context window, in this case around 2x.

03:35

So I put this out on Twitter and asked whether anyone had come across good, maybe minimalist, splitting strategies for long context LLMs. I wanted to do RAG with Mistral 7B with a 32,000 token context window, but my docs are 60,000 tokens; I can't just context-stuff them, but I also don't want some very fine-scale chunking thing. We don't want to mess with all that; we want something simple that can just work across larger documents.

04:06

One point that was raised, which is a really good one, is to just index at the document level: take full documents and embed them directly, then do something like KNN on those embedded documents. So again, no chunking or splitting of documents: you have your set of documents, each embedded once, and you just retrieve at the document level. That's a pretty good idea, and it's pretty reasonable.
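
For illustration, here is a minimal sketch of that document-level idea, assuming OpenAI embeddings and a plain cosine-similarity KNN over whole documents; the placeholder docs and the helper function are mine, not from the video:

```python
# Sketch of "index at the document level": embed each full document once,
# then retrieve the top-k most similar documents for a query (cosine-similarity KNN).
import numpy as np
from langchain_openai import OpenAIEmbeddings

docs = [
    "full text of documentation page 1 ...",   # placeholder documents
    "full text of documentation page 2 ...",
    "full text of documentation page 3 ...",
]

embd = OpenAIEmbeddings()
doc_vectors = np.array(embd.embed_documents(docs))   # one vector per whole document, no chunking

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents whose embeddings are closest to the query embedding."""
    q = np.array(embd.embed_query(query))
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]

print(retrieve("How do I compose runnables in LCEL?"))
```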

04:38

Another idea that came up, though, is building a document tree. Part of the reason for that is that with KNN, or k-nearest-neighbor retrieval, on a set of embedded documents, it's true that sometimes an answer requires two or three different documents to be integrated in order to answer it. If you context-stuff everything, that's not a problem because it's all there. If you're doing retrieval, you're setting your k parameter to some value, and it's kind of brittle: does it need to be four, five, or six to capture all the context needed for certain questions? It's hard to set, so this idea of building a document tree is an interesting way to potentially address that challenge with basic KNN.

05:24

A paper, RAPTOR, came out recently on this exact idea, and their code was recently open sourced, which led the folks at LlamaIndex to come out with a llama pack for it, which is great. The idea is pretty interesting, so I wanted to lay it out here and talk about how it might benefit this exact case of long context retrieval.

05:48

The intuition is pretty simple. First we take a set of documents; note that these documents can be any size. In their case they're just chunks, around 100 tokens, but it doesn't matter. We start with a set of raw documents, we embed them, and then we cluster them. This clustering process groups together similar documents, and then we do one important thing: we summarize the information in each cluster into a more abstract, higher-level summary of that content. And we do that recursively until we end up with one cluster. So what's happening is you start with the set of what they call leaves, the raw documents, you do a grouping via clustering, you do a summarization step, so you're compressing, and then you do it again. The idea is that these mid-level, and eventually root-level or highest-level, summaries can consolidate information from different places in your documents. What they then do is embed those summaries along with the raw leaves and perform retrieval. We'll talk about that a little later, but what they show is that just doing retrieval over all of these together, as a whole pool, performs best, which is a nice result. It's then pretty easy to just index that and use it.

07:19

I will make a note that their paper talked about these leaves being chunks, which I didn't love, because I want to work with long context models and I don't want to deal with chunking at all. I replied to Jerry's tweet on this, and Jerry made a fair point that this can scale to anything: those leaves can be full documents, they don't have to be chunks, which is a completely reasonable point. So you can think about it this way: if idea one was to just take each document and embed it, idea two is to embed each document like we did, and also build a document abstraction tree on top and embed those, so we have these higher-level summaries in our embeddings which we can retrieve from if we need an answer that consolidates information from a small set of documents. It's a bit more robust to the problem that if I'm just doing KNN on raw documents and I need information from two or three documents, I'm not guaranteed to always get it because of the k parameter I set; I'm only retrieving k docs. Here, I'm building documents that contain information from multiple leaves, or multiple sub-documents so to speak, and they can capture that information in a nice way, so the index can integrate information across different individual leaves or individual documents. That's the key point. So when you think about working with long context models, of course context stuffing is a great option if you can do it, but there are some other interesting ideas: one is just embedding full documents, and another is documents plus an abstraction tree.

09:16

So let's go ahead and build RAPTOR, because it's pretty interesting. To do this I'm going to look at Claude 3, which just came out today; it's a new set of models from Anthropic with really strong performance, and it should be really good for this use case, because what I want to do is produce summaries of individual documents, and I don't really want to worry about the size of those documents. I'm going to use the same set of documents that I previously used with the code generation example (that video came out last week). I have an empty notebook here; we just do a few pip installs, I set a few environment variables for LangSmith, and now I'm going to grab my docs, right here.
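
As a rough sketch of the setup this walkthrough assumes; the package list and environment variables below are my guesses at a typical LangChain/LangSmith configuration, not a copy of the notebook:

```python
# Rough setup sketch: install dependencies and set the keys the rest of the walkthrough needs.
# %pip install -U langchain langchain-community langchain-anthropic langchain-openai chromadb umap-learn scikit-learn tiktoken
import getpass
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"                        # enable LangSmith tracing
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangSmith: ")   # LangSmith API key
os.environ["ANTHROPIC_API_KEY"] = getpass.getpass("Anthropic: ")   # for Claude 3 summaries
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI: ")         # for OpenAI embeddings
```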

09:59

This is going to grab around 33 web pages of documentation from LangChain related to LangChain Expression Language. What I'm going to plot here is a histogram of the token counts of every page. A bunch are small, less than 2,000 tokens, which is easy to work with, and a few are pretty big, up to around 12,000 tokens. That gives you a sense of the distribution of pages we want to work with, and we're going to apply this approach to those pages. I'm going to use Anthropic's new model to do that, and I'll use OpenAI embeddings, so I set those up.
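
A sketch of what grabbing the docs and plotting the histogram could look like, assuming LangChain's RecursiveUrlLoader pointed at the LCEL documentation and tiktoken for counting; the URL, crawl depth, and encoding are illustrative assumptions:

```python
import matplotlib.pyplot as plt
import tiktoken
from bs4 import BeautifulSoup
from langchain_community.document_loaders import RecursiveUrlLoader

# Crawl the LCEL documentation pages (URL and depth are illustrative, not the notebook's exact values).
loader = RecursiveUrlLoader(
    url="https://python.langchain.com/docs/expression_language/",
    max_depth=20,
    extractor=lambda x: BeautifulSoup(x, "html.parser").text,
)
docs = loader.load()
docs_texts = [d.page_content for d in docs]

# Histogram of token counts per page, using the cl100k_base encoding as an approximation.
enc = tiktoken.get_encoding("cl100k_base")
counts = [len(enc.encode(t)) for t in docs_texts]
plt.hist(counts, bins=30)
plt.xlabel("Tokens per page")
plt.ylabel("Number of pages")
plt.title("Token counts of the LCEL documentation pages")
plt.show()
```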

10:35

Now, the clustering code was released by the authors of the paper, and I'm going to explain how it works in a bit, but for right now I'm just going to copy it over; it will all be accessible to you in the notebook that we're going to make public. So this is all the clustering code, and we'll talk about what it's doing later; I added comments and docstrings to it so it's a little more understandable. Here's some code that I wrote that basically orchestrates the cluster-and-summarize process and performs it iteratively until you end up with a single cluster. So there we go: I'm going to copy this code over, kick the process off, and then walk through it while it's running.

11:20

So that's running now. First I want to explain how this clustering process works; it's kind of interesting. The idea actually incorporates four important points. It uses a GMM, a Gaussian mixture model, to model the distribution of the different clusters. What's kind of cool about this approach is that you don't tell it to group the data into some fixed number of clusters, as you do with approaches where you set an n_clusters parameter; it can infer or determine the optimal number of clusters, and it uses the BIC for that. You can dig into this in more detail if you want, but the intuition is that this approach will attempt to determine the number of clusters automatically for you, and it's also modeling the distribution of your individual documents across the clusters.

12:20

It uses UMAP, a dimensionality reduction approach, to improve the clustering process; if you want to really read into this you should go do that, but the intuition is that it helps improve clustering. It also does clustering at what they call local and global levels, so it tries to analyze the data at two different scales: patterns within smaller groups and then within the full data set, to improve how the documents get grouped together. And it applies thresholding to assign the cluster assignment for every document. So this is really the idea: here are all my documents; let's look at this one. The GMM assigns a probability that this document belongs to each one of our clusters, so here's cluster one, here's cluster two, here's cluster three, and each cluster gets a probability. Thresholding is then applied to those probabilities, such that a document can actually belong to more than one cluster. That's really nice, because in a lot of other approaches the assignment is mutually exclusive, so a document can only live in one cluster; with this approach it can belong to multiple clusters, which is a nice benefit. I think that's all I want to say initially about this clustering strategy, but you should absolutely have a look at the paper, which we'll also make sure to link.
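
Here is a rough sketch of that clustering recipe: UMAP for dimensionality reduction, a Gaussian mixture model whose cluster count is chosen by minimizing the BIC, and a probability threshold so a document can fall into more than one cluster. It is modeled on the behavior described above, not copied from the authors' code, and the parameter values are illustrative:

```python
import numpy as np
import umap
from sklearn.mixture import GaussianMixture

def reduce_dim(embeddings: np.ndarray, n_components: int = 10, n_neighbors: int = 15) -> np.ndarray:
    """Project high-dimensional embeddings into a smaller space so the GMM has an easier job."""
    return umap.UMAP(
        n_neighbors=n_neighbors, n_components=n_components, metric="cosine"
    ).fit_transform(embeddings)

def pick_n_clusters(x: np.ndarray, max_clusters: int = 10, seed: int = 0) -> int:
    """Choose the cluster count that minimizes the Bayesian Information Criterion (BIC)."""
    max_clusters = min(max_clusters, len(x))
    bics = [
        GaussianMixture(n_components=n, random_state=seed).fit(x).bic(x)
        for n in range(1, max_clusters + 1)
    ]
    return int(np.argmin(bics)) + 1

def soft_cluster(embeddings: np.ndarray, threshold: float = 0.1, seed: int = 0) -> list[np.ndarray]:
    """Assign each document to every cluster whose membership probability exceeds the threshold,
    so one document can belong to more than one cluster."""
    x = reduce_dim(embeddings)
    gmm = GaussianMixture(n_components=pick_n_clusters(x, seed=seed), random_state=seed).fit(x)
    probs = gmm.predict_proba(x)                         # shape: (n_documents, n_clusters)
    return [np.where(p > threshold)[0] for p in probs]   # cluster ids per document

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_embeddings = rng.normal(size=(40, 256))          # stand-in for real document embeddings
    print(soft_cluster(fake_embeddings)[:5])
```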

14:01

So right now, let's actually look at the code and see that it's performing this cluster generation. It really follows what we just talked about. We have a set of texts, and in this case my texts are just those 33 web pages I'm passing in. We can look at that: what I passed in is the leaf texts, which I define as my docs. Let's go back to the diagram so we can follow along: these leaves are my web pages, that's it. So here's my leaf text, and if you look at the length, there are 31 of them, so that's fine. What happens is those first get embedded, as expected, so here are the embeddings, and then they get clustered. The perform_clustering step is taken directly from the code provided by the authors of the paper, so it's doing that clustering and cluster-assignment process I just talked about. We get our cluster labels out and put them in a data frame, and then we have our clusters, which you can see here. Because each document can belong to more than one cluster, we expand out the data frame, so the cluster column may contain duplicates for a single document; one document can live in multiple clusters, and we just flatten it out to show that.

15:55

Then all we do is get the whole list of clusters, define a summarization prompt (pretty simple), and fish out of our data frame all the texts within each cluster. That's all that's happening here: for each cluster, get all the text, plug it into our summarization prompt, and generate the summary. That's really it: iterate through our clusters, get the text for every cluster, summarize it, and write it out to a data frame. Here's our cluster data frame, and here's our summary data frame from that function.
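
A condensed sketch of that cluster-then-summarize step: group texts by cluster label in a DataFrame, concatenate each group, and ask Claude 3 for a summary. The prompt wording, separator, and model name are assumptions rather than the notebook's exact code:

```python
import pandas as pd
from langchain_anthropic import ChatAnthropic
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

model = ChatAnthropic(model="claude-3-opus-20240229", temperature=0)
prompt = ChatPromptTemplate.from_template(
    "Give a detailed, faithful summary of the following documentation:\n\n{context}"
)
summarize = prompt | model | StrOutputParser()

def summarize_clusters(df: pd.DataFrame) -> pd.DataFrame:
    """df has columns 'text' and 'cluster'; return one summary row per cluster."""
    rows = []
    for cluster_id, group in df.groupby("cluster"):
        context = "\n\n--- ---\n\n".join(group["text"])   # concatenate every text in the cluster
        rows.append({"cluster": cluster_id, "summary": summarize.invoke({"context": context})})
    return pd.DataFrame(rows)
```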

16:33

And this is just orchestrating that process iteratively: we keep doing it until either a level (n_levels) parameter is reached, so you can say do this n times, or the number of clusters is equal to one. So basically this says: continue until either we've done n levels in our tree or the number of clusters is one; keep doing that, and that's it.
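
The orchestration itself can be sketched as a small loop that repeats the embed/cluster/summarize step, feeding each level's summaries into the next, until only one cluster remains or a maximum number of levels is reached. The `step` callable below is a hypothetical stand-in for that per-level function (for example, embedding plus `soft_cluster` plus `summarize_clusters` from the sketches above):

```python
from typing import Callable, Dict, List, Tuple

def recursive_cluster_summarize(
    texts: List[str],
    step: Callable[[List[str]], Tuple[List[int], List[str]]],  # returns (cluster label per text, cluster summaries)
    max_levels: int = 3,
) -> Dict[int, List[str]]:
    """Repeat cluster-then-summarize, feeding each level's summaries into the next,
    until everything collapses into one cluster or max_levels is reached."""
    summaries_by_level: Dict[int, List[str]] = {}
    level_texts = texts
    for level in range(1, max_levels + 1):
        labels, summaries = step(level_texts)
        summaries_by_level[level] = summaries
        if len(set(labels)) <= 1:      # already a single cluster: the tree has a root, so stop
            break
        level_texts = summaries        # summaries become the documents for the next level
    return summaries_by_level
```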

17:05

You can see we've actually run that process and we have our results here. Now what we can do is pretty simple: we can put those into an index, for example using Chroma as a vector store. Here's some really simple code to do that, just iterating through our results and getting all our summaries out. Maybe I should make this a little clearer: we take all those raw documents and add them to our text list, then we get all of our summaries from the tree and add those, and we index all of them. So let's do that: these are all going to be added to Chroma, and finally we can set up a retrieval chain that uses this index, which contains both our leaves (all those raw web pages) and these higher-level summary pages. That's all that's happening here: we pull in a RAG prompt, here's our retriever, here's our question, so let's give it a shot.

18:08

While this is running, I want to bring you back to the diagram so you can see what's going on. Again, we took our web pages (31 of them), we clustered them, we summarized them, and we did that iteratively. Then we took the summaries we generated, along with the raw web pages, and indexed all of them; that's it, and we can use that index for retrieval. This is what we might call a nice long context index, because it contains raw web pages, which vary from 2,000 to 12,000 tokens, and it also contains, in our case, these higher-level summaries in case we need an integration of information across those pages, which may or may not be captured just using KNN retrieval. So that's the big idea.
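
A minimal sketch of this final stage: put the raw pages and all levels of summaries into one Chroma index and wire a simple LCEL RAG chain over it. The placeholder texts, prompt wording, and model choice are assumptions for illustration:

```python
from langchain_anthropic import ChatAnthropic
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import OpenAIEmbeddings

leaf_texts = ["raw page 1 ...", "raw page 2 ..."]                            # placeholder raw pages
summaries_by_level = {1: ["level-1 summary ..."], 2: ["root summary ..."]}   # placeholder tree output

# Index the leaves and every level of summaries in the same vector store.
all_texts = leaf_texts + [s for level in summaries_by_level.values() for s in level]
vectorstore = Chroma.from_texts(texts=all_texts, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
)
model = ChatAnthropic(model="claude-3-opus-20240229", temperature=0)

def format_docs(docs):
    return "\n\n".join(d.page_content for d in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

print(rag_chain.invoke("How do I define a RAG chain in LCEL?"))
```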

19:01

OK, this ran and we got our answer. We can check LangSmith, and we can see here's our retriever, and here are the raw documents. It looks like it retrieved some higher-level summaries as well as some raw leaves: this is a raw web page, and some of these are more like summary pages, which it looks like we produced. What's cool about this is that you can retrieve from a combination of your raw pages and these higher-level summaries, which gives you some robustness and coverage for different types of questions that require different resolutions of abstraction or detail. For a really detailed code question you might retrieve directly from your raw pages, but for a higher-level question that integrates information from a bunch of pages you might retrieve from these mid-level or even top-level summaries. So it's a cool approach, and it integrates nicely with long context models.

20:06

I know one thing that will come up is: your full context was only 60,000 tokens, you could just stuff all of that into one of these models and didn't need to do any of this. That is absolutely true for this case, but the high-level point is that it's not true for every case. For example, this exact set of documents is one I want to use with Mistral, which only has 32,000 tokens of context, so this is a really nice approach for that case, where I can guarantee that I can index across all these pages without exceeding, or being likely to exceed, the context limit of my LLM, because none of these individual pages exceeds 32,000 tokens. And again, this scales arbitrarily: it's true that this set of documents is only about 62,000 tokens, but of course there are much larger corpora that could extend beyond even the 200,000 token context of Claude 3, in which case this type of approach of indexing across documents and building these mid-level and high-level summaries can be applicable. So it's a cool method and a neat paper; I definitely encourage you to experiment with it, and all this code will be available for you to work with. I think that's about it. Thanks very much.


Related Tags
Raptor, LLM, retrieval strategy, long context, information integration, RAG, Gemini, Claude, technical walkthrough, AI retrieval efficiency