Going beyond RAG: Extended Mind Transformers - Phoebe Klett

AI Engineer
11 Sept 2024 · 16:04

Summary

TLDR: Phoebe is a machine learning engineer at Normal Computing, and she presents their recent research on Extended Mind Transformers. The talk covers an introduction to the problem, the method, experimental results, and how to tune the key parameters when implementing it. Extended Mind Transformers draw a clear distinction between memories and the inference query, enabling more granular causal citations and a new generation paradigm, inspired by active learning, for when the model is detected to be uncertain. The models require no fine-tuning and are easy to run using the open-source models and code.

Takeaways

  • 🤖 Extended Mind Transformers (EMTs) integrate external memories into the model, delivering better performance on retrieval tasks.
  • 🔍 EMTs work through a simple edit to a standard Transformer, leveraging the retrieval mechanism already built into its attention.
  • 📈 Experiments show EMTs are competitive with models fine-tuned for long contexts.
  • 📝 EMTs enable a new kind of citation based on exactly which information the model used during generation.
  • 🧠 When the model is detected to be uncertain, EMTs can regenerate that step using more information from memory, a technique inspired by active learning.
  • 🛠️ Stride length and top K are the key parameters to tune when implementing EMTs; they control how memories are generated and how many memory tokens each query token retrieves.
  • 🔗 Relative position embeddings are important for assigning accurate position information, and they let the model generalize to retrieved tokens without fine-tuning.
  • 📊 Two regularization techniques, similarity masking and eliminating unknown tokens, are effective at avoiding confusing the model.
  • 💻 The models are available on Hugging Face and the code is published on GitHub, so anyone can easily try EMTs.
  • 📑 The speaker encourages anyone interested in the technical details to read the paper.

Q & A

  • What does Phoebe do?

    -Phoebe is a machine learning engineer at Normal Computing.

  • What are Extended Mind Transformers?

    -Extended Mind Transformers are research that lets Transformer models handle more contextual information through a retrieval mechanism built into the model itself.

  • What problem are Extended Mind Transformers trying to solve?

    -They address the problem of getting the detailed, application- and topic-specific information a language model needs into the model.

  • What retrieval mechanism do Extended Mind Transformers implement?

    -Within each decoder layer the model represents data as key-value pairs; query tokens retrieve memory tokens based on those key-value pairs and attend to them.

  • What new kind of citation do Extended Mind Transformers provide?

    -Because the model can identify the specific tokens it used during generation, it can cite exactly which source information produced a given output.

  • What is the active-learning-inspired generation paradigm that Extended Mind Transformers provide?

    -When the model is uncertain about a generated token, it regenerates that step using more information from memory; this new generation paradigm is inspired by active learning.

  • What are the important parameters to tune when implementing Extended Mind Transformers?

    -The key parameters are stride length and the number of retrieved tokens (top K). Stride length determines how much context is used when generating the memories, and top K determines how many memories each query token can retrieve and attend to.

  • What regularization techniques do Extended Mind Transformers implement?

    -Two regularization techniques: similarity masking and eliminating unknown tokens. These reduce the amount of confusing information the model sees and improve accuracy.

  • Why are Extended Mind Transformers provided as open-source models?

    -So that researchers and developers can easily access them, experiment with them, and improve the models.

  • Why don't Extended Mind Transformers require fine-tuning?

    -Because the model is designed to handle retrieved tokens internally, which avoids the degradation in attention quality that comes from fine-tuning on long contexts.

Outlines

00:00

🤖 Introduction to Extended Mind Transformers

Phoebe describes the machine learning research she works on at Normal Computing, in particular Extended Mind Transformers (EMTs), and outlines the problem they address: after pre-training gives a model general knowledge, we still need to get application-specific information into it. Prior approaches such as long context and RAG each come with their own downsides; EMTs offer a new approach that addresses those problems.

05:01

🔍 Evaluating Extended Mind Transformers

To evaluate Extended Mind Transformers, the team released a new counterfactual retrieval benchmark. It controls for facts memorized during pre-training, and even during fine-tuning, to test whether the model relies on the information provided at inference time. EMTs remain competitive with fine-tuned models and the base Llama model even at very long contexts. Because EMTs can pinpoint exactly which memory tokens were used during generation, they also enable a new kind of citation.

10:01

🧠 Reducing hallucinations with Extended Mind Transformers

Extended Mind Transformers reduce hallucinations by detecting uncertainty in the tokens the model generates. When the model is detected to be uncertain about a particular token, that step can be regenerated using more information from memory, which raises the chance of a correct answer and yields efficient, reliable outputs.

15:03

🛠 Parameter settings for Extended Mind Transformers

The main parameters for using Extended Mind Transformers: stride length ensures that each token is generated with an appropriate amount of context when the memories are created, and top K is the number of memories each query token retrieves and attends to. The longer the memory, the more information is retrieved, but too much retrieval risks confusing the model. Two regularization techniques, similarity masking and eliminating unknown tokens, help the model retrieve the right information and perform well without getting confused.

📈 Summary of Extended Mind Transformers

Extended Mind Transformers perform well on retrieval tasks and enable a new kind of citation as well as a hallucination-reduction technique. They require no fine-tuning and are easy to use via the open-source models and code, making them a valuable tool for AI engineers who need reliability and efficiency.

Keywords

💡Machine learning engineer

A machine learning engineer is a specialist who builds software and systems using artificial intelligence and machine learning techniques. In the video, Phoebe, a machine learning engineer, presents the Extended Mind Transformers research. This keyword provides the background needed to understand the video's theme and content.

💡Extended Mind Transformers

Extended Mind Transformers are the central research topic of the video: a technique that adds external memories to a Transformer model so it can use more information. The video explains how the technique solves the problem and enables new kinds of citations and a new generation paradigm.

💡Retrieval mechanism

The retrieval mechanism is the process by which the model fetches the information it needs from an external source. Extended Mind Transformers perform retrieval using the key-value pairs already built into the model to fetch and use data. The video discusses how this mechanism affects model performance.

💡Token

Tokens are the smallest units that text is broken into in natural language processing. The video explains how the model represents data internally as key-value pairs and retrieves information from memory token by token. Tokens are a core concept for understanding how Extended Mind Transformers work and perform.

💡Entropy

Entropy measures the uncertainty of the model's output distribution. The video explains how token-level entropy is used to decide that the model is uncertain about a particular token, in which case the step is regenerated using more information. This is part of the technique for reducing hallucinations.

💡Regularization techniques

Regularization techniques here are methods that keep retrieval from overwhelming the model with confusing information. The video introduces the two used in Extended Mind Transformers, similarity masking and eliminating unknown tokens, which are important for improving the model's performance and accuracy.

💡Stride length

Stride length is a parameter used when generating the memories: it controls how much context each token sees when its representation is computed. The video explains how stride length affects memory quality and compute cost, making it an important parameter when implementing Extended Mind Transformers.

💡top K

top K is the number of key-value pairs each query token is allowed to retrieve from memory and attend to. The video explains that the longer the memory, the more information can be pulled in, and that this affects model performance. It is one of the important parameters of Extended Mind Transformers.

💡Hallucination

A hallucination is when a model generates information or facts that are not grounded in what it has been given. The video discusses how Extended Mind Transformers reduce hallucinations by detecting uncertainty and using more information from memory. This is an important technique for making AI more trustworthy.

💡Active learning

Active learning is a technique in which a model focuses its learning on the points it is uncertain about. The video describes regenerating a step with more memory information, when the model is detected to be uncertain during generation, as a technique inspired by active learning.

Highlights

Introduces Extended Mind Transformers and the recent research behind them

Explains the retrieval mechanism that Extended Mind Transformers implement

Discusses the limitations of pre-trained language models and the need for application-specific information

Contrasts the long-context and RAG approaches and their respective downsides

Presents extended mind attention, a simple modification to the Transformer model

Explains how relative position embeddings make it possible to handle retrieved tokens without additional fine-tuning

Introduces two relative position embedding methods: rotary position embeddings and linear biases (ALiBi)

Shows how Extended Mind Transformers perform on a long-context benchmark

Discusses how the internal retrieval mechanism can reduce model hallucinations

Describes how Extended Mind Transformers enable more granular causal citations

Discusses the important parameters to tune when implementing Extended Mind Transformers

Explains how stride length affects the generated memory representations

Discusses how the top K parameter affects retrieval and attention

Introduces two regularization techniques: similarity masking and eliminating unknown tokens

Points to the Extended Mind Transformers models available on Hugging Face

Summarizes the advantages of Extended Mind Transformers: no fine-tuning required, easy to use, and better retrieval-task performance

Encourages viewers to read the paper for more technical details

Transcripts

[Music]

[00:13] I'm Phoebe, I'm a machine learning engineer at Normal Computing, and I'm really excited to tell you about some of our recent research, in particular Extended Mind Transformers.

[00:22] Just to briefly cover what we're going to go over in today's talk: we'll introduce the problem, which I think will be quite familiar given the amazing talk that came before mine, and then dive right into the method, that is, the retrieval mechanism that Extended Mind Transformers implement. Then we'll dive into some experiments which give us confidence that these methods are actually performant. After that we'll get into two of my favorite and, I think, most compelling features that Extended Mind Transformers enable: a new kind of citation, as well as a new kind of generation paradigm which is active-learning inspired. And then we'll go over the most important parameters to tune when implementing EMTs in your applications, and generally how to use them.

[01:03] We pre-train language models so that they have general knowledge, but as we've been discussing all conference, that's not enough. We need a lot of application-specific information, and a topical description of the world, in order to make these things useful. I'm not going to belabor the two most popular methods that try to load this description into the language model, those being long context and RAG; we've heard a lot about those great methods already. But I'd like to point out that they solve the problem in different ways and thus suffer from different downsides.

[01:39] Long context seeks to extend the context window of the Transformer model. We train language models on sequences of a fixed length, and then we ask: can we extend that, so we can include more in the context, more in the prompt, during inference time? Fine-tuning is usually how this is done, and that's awfully expensive. More than that, including all of that context in your prompt can confuse the model with a lot of irrelevant information. And beyond that, just conceptually speaking, it seems a little wasteful: if we're doing question answering over a big code base, our query usually doesn't need to reference all of those function definitions, just some subset of them, to be answered correctly.

[02:24] That is what RAG tries to do: subset that information down and include just the most relevant context in the prompt. So what are the issues here? These mechanisms, which are external to the Transformer, are necessarily limited by being external to the model. We make the choice of what's relevant once, up front, before the generation starts, and we make that choice using the least granular representation of the data, often one that's disjoint from the way the model will reason about that data.

[03:03] Conceptually, neither of these methods makes a distinction between things that should go in memory and things that should be included along with your inference query. And this is more than just aesthetics: it's going to enable us to have more granular, causal citations, and to allow the model to retrieve more information when we can tell it's uncertain, actively, within the generation.

[03:32] So how do we do this? Extended mind attention is a very simple edit to the attention mechanism of the Transformer. I'm not going to get too much into the math because we don't have a ton of time today, but I would love for anyone to check out the paper and let me know what you think. I'll just go over how this works from a qualitative perspective. Most of the Transformers we're using today are decoder-only Transformers, and within each of those decoder layers the model represents data as key-value pairs. So it actually already has this retrieval mechanism built into the Transformer; all we have to do is hack around it. We pass all of the memory tokens through the model and save off those key-value representations, and then during generation time we allow each query token, just like RAG, to use cosine similarity to go retrieve a particular number of those memory tokens and attend to them. In this picture, the red highlighted tokens are meant to represent those retrieved tokens.
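
To make that concrete, here is a minimal, self-contained sketch of what extended mind attention for a single head could look like: memory key-value pairs are cached ahead of time, each query token retrieves its top-k most similar memory keys by cosine similarity, and attention is then computed over the retrieved pairs concatenated with the local context. This is an illustrative reconstruction based on the talk, not the released implementation; causal masking and the position handling discussed next are omitted.

```python
import torch
import torch.nn.functional as F

def extended_mind_attention(q, local_k, local_v, mem_k, mem_v, top_k=8):
    """Single-head sketch of extended mind attention (illustrative only).

    q:       (n_q, d) query vectors for the current tokens
    local_k: (n_c, d) keys for the local context
    local_v: (n_c, d) values for the local context
    mem_k:   (n_m, d) cached keys for the memory tokens
    mem_v:   (n_m, d) cached values for the memory tokens
    """
    d = q.shape[-1]

    # 1. Each query token retrieves its top-k memory tokens by cosine similarity.
    sims = F.normalize(q, dim=-1) @ F.normalize(mem_k, dim=-1).T          # (n_q, n_m)
    top_idx = sims.topk(min(top_k, mem_k.shape[0]), dim=-1).indices       # (n_q, k)
    k_ret, v_ret = mem_k[top_idx], mem_v[top_idx]                         # (n_q, k, d)

    # 2. Attend over [retrieved memory tokens ; local context] for every query token.
    k_all = torch.cat([k_ret, local_k.expand(q.shape[0], -1, -1)], dim=1)
    v_all = torch.cat([v_ret, local_v.expand(q.shape[0], -1, -1)], dim=1)
    scores = (q.unsqueeze(1) @ k_all.transpose(1, 2)).squeeze(1) / d**0.5
    weights = scores.softmax(dim=-1)                                      # (n_q, k + n_c)
    return (weights.unsqueeze(1) @ v_all).squeeze(1)                      # (n_q, d)
```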

[04:37] Again, this ends up being a very simple change to the Transformer model. What's difficult is figuring out how to assign position information to those retrieved tokens. This work is based on research from a couple of years ago, but they needed to fine-tune their model in order to teach it how to leverage these retrieved tokens, and that's in large part due to the absolute position embeddings that were popular at the time. Because Transformer models are position agnostic, we have to figure out how to tell them: this token is position zero, this one is position one, and so on. But today's softer position embeddings, in particular the relative position embeddings that have become popular, allow us to leverage this method without any further fine-tuning; they really enable the model to generalize to these retrieved tokens.

[05:39] I'll mention the two different methods we've tested and implemented this on. The first, present in all of the Llama models, is rotary position embeddings, which generalize the principle of using the angle between two vectors as a distance metric: we take the whole embedding and rotate it two positions at a time. The other one we implemented this method into is ALiBi, linear biases. These aren't position embeddings at all; they just linearly damp down information that is further away, and they are how all of the MosaicML MPT models are trained.
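
As a quick illustration of the two schemes just mentioned (general RoPE and ALiBi, not the specific way the paper assigns positions to retrieved tokens), here is a small sketch: rotary embeddings rotate pairs of embedding dimensions by an angle proportional to the position, so attention scores end up depending only on relative offsets, while ALiBi adds no embedding at all and instead subtracts a penalty from the attention logits that grows linearly with distance. The slope value is illustrative; real ALiBi uses a different slope per head.

```python
import torch

def rope_rotate(x, pos, base=10000.0):
    """Rotary position embedding: rotate dimension pairs of x by position-dependent angles."""
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))  # (d/2,)
    angles = pos * inv_freq
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]          # rotate two dimensions at a time
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)

def alibi_bias(n_queries, n_keys, slope=0.0625):
    """ALiBi: no embeddings, just a linear penalty on attention logits by distance."""
    q_pos = torch.arange(n_queries).unsqueeze(1)
    k_pos = torch.arange(n_keys).unsqueeze(0)
    return -slope * (q_pos - k_pos).clamp(min=0)  # add this to the attention scores
```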

play06:18

evaluations um we also just open sourced

play06:20

a new counterfactual retrieval Benchmark

play06:23

and I'm just going to briefly describe

play06:24

what that Benchmark looks like so this

play06:26

is a long context Benchmark so our input

play06:29

context is are query answer pairs uh and

play06:32

the context to answer those questions

play06:34

range from about 2,000 tokens to all the

play06:36

way up to 16,000 tokens and the again

play06:39

these are like query so like the

play06:41

question might be who wrote the song

play06:42

these shoes were made for walking and

play06:44

then the corresponding Wikipedia snippet

play06:47

um we wanted to control for facts

play06:49

memorized during pre-training though and

play06:51

actually any fine tuning also so what we

play06:54

did was we looked up for instance in

play06:56

this case the answer is Lee Hazelwood we

play06:58

did a little bit of research we figured

play06:59

out okay well Terry Allen is a similar

play07:02

songwriter this is a plausible answer

play07:04

but it's wrong we went in and we

play07:06

replaced all the instances of Lee

play07:08

Hazelwood with Terry Allen and now we

play07:10

asked the model to retrieve this new you

play07:12

know not factually correct but in the

play07:15

sense it we're trying to test whether

play07:17

it's prioritizing what's being provided

play07:19

at inference time um so now we're asking

play07:22

it to retrieve this Terry Allen
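
A minimal sketch of how one such counterfactual example could be constructed from a retrieval QA pair, following the substitution described above. The field names and snippet are illustrative, not the benchmark's actual schema.

```python
def make_counterfactual_example(question, context, true_answer, plausible_wrong_answer):
    """Swap the memorized answer for a plausible but wrong one, so the only way to
    answer correctly is to rely on the context provided at inference time."""
    return {
        "question": question,
        "context": context.replace(true_answer, plausible_wrong_answer),
        "answer": plausible_wrong_answer,  # the model is scored against the substituted answer
    }

example = make_counterfactual_example(
    question="Who wrote the song 'These Boots Are Made for Walkin''?",
    context="'These Boots Are Made for Walkin'' is a song written by Lee Hazlewood ...",
    true_answer="Lee Hazlewood",
    plausible_wrong_answer="Terry Allen",
)
```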

[07:26] All right, so how do Extended Mind Transformers stack up? Here we're comparing against fine-tuned models as well as the base Llama model with interpolated position embeddings. You can see, in green, that the base model does a pretty good job extrapolating, even to many times its training length; this is a model trained on sequences of up to 2,048 tokens during pre-training, and even up to 8k it's doing okay. At 16k it really falls off; the position embeddings can't extrapolate that far. The fine-tuned models, you can see, actually perform worse than the extended mind model on these shorter inputs, which is another data point suggesting that fine-tuning on super long contexts degrades the quality of attention you get on shorter inputs. And Extended Mind Transformers continue to be competitive with those fine-tuned models all the way up to 16k; again, our models are not fine-tuned at all.

[08:21] In this particular experiment, what the extended mind model sees in context is the query only, just the "who wrote the song" question, and it relies heavily on that internal retrieval mechanism to go look up the new information. In a second experiment we seed it with a little more information in context using RAG, while still mostly relying on that internal mechanism, and you can see we're outperforming GPT-4 here when we combine it with that additional information in context.

[08:57] Okay, now we're going to talk about citations. I think this is a topic lots of you here can empathize with as AI engineers; it's one of the most important things to provide in an application so that people can learn to trust the model outputs. In fact, you might use RAG just to get citations. With RAG, though, the citations you get are a bit of a post-hoc rationalization: if a date appears in the output and we know it was also in the input to the language model, we feel pretty confident that the date isn't hallucinated, but that's not really causally related to what information the model used during the generation.

[09:38] With Extended Mind Transformers, we can look up exactly which tokens were retrieved from those memories and used during generation. In this example, on the top left we have the memories, a snippet from Wikipedia about one of my favorite mathematicians, Alexander Grothendieck, and the query is when he got his French citizenship. At the bottom you can see the completion with the correct date; I think he got it in 1971. The blue highlighted tokens, importantly the "1971" as well as some of the "Alexander Grothendieck" tokens, are the ones the model retrieved and attended to when generating that correct "1971" token. Being able to report that gives a lot of confidence, and also insight into how the model is using those retrieved tokens.
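
A rough sketch of how such token-level citations could be surfaced, assuming the generation loop exposes, for each generated token, the positions of the memory tokens it retrieved and attended to; the released code's actual interface may differ.

```python
from collections import defaultdict

def build_citations(tokenizer, memory_ids, generated_ids, retrieved_positions_per_step):
    """Map each generated token back to the memory tokens it attended to.

    memory_ids / generated_ids: lists of token ids.
    retrieved_positions_per_step: one list per generation step, holding the positions in
    memory_ids that were retrieved at that step (assumed to be exposed by the model).
    """
    citations = defaultdict(list)
    for step, mem_positions in enumerate(retrieved_positions_per_step):
        token_text = tokenizer.decode(generated_ids[step])
        citations[token_text].extend(tokenizer.decode(memory_ids[p]) for p in mem_positions)
    return dict(citations)

# e.g. citations.get("1971") might return pieces of the Grothendieck snippet it attended to.
```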

[10:26] Okay, we can also use Extended Mind Transformers to reduce hallucinations. How do we do this? In the simplest case, we have access to token-level entropy over the output distribution; if you want to get fancier, we're also doing some Bayesian fine-tuning of language models at Normal, but you can use any uncertainty metric to determine how certain the model is about a generated token. If we can detect that the model is uncertain about that token, we can regenerate that step using more information from the memories.

[11:01] In the top right here, we set a baseline default number of memories that each query token is allowed to retrieve and attend to, and you can see it wasn't quite enough information to get this query right: if you remember from the previous slide, the correct answer is 1971, and we got 1993 here, so we didn't attend to that memory quite enough to get the question right. In the bottom example, we allow it to regenerate some subset of those tokens using more information from the cache when we can tell the model was uncertain, and it got this right. So it gives a nice intuition: when the model is really uncertain, go use more information. It can also be more efficient, depending on how the math works out.
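
A minimal sketch of the active-learning-inspired loop described here, assuming a decoding step that can be re-run with a larger number of retrieved memories per query token. The step function, the entropy threshold, and the top-k values are placeholders for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy of a next-token distribution given as a list of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def generate_with_uncertainty_retry(step_fn, n_tokens, base_topk=8, big_topk=64, threshold=2.5):
    """Greedy generation loop that regenerates uncertain steps with more memories.

    step_fn(generated_so_far, topk) -> (token, probs): one decoding step, where topk is
    the number of memories each query token may retrieve (an assumed interface).
    """
    generated = []
    for _ in range(n_tokens):
        token, probs = step_fn(generated, base_topk)
        if entropy(probs) > threshold:
            # The model looks uncertain: redo this step, letting it pull more from memory.
            token, probs = step_fn(generated, big_topk)
        generated.append(token)
    return generated
```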

[11:52] All right, now I'm going to tell you about the most important parameters to set when using Extended Mind Transformers. You may have heard of something called stride length before; it's a parameter that comes up a lot even in regular perplexity computations. When we compute the memories we're going to attend to, we pass them through the model and save off the key-value representations the model produces internally. But the models we're using are trained on a fixed context length, so we need to pass over the memories with some stride, such that each of those tokens has an appropriate amount of context to generate its representation. If the stride is smaller, you get higher-quality representations, but it also requires more computation, so you can tune this; there are graphs in the paper that illustrate the tradeoff. It's an important parameter to set when generating the memories themselves.

[12:52] Top K is probably the most important parameter to think about. This is the number of key-value pairs, or memories, that each query token is allowed to retrieve and attend to. When your memory is quite long, the more the better, generally, but it should be set dynamically based on how long your memory is.
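
A sketch of how memory key-value representations could be computed with a strided pass, so that each memory token is encoded with enough preceding context even when the memory is longer than the training context length. The window and stride sizes are illustrative, and the model call that would actually return per-layer key-values is left abstract.

```python
def strided_memory_windows(n_memory_tokens, max_ctx=2048, stride=512):
    """Yield (window_start, window_end, keep_start) for a strided pass over the memory.

    Each window is at most max_ctx tokens; only the last `stride` tokens of a window have
    their key-value representations kept, so every kept token has seen up to
    max_ctx - stride tokens of preceding context. Smaller stride = better representations
    but more forward passes; this is the tradeoff mentioned in the talk.
    """
    keep_start = 0
    while keep_start < n_memory_tokens:
        window_end = min(keep_start + stride, n_memory_tokens)
        window_start = max(0, window_end - max_ctx)
        yield window_start, window_end, keep_start
        keep_start = window_end

# For a 5,000-token memory: pass each window through the model, keep the KV cache entries
# for positions keep_start..window_end, and concatenate them into the memory cache.
for window in strided_memory_windows(5000):
    print(window)
```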

[13:14] Lastly, we want to retrieve as much information as we can from the memory without confusing the model; to make an analogy back to putting everything into context, we don't want to just throw everything in there, because that will be confusing. So we implement two regularization techniques that we've found to be especially effective. The first is called similarity masking. We retrieve these tokens based on the similarity between our query token and the key we're retrieving from, and we might say: if we don't hit some similarity threshold, if the retrieved tokens aren't at least, say, 0.25 similar, we just throw them out. So we can retrieve a lot of them and then mask the ones that end up being less important.

[14:03] The other important regularization technique, in particular for models trained using RoPE, is to eliminate tokens from the memory that correspond to unknown tokens. Especially if your data is super messy, and a lot of the Wikipedia-based benchmarks are way messier than I realized before I started working on this, you get a lot of unknown tokens. Because they're unknown, they're often poorly represented by the models, so they end up matching a lot of query tokens while not actually containing much useful information. So we just eliminate those from the memory before we allow it to start retrieving.
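
A sketch of both regularization steps layered on top of the retrieval from the earlier attention sketch: drop memory positions whose token id is the tokenizer's unknown token before building the cache, and mask retrieved entries whose cosine similarity falls below a threshold. The 0.25 threshold echoes the example in the talk; everything else is illustrative.

```python
import torch.nn.functional as F

def filter_unknown_tokens(memory_ids, mem_k, mem_v, unk_token_id):
    """Remove memory positions corresponding to unknown tokens before any retrieval."""
    keep = memory_ids != unk_token_id          # memory_ids: (n_m,) tensor of token ids
    return mem_k[keep], mem_v[keep]

def retrieve_with_similarity_mask(q, mem_k, top_k=8, sim_threshold=0.25):
    """Top-k retrieval by cosine similarity, masking retrieved entries below the threshold."""
    sims = F.normalize(q, dim=-1) @ F.normalize(mem_k, dim=-1).T      # (n_q, n_m)
    top = sims.topk(min(top_k, mem_k.shape[0]), dim=-1)
    mask = top.values >= sim_threshold                                # (n_q, k) booleans
    return top.indices, mask  # masked-out entries should get no attention weight downstream
```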

[14:41] All right, we have a whole collection of these models on Hugging Face; all of the code is on GitHub, as well as that dataset. I encourage you all to read the paper if you're curious about more of the technical details. As I hope you can see here, it's actually pretty easy to use these things: it's as simple as passing those memories in as inputs, as tokens, into the model during instantiation. You can dynamically change them after that as well, but that's the easiest way to do it, and then making sure your config is set up correctly.
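
Based on that description, usage could look roughly like the sketch below. The model id follows the talk's mention of a Hugging Face collection, but the exact repository name and the keyword used to pass the memories at instantiation (here `external_memories`) are assumptions; check the model cards and the GitHub repository for the real interface.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model id; see the Normal Computing collection on Hugging Face for actual names.
model_id = "normalcomputing/extended-mind-mpt-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
memory_ids = tokenizer(open("wiki_snippet.txt").read(), return_tensors="pt").input_ids

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    external_memories=memory_ids,  # assumed keyword for passing memories at instantiation
    trust_remote_code=True,        # the extended mind attention lives in custom model code
)

prompt = "When did Alexander Grothendieck get his French citizenship?"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```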

[15:12] All right, just to conclude: I hope you'll take away that these new kinds of models achieve impressive performance on retrieval tasks, they enable these great new kinds of citations, and they enable a new hallucination-reduction technique inspired by active learning. They don't require fine-tuning, unlike long-context methods, and they can easily be run using our open-source models and code. Thanks so much, and find me after for questions.

[Music]
