Going beyond RAG: Extended Mind Transformers - Phoebe Klett
Summary
TLDR
Phoebe is a machine learning engineer at Normal Computing, and in this talk she presents the team's recent research on Extended Mind Transformers. The talk walks through the motivating problem, the method, experimental results, and how to tune the key parameters when implementing it. Extended Mind Transformers draw a clear distinction between memories and the inference query, which enables fine-grained causal citations and a new, active-learning-inspired generation paradigm that activates when the model is judged to be uncertain. The models require no fine-tuning and are easy to run using the open-source models and code.
Takeaways
- 🤖 Extended Mind Transformers (EMTs) integrate external memories into the model, delivering better performance on retrieval tasks.
- 🔍 EMTs work via a simple edit to a standard Transformer, exploiting the retrieval mechanism already built into its attention.
- 📈 Experimental results show EMTs are competitive with models fine-tuned for long contexts.
- 📝 EMTs enable a new kind of citation, grounded in exactly which information the model used during generation.
- 🧠 When uncertainty is detected in a generated token, EMTs can regenerate that step using more information from memory, a technique inspired by active learning.
- 🛠️ Stride length and top-K are the key parameters to tune when implementing EMTs; they govern how memories are generated and how many memory tokens each query token retrieves.
- 🔗 Relative position embeddings are what make it possible to assign accurate position information to retrieved tokens, letting EMTs generalize without fine-tuning.
- 📊 Two regularization techniques, similarity masking and eliminating unknown tokens, are effective at reducing the amount of confusing information the model sees.
- 💻 The models are available on Hugging Face and the code is published on GitHub, so anyone can easily try EMTs.
- 📑 The speaker encourages anyone interested in the technical details to read the paper.
Q & A
What is Phoebe's role?
-Phoebe is a machine learning engineer at Normal Computing.
What are Extended Mind Transformers?
-Extended Mind Transformers are a line of research that lets Transformer models handle more contextual information through a retrieval mechanism embedded in the model itself.
What problem do Extended Mind Transformers try to solve?
-They address the need for language models to handle detailed, application-specific or topical information that pre-training alone does not provide.
What retrieval mechanism do Extended Mind Transformers implement?
-Within each decoder layer, the model represents data as key-value pairs; each query token retrieves a number of memory tokens based on those key-value representations and attends to them.
What new kind of citation do Extended Mind Transformers provide?
-Because the model can identify exactly which memory tokens it retrieved and used during generation, it can cite the specific source of a generated result.
What is the new active-learning-inspired generation paradigm that Extended Mind Transformers enable?
-When the model is uncertain about a generated token, it can regenerate that step using more information from its memories.
What are the key parameters to tune when implementing Extended Mind Transformers?
-Stride length and the number of retrieved tokens (top-K). Stride length determines how much context each token gets when the memories are generated, and top-K determines how many memories each query token can retrieve and attend to.
What regularization techniques do Extended Mind Transformers implement?
-Two: similarity masking and eliminating unknown tokens. Both reduce the amount of confusing information the model sees and improve accuracy.
Why are Extended Mind Transformers released as open-source models?
-So that researchers and developers can easily access them, experiment with them, and improve on them.
Why don't Extended Mind Transformers require fine-tuning?
-Because modern relative position embeddings let the model generalize to retrieved tokens out of the box, avoiding the degradation in attention quality that fine-tuning on very long contexts can cause.
Outlines
🤖 Introduction to Extended Mind Transformers
Phoebe introduces the machine learning research she works on at Normal Computing, in particular Extended Mind Transformers (EMTs), and outlines how the technique helps solve a familiar problem: after pre-training a model for general knowledge, we still need to load in the application-specific information it needs. Prior approaches such as long context and RAG each come with their own downsides; EMTs offer a new approach that addresses them.
🔍 Evaluating Extended Mind Transformers
To evaluate EMTs, the team open-sourced a new counterfactual retrieval benchmark. It controls for facts memorized during pre-training and any fine-tuning, testing whether the model prioritizes information provided at inference time. EMTs remain competitive with fine-tuned models and the base Llama model even at very long contexts. And because EMTs can identify exactly which memory tokens were used during generation, they also enable a new kind of citation.
🧠 Reducing hallucinations with Extended Mind Transformers
EMTs offer a technique for reducing hallucinations by detecting uncertainty in the tokens the model generates. When the model is detected to be uncertain about a particular token, that step can be regenerated using more information from the memories, raising the probability of a correct answer and yielding efficient, more reliable output.
🛠 Parameter settings for Extended Mind Transformers
The main parameters when using EMTs: stride length ensures each token is given appropriate context when the memories are generated, and top-K is the number of memories each query token retrieves and attends to. The longer the memory, the more information you can retrieve, but retrieving too much risks confusing the model. Two regularization techniques, similarity masking and eliminating unknown tokens, help the model retrieve the right information while avoiding confusion and maintaining performance.
📈 Summary of Extended Mind Transformers
EMTs perform well on retrieval tasks and enable a new kind of citation along with a hallucination-reduction technique. They require no fine-tuning and are easy to implement using the open-source models and code, making them a valuable tool for AI engineers seeking reliability and efficiency.
Keywords
💡Machine learning engineer
💡Extended Mind Transformers
💡Retrieval mechanism
💡Token
💡Entropy
💡Regularization techniques
💡Stride length
💡Top-K
💡Hallucination
💡Active learning
Highlights
Introduces Extended Mind Transformers and the recent machine learning research behind them
Explains the retrieval mechanism that Extended Mind Transformers implement
Discusses the limitations of pre-trained language models and the need for application-specific information
Contrasts the long-context and RAG approaches and the downsides of each
Presents extended mind attention, a simple modification to the Transformer model
Explains how retrieved tokens can be handled via relative position embeddings without additional fine-tuning
Introduces two relative position encoding methods: rotary position embeddings and linear biases
Shows how Extended Mind Transformers perform on a long-context benchmark
Discusses how the internal retrieval mechanism can reduce model hallucinations
Describes how Extended Mind Transformers enable more fine-grained causal citations
Discusses the important parameters to tune when implementing Extended Mind Transformers
Explains how stride length affects the generated memory representations
Discusses how the top-K parameter affects retrieval and attention
Introduces two regularization techniques: similarity masking and eliminating unknown tokens
Points to the Extended Mind Transformers models available on Hugging Face
Summarizes the advantages of Extended Mind Transformers: no fine-tuning required, easy to use, and improved retrieval-task performance
Encourages viewers to read the paper for more technical details
Transcripts
I'm Phoebe, I'm a machine learning engineer at Normal Computing, and I'm really excited to tell you about some of our recent research, in particular extended mind Transformers.

Alright, so just to briefly cover what we're going to go over in today's talk: we'll introduce the problem, which I think will be quite familiar given the amazing talk which came before mine, and then dive right into the method: what is the retrieval mechanism that extended mind Transformers implement? Then we'll dive into some experiments which give us confidence that these methods are actually performant. After that we'll get into two of my favorite and, I think, most compelling features that extended mind Transformers enable: a new kind of citation, as well as a new kind of generation paradigm which is active-learning inspired. And then we'll go over the most important parameters to tune when implementing EMTs in your applications, and generally how to use them.
Alright, so we pre-train language models so that they have general knowledge, but as we've been discussing all conference, that's not enough: we need a lot of application-specific information and a topical description of the world in order to make these things useful. I'm not going to belabor the two most popular methods which try to load this description into the language model, those being long context and RAG, as we've heard a lot about those great methods already. But I'd like to point out that they solve the problem in different ways and thus suffer from different downsides.

Long context seeks to extend the context window of the Transformer model. We train language models on sequences of a fixed length, and then we're trying to say: can we extend that, so we can include more in the context, more in the prompt, at inference time? Fine-tuning is usually how this is done, and that's awfully expensive. More than that, including all of that context in your prompt can confuse the model with a lot of irrelevant information. And beyond that, just conceptually speaking, it seems a little wasteful: if we're trying to do question answering over a big code base, our query usually does not need to reference all of those different function definitions, it just needs some subset of them to answer the query correctly.

Okay, so this is what RAG tries to do: subset that information down and include just the most relevant context in the prompt. So what are the issues here? Well, these mechanisms, being external to the Transformer, are necessarily limited by being external to the model. We make the choice of what's relevant once, upfront, before the generation starts, and we're making that choice using the least granular representation of the data, often one that's disjoint from the way the model will reason about that data. Also, just conceptually, neither of these methods makes a distinction between things that should go in memory and things that should be included along with your inference query. And this is more than just aesthetics: it's going to enable us to have these more granular, causal citations, and to allow the model to retrieve more information when we can tell it's uncertain, actively, within the generation.
Alright, so how do we do this? Extended mind attention is a very simple edit to the attention mechanism of the Transformer. I'm not going to get too much into the math because we don't have a ton of time today, but I'd love for anyone to check out the paper and let me know what you think. I'll just go over, from a qualitative perspective, how this works. The model represents data within each decoder layer (most of the Transformers we're using today are decoder-only Transformers), and within each of those decoder layers the model represents that data as key-value pairs. So it actually already has this retrieval mechanism built into the Transformer; all we have to do is hack around it. We pass all of the memory tokens through the model and save off those key-value representations, and then at generation time we allow each query token, just like RAG, using cosine similarity, to go retrieve a particular number of those memory tokens and attend to them. In this picture, the red highlighted tokens are meant to represent those retrieved tokens.
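To make the mechanism concrete, here is a minimal, self-contained sketch of extended-mind-style attention for a single head, run against random data. The function name, shapes, and the per-token loop are illustrative simplifications for clarity, not the implementation from the paper.

```python
import torch
import torch.nn.functional as F

def extended_mind_attention(q, k_local, v_local, k_mem, v_mem, top_k=4):
    """Toy single-head attention where each query token also attends to the
    top_k most similar memory key/value pairs (retrieved by cosine
    similarity), alongside the local context. Shapes: q (n_q, d),
    k_local/v_local (n_ctx, d), k_mem/v_mem (n_mem, d)."""
    # Retrieve: cosine similarity between each query and every memory key.
    sim = F.normalize(q, dim=-1) @ F.normalize(k_mem, dim=-1).T   # (n_q, n_mem)
    top_idx = sim.topk(top_k, dim=-1).indices                     # (n_q, top_k)

    outputs = []
    for i in range(q.shape[0]):
        # Each query token attends to its own retrieved memories plus context.
        k = torch.cat([k_mem[top_idx[i]], k_local], dim=0)
        v = torch.cat([v_mem[top_idx[i]], v_local], dim=0)
        attn = torch.softmax(q[i] @ k.T / k.shape[-1] ** 0.5, dim=-1)
        outputs.append(attn @ v)
    return torch.stack(outputs), top_idx  # top_idx doubles as citation info

# Toy usage with random tensors:
d = 16
out, retrieved = extended_mind_attention(
    torch.randn(3, d), torch.randn(5, d), torch.randn(5, d),
    torch.randn(100, d), torch.randn(100, d))
```

Note that the retrieved indices are returned alongside the output: that is exactly what makes the causal citations discussed later possible.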
Again, this ends up being a very simple change to the Transformer model. What's difficult is figuring out how to assign position information to those retrieved tokens. This work is based on research from a couple of years ago, but they needed to fine-tune their model in order to teach it how to leverage these retrieved tokens, in large part due to the absolute position embeddings that were popular at the time. Because Transformer models are position-agnostic, we have to figure out how to tell them: okay, this token is position zero, this one is position one, and so on. But today's softer position embeddings allow us to leverage this method without any further fine-tuning. In particular, the relative position embeddings that have become popular (and I'll talk about two different methods that we've tested and implemented this on) really enable the model to generalize to these retrieved tokens. The first one we tested on is present in all of the Llama models: rotary position embeddings. These generalize the principle of using the angle between two vectors as a distance metric; we take the whole embedding and rotate it, two dimensions at a time. The other one we implemented this method into is ALiBi, linear biases. These actually aren't position embeddings at all; they just linearly damp down information which is further away, and they're how all of the Mosaic MPT models are trained.
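For reference, here is a simplified sketch of rotary position embeddings in their original interleaved formulation; Llama-family code uses an equivalent "rotate-half" layout, but the idea is the same.

```python
import torch

def rotary_embed(x, positions, base=10000.0):
    """Apply rotary position embeddings (RoPE): rotate each pair of
    dimensions of x by an angle proportional to the token's position.
    x: (seq_len, d) with d even; positions: (seq_len,) integer positions."""
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))  # (d/2,)
    angles = positions.float()[:, None] * inv_freq[None, :]         # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]  # pair up adjacent dimensions
    # Standard 2-D rotation applied to each (x1, x2) pair.
    rotated = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return rotated.flatten(-2)

# Because attention scores depend only on the *relative* angle between a
# query and a key, retrieved memory tokens can be assigned nearby positions
# at inference time without retraining the model.
q = rotary_embed(torch.randn(4, 8), torch.arange(4))
```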
Okay, so let's talk about some evaluations. We also just open-sourced a new counterfactual retrieval benchmark, and I'll briefly describe what it looks like. This is a long-context benchmark: our inputs are query-answer pairs, and the contexts needed to answer those questions range from about 2,000 tokens all the way up to 16,000 tokens. For example, the question might be "Who wrote the song These Shoes Were Made for Walking?", paired with the corresponding Wikipedia snippet. We wanted to control for facts memorized during pre-training, though, and actually any fine-tuning as well. So what we did was look up the answer (in this case, Lee Hazlewood), do a little research, and figure out that Terry Allen is a similar songwriter: a plausible answer, but a wrong one. We went in and replaced all the instances of Lee Hazlewood with Terry Allen, and now we ask the model to retrieve this new answer, not the factually correct one, because we're trying to test whether it's prioritizing what's being provided at inference time. So now we're asking it to retrieve the Terry Allen answer.

Alright, so how do extended mind Transformers stack up? Here we're comparing them with fine-tuned models as well as the base Llama model with interpolated position embeddings. You can see in the green that the base model does a pretty good job extrapolating, even to many times its training length: this is a model trained on sequences of up to 2,048 tokens, and even up to 8K it's doing okay, but at 16K it really falls off, as the position embeddings can't extrapolate that far. The fine-tuned models actually perform worse than the extended mind model on these shorter inputs, another data point suggesting that fine-tuning on super long contexts degrades the quality of attention you get on shorter inputs. And extended mind Transformers continue to be competitive with those fine-tuned models all the way up to 16K; again, our models are not fine-tuned at all. In this particular experiment, what the extended mind model sees in context is the query only: it only sees "Who wrote the song These Shoes Were Made for Walking?" and relies heavily on that internal retrieval mechanism to go look up the new information. In a second experiment we seed it with a little more information in context using RAG, while still mostly relying on that internal mechanism, and you can see we're outperforming GPT-4 when we combine it with that extra information in context.

Okay, now we're going to talk about citations. I think this is a topic that lots of you here can empathize with as AI engineers: this is one of the most important things to provide in an application so that people can learn to trust the model outputs. In fact, you might actually use RAG just to get citations. With RAG, though, the citations you get are a bit of a post-hoc rationalization: if a date appears in the output and we know it was also in the input to the language model, we feel pretty confident that the date is not hallucinated, but that's not really causally related to what information the model used during the generation. With extended mind Transformers, we can look up exactly which tokens were retrieved from those memories and used during generation. In this example, on the top left we have the memories, a snippet from Wikipedia about one of my favorite mathematicians, Alexander Grothendieck, and the query is when he got his French citizenship. At the bottom you can see the completion with the correct date: he got it in 1971. The blue highlighted tokens, importantly the 1971 as well as some of the Alexander Grothendieck tokens, are the ones the model retrieved and attended to when generating that correct 1971 token. Being able to report that gives a lot of confidence, and also insight into how the model is using those retrieved tokens.
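As a sketch of what such a citation looks like programmatically: assuming you have access to the indices of the memory tokens a generation step attended to (for example, the second return value of the toy attention function earlier), mapping them back to text is straightforward. The helper below is hypothetical, not the released models' actual API.

```python
# Hypothetical sketch: turn retrieved memory-token indices into a citation.
# `memory_ids` is the tokenized memory (a list of token ids) and
# `retrieved_idx` holds the memory positions attended to at one step.
def cite(tokenizer, memory_ids, retrieved_idx):
    pieces = [tokenizer.decode([memory_ids[i]]) for i in retrieved_idx]
    return f"grounded in memory tokens {sorted(set(retrieved_idx))}: {pieces}"

# e.g. cite(tokenizer, memory_ids, retrieved[0].tolist()) might report that
# the '1971' completion attended to the ' 1971' token in the memory.
```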
Okay, we can also use extended mind Transformers to reduce hallucinations. How do we do this? Right now, in the simplest case, we have access to token-level entropy over the output distribution; if you wanted to get fancier (we're also doing some Bayesian fine-tuning of language models at Normal), you can use any uncertainty metric to determine how certain the model is about a generated token. If we detect that the model is uncertain about a token, we can regenerate that step using more information from the memories. In the top right here, we set a baseline default number of memories that each query token is allowed to retrieve and attend to, and you can see it wasn't quite enough information to get this query right: if you remember from the previous slide, the correct answer is 1971, and we've got 1993 here. We didn't attend to that memory quite enough to get the question right. In the bottom example, we allow the model to regenerate some subset of those tokens using more information from the cache when we can tell it was uncertain, and this time it got it right. So it's a nice intuition: when the model's really uncertain, go use more information. It can also be more efficient, depending on how the math works out.
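A minimal sketch of this active-learning-inspired loop, assuming a `generate_step` callable that produces next-token logits for a given retrieval budget; that callable is a stand-in for the model's decoding step, not a real API.

```python
import torch

def token_entropy(logits):
    """Shannon entropy of the next-token distribution: a simple
    uncertainty signal for a single generation step."""
    log_probs = torch.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)

def generate_with_escalation(generate_step, base_top_k=2, max_top_k=8,
                             entropy_threshold=2.0):
    """If the model looks uncertain at the baseline retrieval budget,
    regenerate the step with more memories. Thresholds are illustrative."""
    logits = generate_step(top_k=base_top_k)
    if token_entropy(logits) > entropy_threshold:
        # Uncertain: redo this step attending to more memory tokens.
        logits = generate_step(top_k=max_top_k)
    return logits.argmax(dim=-1)
```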
Alright, so now I'm going to tell you about the most important parameters to set when using extended mind Transformers. You may have heard of something called stride length before; it's a parameter that comes up a lot even in regular perplexity computations. When we compute the memories we're going to attend to, we pass them through the model and save off the key-value representations that the model computes internally. But the models we're using are trained on a fixed context length, so we need to pass over the memories with some stride, such that each token has an appropriate amount of context to generate its representation. If the stride is smaller, you get higher-quality representations, but it also requires more computation. You can tune this, and there are graphs in the paper that illustrate the tradeoff, but it's an important parameter to set when generating the memories themselves.
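A hedged sketch of what striding over a long memory looks like, assuming a Hugging Face-style causal LM that returns `past_key_values` in the legacy per-layer tuple format; the real implementation collects KVs from every layer and is model-specific.

```python
import torch

def build_memory_kv(model, input_ids, max_len=2048, stride=512):
    """Compute key/value representations for a long memory by sliding a
    window over it. A smaller stride gives each token more left context
    (higher-quality representations) at the cost of more forward passes.
    Sketch only: layer 0 shown, legacy past_key_values format assumed."""
    keys, values = [], []
    for start in range(0, input_ids.shape[-1], stride):
        window = input_ids[..., max(0, start + stride - max_len):start + stride]
        with torch.no_grad():
            out = model(window, use_cache=True)
        # Keep only the KVs for the newly covered tokens, which were
        # computed with the most context available.
        n_new = min(stride, input_ids.shape[-1] - start)
        k, v = out.past_key_values[0]
        keys.append(k[..., -n_new:, :])
        values.append(v[..., -n_new:, :])
    return torch.cat(keys, dim=-2), torch.cat(values, dim=-2)
```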
Top-K is probably the most important parameter to think about. This is the number of key-value pairs, or memories, that each query token is allowed to retrieve and attend to. When your memory is quite long, the more the better, but it should really be set dynamically based on how long your memory is.

Lastly, we want to retrieve as much information as we can from the memory without confusing the model. To make an analogy back to putting everything into context: we don't want to just throw everything in there, because that will confuse the model. So we implement two different regularization techniques that we have found to be especially effective. The first is called similarity masking. Again, we retrieve these tokens based on similarity between our query token and the key we're retrieving from, so we might say: we'll retrieve a lot of them, but if they're not at least, say, 0.25 similar, we'll throw them out. We retrieve, and then mask the ones that end up being less important. Another important regularization technique, in particular for models trained using RoPE, is to eliminate tokens from the memory that correspond to unknown tokens. Especially if your data is super messy (a lot of the Wikipedia-based benchmarks are way messier than I knew before I started working on this), there are a lot of unknown tokens, and they're poorly represented by the models. Because they're unknown, they end up having a lot of matches with your query tokens while not actually containing much useful information. So we just eliminate those from the memory before we allow it to start retrieving.
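Here is a small sketch combining both regularizers; the threshold value and tensor layout are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def retrieve_with_regularization(q, k_mem, memory_ids, unk_id,
                                 top_k=8, sim_threshold=0.25):
    """Top-k retrieval with the two regularizers described in the talk:
    (1) drop memory positions whose token is <unk> before retrieving, and
    (2) mask out retrieved keys below a cosine-similarity threshold."""
    keep = memory_ids != unk_id                  # eliminate unknown tokens
    idx_map = keep.nonzero(as_tuple=True)[0]     # surviving memory positions
    sim = F.normalize(q, dim=-1) @ F.normalize(k_mem[idx_map], dim=-1).T
    scores, top = sim.topk(min(top_k, idx_map.numel()), dim=-1)
    mask = scores >= sim_threshold               # similarity masking
    return idx_map[top], mask  # attend only where mask is True
```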
Alright, so we have a whole collection of these models on Hugging Face; all of the code is on GitHub, as well as that dataset. I encourage you all to read the paper if you're curious about more of the technical details. As I hope you can see here, it's actually pretty easy to use these things: it's as simple as passing those memories in as inputs, as tokens, into the model during instantiation. You can dynamically change them after that as well, but that's the easiest way to do it. And then make sure your config is set up correctly.
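A hedged usage sketch: the model id and the keyword for passing memories below are written from the talk's description and may not match the released API exactly; check Normal Computing's Hugging Face collection and GitHub repo for canonical usage.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "normalcomputing/extended-mind-mpt-7b"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_id)

memory = "Alexander Grothendieck received French citizenship in 1971. ..."
memory_ids = tokenizer(memory, return_tensors="pt").input_ids

# Memories are passed as tokens during instantiation, per the talk;
# the kwarg name here is an assumption. trust_remote_code loads the
# custom extended-mind attention implementation.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    external_memories=memory_ids,
    trust_remote_code=True,
)

prompt = "When did Alexander Grothendieck get his French citizenship?"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0]))
```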
Alright, so just to conclude: I hope you'll take away that these new kinds of models achieve impressive performance on retrieval tasks; that they enable a great new kind of citation; that they enable a new hallucination-reduction technique inspired by active learning; that they do not require fine-tuning, unlike long-context methods; and that they can easily be run using our open-source models and code. Thanks so much, and find me after for questions.