Massive Activations in Large Language Models
Summary
TLDR: This work uncovers a surprising phenomenon in the internal representations of large language models (LLMs): a small number of activation values are orders of magnitude larger than the rest, which the authors call "massive activations." These massive activations function as fixed bias terms and strongly influence model performance and the attention mechanism. The work further shows that replacing them with explicit bias terms removes the model's need to learn them. A similar phenomenon also appears in Vision Transformers (ViTs), where massive activations play a role akin to register tokens. Altogether, the study offers a deeper understanding of the internal mechanisms of LLMs and ViTs.
Takeaways
- 🤖 Understanding the internal mechanisms of large language models (LLMs) is important.
- 🔍 A small number of activations in LLMs were found to be extremely large; the authors name these "massive activations."
- ✨ Despite being few in number, massive activations have a critical impact on LLM performance.
- 📚 Massive activations occur across various layers of diverse LLMs and function as fixed, input-independent bias components.
- 🌉 LLMs use massive activations to concentrate attention on specific tokens, introducing an internal attention bias.
- 🔧 Introducing explicit attention biases into LLMs may eliminate the need for massive activations.
- 👁️ Massive activations also exist in Vision Transformers, playing a role similar to register tokens.
- 🔬 The discovery of massive activations provides new understanding of the internal mechanisms of LLMs.
- 💡 Research on massive activations may inform the optimization and design of future models.
- 🚀 A deeper understanding of internal mechanisms could further expand the performance and applications of LLMs.
Q & A
What did the exploration of the internal mechanisms of large language models (LLMs) reveal?
-We found "massive activations" in LLMs' internal representations that are four orders of magnitude larger than the median. These play an important role in model performance.
In which layers of an LLM do massive activations occur?
-Massive activations emerge abruptly in the early layers and tend to diminish in the final layers, staying roughly constant through the intermediate layers of many LLMs. In LLaMA2-7B, for example, they appear from layer 2 and persist through layer 30.
What is the function of massive activations?
-Massive activations act as fixed bias terms in the LLM's internal computation. Manipulating them causes a large drop in model performance, showing that they play an indispensable role in the computational process.
How do massive activations affect the self-attention mechanism in LLMs?
-In the layers after massive activations emerge, attention tends to concentrate on the tokens associated with them. In other words, massive activations introduce an implicit bias into the attention distribution.
What happens when explicit attention biases are introduced into an LLM?
-When explicit attention biases are added, the model no longer develops massive activations, indicating that explicit biases can remove the need for them.
What role do massive activations play in Vision Transformers (ViTs)?
-Massive activations also exist in ViTs, where they function as fixed bias terms. They are closely related to register tokens: both act as biases and improve model performance.
What is the difference between massive activations and outlier features?
-Massive activations are large scalar values tied to specific tokens, whereas outlier features are vectors that affect all tokens. Outlier features also occur far more frequently than massive activations.
What key observations have been made in GPT-2 and other LLMs?
-In the layer just before GPT-2's last, feature values in certain dimensions reach around 3,000. It has also been observed that the feature magnitude of the initial token grows much faster than that of other tokens.
What are attention concentration patterns, and what tendencies do LLMs show?
-Attention concentration is the phenomenon of the self-attention mechanism focusing its attention on specific tokens. In LLMs, attention tends to concentrate on initial tokens and delimiters.
How are implicit biases introduced into the self-attention mechanism?
-Even LLMs that use standard self-attention introduce implicit bias components into the attention computation via massive activations. As a result, attention concentrates on certain tokens within the model's internal mechanism.
Outlines
🚀 Internal Dynamics of Large Language Models
This section explores the internal mechanisms of large language models (LLMs) and introduces the surprising discovery of "massive activations." These activations are orders of magnitude larger than the rest and play an important role inside the model. They are associated with specific tokens, appear across various layers of the LLM, and function as bias terms that strongly influence model performance.
👾 The Layers Where Massive Activations Concentrate
This section investigates the layers where massive activations are most prominent. In several layers of an LLM, these activations emerge abruptly, stay roughly constant through the intermediate layers, and decrease in the final layers. They also arise at particular feature dimensions and token positions in the sequence. They are a phenomenon distinct from outlier features and behave like scalar values.
💡 Massive Activations Function as Bias Terms in LLMs
This section shows that massive activations serve as bias terms essential to the internal computation of LLMs. Setting them to zero caused a sharp drop in model performance, whereas setting them to their mean values had almost no effect. Furthermore, from the layer where massive activations appear onward, attention concentrates on their associated tokens, revealing how LLMs learn fixed biases and focus attention through self-attention.
🧩 Implicit Attention Biases from Massive Activations
This section examines how LLMs use massive activations to implicitly bias the self-attention mechanism. Attention concentrates on the tokens associated with massive activations, effectively introducing an implicit bias term into the attention computation. It also shows that introducing explicit attention biases lets models avoid massive activations. Massive activations are also found in Vision Transformers, where they play a role similar to register tokens.
🔍 Related Prior Work
This section surveys prior observations on autoregressive Transformers. Very large activations in specific dimensions had already been reported in GPT-2 and other LLMs. It also touches on the existence of outlier features, attention concentration patterns, and bias terms within the self-attention mechanism. These findings are connected to the discovery of massive activations in this work and are important for deepening our understanding of LLM internals.
Keywords
💡Large Language Model (LLM)
💡Massive Activations
💡Self-Attention Mechanism
💡Normalization
💡Pre-training
💡Vision Transformer (ViT)
💡Attention Concentration
💡Permutation
💡Outlier Features
💡Fixed Bias
Highlights
We found instances of certain activations that were astonishingly large, more than four orders of magnitude larger than the median, and sometimes exceeding absolute values of 15,000, even in models like LLaMA2-70B that incorporate normalization layers.
These massive activations are not just large, they're also incredibly rare, often appearing fewer than 10 times among tens of millions of activations.
We've aptly named them "massive activations," and our research shows that these are not isolated incidents but occur across a wide variety of LLMs, regardless of their size or family.
Massive activations are distinct from outlier features previously identified in LLMs. We discovered that they act as fixed but essential bias terms within the models, similar to the bias term in a linear layer.
For example, in LLaMA2-7B, nullifying just four of these massive activations led to a dramatic drop in model performance, underscoring their critical role.
However, adjusting them to their mean values did not adversely affect the model, suggesting they function as simple constant biases.
We found a strong connection between massive activations and self-attention mechanisms, with massive activations drawing attention to their associated tokens.
This extends the concept of attention sinks and provides a deeper understanding of how attention concentration patterns develop in LLMs.
We hypothesize that LLMs attempt to learn implicit bias components in self-attention through massive activations during their pre-training phase.
We also found massive activations in Vision Transformers (ViTs), albeit less frequently; they act as fixed biases, found at fixed feature dimensions but varying across patch tokens.
This similarity led us to draw connections between massive activations and the concept of register tokens in ViTs, suggesting an alternative interpretation of their function as fixed biases rather than aggregators of global image information.
We identified layers in LLMs where massive activations occur, showing that these activations remain constant across most intermediate layers and emerge rapidly in specific layers like two and four.
We determined the feature and sequence dimensions of these massive activations in models like LLaMA2-7B, 13B, and Mixtral-8x7B, finding consistent patterns such as activations at starting tokens and specific word tokens.
Massive activations are distinct from outlier features: massive activations are scalar values at few tokens, while outlier features are vectors across all tokens and layers.
By modifying the massive activations and observing the impact on model performance, we found that they act as crucial fixed biases in LLMs, significantly influencing their computational processes.
Attention in LLMs is concentrated on tokens associated with massive activations, influencing the distribution of attention logit values and shedding light on the internal mechanism of LLMs.
LLMs use massive activations to concentrate attention on specific tokens, introducing implicit bias terms in attention computations.
By augmenting LLMs with explicit attention biases, we can eliminate massive activations, showcasing an alternative approach to address attention biases in pre-training.
In ViTs, massive activations are observed in certain models like CLIP and DINOv2, serving as fixed biases that influence attention patterns, similar to the function of register tokens proposed in ViTs.
Transcripts
Section: Introduction

In this section we delve into the fascinating world of large language models (LLMs) and their internal dynamics. Our journey begins with an acknowledgement of the impressive feats achieved by these models, which have been primarily assessed through their external behaviors, such as task performance and response accuracy. However, we believe it's equally important to understand what goes on under the hood, especially as these models find their way into various real-world applications. Despite its significance, the exploration of LLMs' internal mechanisms has been somewhat limited.

Our investigation led us to a surprising discovery within the internal representations of LLMs. When we looked into the hidden states of these models, we found instances of certain activations that were astonishingly large: more than four orders of magnitude larger than the median, and sometimes exceeding absolute values of 15,000, even in models like LLaMA2-70B that incorporate normalization layers. These massive activations are not just large, they're also incredibly rare, often appearing fewer than 10 times among tens of millions of activations. Given their significant size difference compared to other activations, we've aptly named them "massive activations." Our research shows that these are not isolated incidents but occur across a wide variety of LLMs, regardless of their size or family.
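As a concrete illustration of how such a probe might look, here is a minimal PyTorch sketch, assuming the Hugging Face transformers library and the LLaMA2-7B checkpoint; the 1,000x-median cutoff is an illustrative threshold, not the paper's exact criterion:

```python
# Minimal sketch (not the authors' code): scan each layer's hidden states for
# activations whose magnitude dwarfs the median. Model name and the 1,000x
# threshold are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)

inputs = tok("Summer is warm. Winter is cold.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

for layer, h in enumerate(out.hidden_states):  # h: (1, seq_len, hidden_dim)
    mags = h[0].abs()
    median, top = mags.median(), mags.max()
    if top > 1_000 * median:  # "massive": orders of magnitude above the median
        tok_idx, dim = divmod(mags.argmax().item(), mags.shape[-1])
        print(f"layer {layer}: {h[0, tok_idx, dim].item():+.1f} at "
              f"token {tok_idx}, feature dim {dim} (median {median.item():.4f})")
```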
We took a closer look at where these massive activations are situated within the LLMs and found that their emergence is quite sudden, appearing abruptly after a single layer of computation and then diminishing in the final layers. Interestingly, these activations are not tied to specific inputs but occur in a small number of feature dimensions, often associated with the starting word token and delimiter tokens.

It's crucial to note that massive activations are distinct from outlier features previously identified in LLMs. We discovered that they act as fixed but essential bias terms within the models, similar to the bias term in a linear layer's equation. For example, in LLaMA2-7B, nullifying just four of these massive activations led to a dramatic drop in model performance, underscoring their critical role. However, adjusting them to their mean values did not adversely affect the model, suggesting they function as simple constant biases.
Our analysis further reveals that after the initial layers, LLMs repurpose the tokens associated with massive activations to store these crucial biases. Intriguingly, we found a strong connection between massive activations and self-attention mechanisms, with massive activations drawing attention to their associated tokens. This extends the concept of attention sinks and provides a deeper understanding of how attention concentration patterns develop in LLMs.

We hypothesize that LLMs attempt to learn implicit bias components in self-attention through massive activations during their pre-training phase. To test this, we experimented with augmenting self-attention with additional key and value embeddings explicitly designed as biases. Remarkably, this approach eliminated the need for LLMs to learn massive activations.

Our observations of massive activations are not limited to LLMs; we also found them in Vision Transformers (ViTs), albeit less frequently. Like in LLMs, these activations in ViTs act as fixed biases, found at fixed feature dimensions but varying across patch tokens. This similarity led us to draw connections between massive activations and the concept of register tokens in ViTs, suggesting an alternative interpretation of their function as fixed biases rather than aggregators of global image information.

In summary, our exploration into the internal workings of LLMs has unveiled the critical role of massive activations and their implications for model behavior and performance. This insight not only enriches our understanding of LLMs but also opens new avenues for optimizing and designing future models.

Section summary: In this section we investigate a surprising discovery in large language models (LLMs), where certain activations within the models exhibit significantly larger magnitudes, termed massive activations. These massive activations, despite being rare and few in number, play a crucial role as fixed bias terms in LLMs, influencing model performance and attention mechanisms. We find that these massive activations are not unique to specific LLM models but are observed across various LLMs, showcasing their importance in understanding the internal mechanisms of these models.
Section: Which Layers

In this section we explore which layers of certain language models show the most significant activations, meaning the points where the model's output is most intense. We looked at the output from each layer of three different models, LLaMA2-7B, LLaMA2-13B, and Phi-2, and averaged the results over 100 sequences. Our findings show that the most intense activations tend to occur at the same spot across many of the middle layers of these models. Specifically, these intense activations start appearing in the early layers, remain fairly constant through the middle layers, and then start to decrease in the last few layers. For example, in the LLaMA2-7B model these activations first show up in layer 2 and stay consistent up to layer 30. Interestingly, in both the LLaMA2-7B and 13B models these activations appear suddenly rather than building up gradually, suggesting they're caused by a different mechanism than we might expect.

Next, we delve into where within the model's hidden states these intense activations occur, focusing on their feature and sequence dimensions. By examining a middle layer, we found that in the LLaMA2-7B model two specific feature dimensions show these activations. As for the sequence dimensions, these activations are linked to the first word token and the token representing either a period or a new line in the sequence. This pattern holds true even in longer sequences, with up to four intense activations occurring when these specific tokens are present. In the LLaMA2-13B model the intense activations are consistently found in two feature dimensions and are always associated with the starting token of the sequence. Another model, Mixtral-8x7B, shows a similar pattern but includes additional tokens like "and" and "of" as points of intense activation.

Summarizing our findings across various models, we noticed that intense activations are typically found in very few feature dimensions. When it comes to sequence dimensions, we categorize the models based on where these activations occur: some models show them only at the starting token, others at the starting token and the first strong delimiter, and a third group shows them at the starting token, various delimiters, and certain low-semantic words.

Lastly, we differentiate these intense activations from outlier features, another phenomenon observed in language models. While both involve high activation magnitudes, intense activations are scalar values linked to specific tokens, whereas outlier features are vectors affecting all tokens. Moreover, intense activations occur at far fewer tokens compared to outlier features. In our analysis of the LLaMA2-7B and 13B models, we identified several outlier features but found that none of them matched the feature dimensions of the intense activations, highlighting a clear distinction between the two phenomena.

Section summary: In this section we identify layers in LLMs where massive activations occur, showing that these activations remain constant across most intermediate layers and emerge rapidly in specific layers, like two and four. We also determine the feature and sequence dimensions of these massive activations in models like LLaMA2-7B, 13B, and Mixtral-8x7B, finding consistent patterns such as activations at starting tokens and specific word tokens. Additionally, we differentiate massive activations from outlier features, highlighting that massive activations are scalar values at few tokens, distinct from outlier features, which are vectors across all tokens and layers.
Section: Massive Activations Act as Biases in LLMs

In this section we explore the significant role that massive activations play within large language models (LLMs). Initially, we were curious about whether these massive activations were crucial for the model's internal computations or merely superfluous. To get a clearer understanding, we decided to actively manipulate these massive activations and observe how such changes impact the LLM's behavior.

We began by examining the variability of these massive activations across different input sequences. In addition to focusing on massive activations, we also looked at three other types of activations based on their average sizes: specifically, the top 1%, the top 10%, and the median within the hidden states. By analyzing 100 sequences in two versions of the LLaMA model, we noticed that the variability of massive activations was relatively small compared to their average values, especially when compared to other types of activations.
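As a rough illustration of this variability analysis, the sketch below compares the mean and standard deviation of each activation group across a batch of sequences; grouping by quantile and using the per-sequence maximum as a stand-in for the massive activation are simplifying assumptions:

```python
# Rough sketch: per-sequence statistics for several activation groups. A small
# std relative to the mean indicates the value is nearly input-independent.
import torch

def group_stats(hidden_batch: torch.Tensor):
    # hidden_batch: (n_seqs, seq_len, hidden_dim) hidden states from one layer
    flat = hidden_batch.abs().flatten(start_dim=1)   # magnitudes per sequence
    stats = {}
    for name, q in [("median", 0.50), ("top 10%", 0.90), ("top 1%", 0.99)]:
        vals = flat.quantile(q, dim=1)               # one scalar per sequence
        stats[name] = (vals.mean().item(), vals.std().item())
    top = flat.max(dim=1).values                     # proxy for the massive activation
    stats["massive"] = (top.mean().item(), top.std().item())
    return stats
```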
To delve deeper, we modified the LLM's inference process by adjusting the massive activations at a specific layer. For any hidden state that showed massive activations, we manually set these activations to fixed values. After making these adjustments, we continued with the normal computation process in the subsequent layers. We conducted this experiment on both the LLaMA2-7B and 13B models and assessed the impact on various benchmarks, including perplexity on datasets like WikiText, C4, and PG19, and mean zero-shot accuracy on tasks such as BoolQ, PIQA, WinoGrande, ARC-Easy, and ARC-Challenge.

One of our interventions involved setting the massive activations to zero at the point they first appeared within the hidden states. Surprisingly, this led to a significant drop in the model's performance, with a notable increase in perplexity. For comparison, we also set an equal number of activations, those with average magnitudes close to the median, to zero, and observed no performance degradation. This experiment highlighted the indispensable role of massive activations in the internal workings of LLMs. In another experiment we adjusted the massive activations to their empirical mean values, calculated over 100 sequences. This intervention resulted in minimal changes in both perplexity and zero-shot accuracy, suggesting that the values of massive activations are constant and not dependent on the input, thus acting similarly to bias terms. To summarize, our findings indicate that massive activations function as crucial fixed biases within LLMs, significantly influencing their computational processes.
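A forward hook makes this kind of intervention easy to sketch. The layer index, (token, dimension) coordinates, and replacement values below are placeholders, and `model` is assumed to be a Hugging Face LLaMA-style checkpoint as in the earlier snippet:

```python
# Placeholder intervention sketch: overwrite chosen activations at one layer,
# then let all later layers run unchanged. Coordinates/values are illustrative.
import torch

LAYER = 2                            # layer where massive activations first appear
COORDS = [(0, 1415), (0, 2533)]      # (token index, feature dim) pairs (assumed)
NEW_VALUES = [0.0, 0.0]              # 0.0 to nullify, or empirical means to "fix"

def intervene(module, args, output):
    hidden = output[0] if isinstance(output, tuple) else output
    for (t, d), v in zip(COORDS, NEW_VALUES):
        hidden[:, t, d] = v          # in-place edit of the residual stream
    return output

handle = model.model.layers[LAYER].register_forward_hook(intervene)
# ... run perplexity / zero-shot evaluations here ...
handle.remove()                      # restore normal inference afterwards
```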
Moving on to the effects on attention, we investigated how massive activations influence the self-attention mechanism in LLMs. We observed a clear shift in attention patterns before and after the emergence of massive activations. Specifically, in layers following the appearance of massive activations, attention was predominantly focused on the tokens associated with these activations. This pattern was consistent across different LLMs, including LLaMA2-7B, LLaMA2-13B, and Phi-2, when processing the same input. Our analysis showed that attention logits, which are calculated before the softmax operation and determine the distribution of attention, were mostly negative, except when associated with tokens having massive activations, where they turn slightly positive. This means that during the softmax calculation, these tokens with massive activations attract the majority of the attention probability.
probability interestingly while previous
research highlighted that llms tend to
heavily focus on the starting token our
findings suggest that llms also allocate
significant attention to other tokens
linked with massive
activations this reveals a deeper reason
behind the emergence of these attention
concentration patterns underscoring the
importance of massive activations in
shaping the behavior of
llms section summary in this section we
delve deeper into large language models
llms to investigate the role of massive
activations within them by modifying
these massive activations and observing
the impact on model performance we find
that they act as crucial fixed biases in
llms significantly affecting internal
computations additionally we discover
that attention in llms is concentrated
on tokens associated with massive
activations influencing the distribution
of attention logit values and shedding
light on the internal mechanism of
llms section massive activations impose
Section: Massive Activations Impose Implicit Attention Biases

In this section we explore how large language models (LLMs) use significant activations to implicitly influence the focus of self-attention mechanisms. Specifically, we examine the roles of the attention layer normalization and the projections for queries, keys, and values (QKV). In LLMs, input features are first normalized and then transformed into these QKV states. This approach, which was introduced with GPT-2, is now a common practice in many modern LLMs. Our observations, particularly from analyzing the LLaMA2-7B model, reveal that features of tokens which undergo massive activations stand out significantly from the rest right after normalization. These tokens then show less variation in their subsequent QKV states. We suggest that the attention layer normalization might be crucial in this phenomenon.
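A toy calculation shows why normalization can make such tokens stand out: one huge entry dominates the variance, so after normalization that dimension towers over the rest, which are squashed toward a fixed pattern. The dimension index and value below are arbitrary:

```python
# Toy illustration (arbitrary numbers): LayerNorm output for a token with and
# without one injected massive activation.
import torch

ln = torch.nn.LayerNorm(4096)
h = torch.randn(4096)
h_massive = h.clone()
h_massive[1415] = 2000.0                 # inject a single massive value

print(ln(h).abs().max())                 # ~4: typical spread for Gaussian input
print(ln(h_massive).abs().max())         # ~64: the massive dim dominates...
print(ln(h_massive).abs().median())      # ...while every other dim shrinks toward 0
```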
Further, we dissect the output of the attention mechanism by focusing on tokens that receive a lot of attention due to massive activations. We break down the attention output for each token into two components: one that comes from the highly focused tokens and another from the rest. This analysis shows that the updates from the focused tokens act almost like a constant bias, even though they are not explicitly defined as such. This pattern remains consistent across different inputs, indicating a systematic way LLMs allocate attention to certain tokens, thereby introducing an implicit bias into the attention computation.
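This decomposition is straightforward to express on top of the attention probabilities and value vectors; the sketch below assumes you have already extracted those tensors (e.g., with `output_attentions=True`) and know the indices of the focus tokens:

```python
# Sketch of the decomposition: attention output = contribution from the
# massive-activation ("focus") tokens + contribution from everything else.
import torch

def decompose(attn: torch.Tensor, values: torch.Tensor, focus_idx: list[int]):
    # attn: (heads, q_len, k_len) post-softmax probabilities
    # values: (heads, k_len, head_dim) value vectors
    mask = torch.zeros(attn.shape[-1], dtype=torch.bool)
    mask[focus_idx] = True
    from_focus = attn[..., mask] @ values[:, mask]    # near-constant, bias-like part
    from_rest = attn[..., ~mask] @ values[:, ~mask]   # input-dependent part
    return from_focus, from_rest   # sums exactly to attn @ values
```

If `from_focus` barely changes across inputs, the focus tokens are effectively injecting a fixed additive vector into every attention output, which is the implicit-bias behavior described above.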
Moving on, we experiment with introducing explicit attention biases to see if we can eliminate the need for massive activations. By adding extra learnable parameters to the attention mechanism, we find that models equipped with these explicit biases do not show the massive activations characteristic of standard models. This suggests that explicit attention biases can replace the need for LLMs to develop massive activations during pre-training.
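In spirit, the augmentation amounts to concatenating one learnable key and value per head, giving the softmax a dedicated "sink" to dump probability on. The module below is a minimal sketch of this idea, not the authors' exact parameterization:

```python
# Minimal sketch of attention with explicit bias parameters: one learnable key
# k' and value v' per head are appended as a phantom token, so the softmax can
# assign probability there instead of to real tokens. Causal masking over the
# real tokens is omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiasedAttention(nn.Module):
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.heads, self.hd = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.k_bias = nn.Parameter(torch.zeros(heads, 1, self.hd))
        self.v_bias = nn.Parameter(torch.zeros(heads, 1, self.hd))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).view(B, T, 3, self.heads, self.hd).permute(2, 0, 3, 1, 4)
        # Append the learned bias key/value as a phantom (T+1)-th position.
        k = torch.cat([k, self.k_bias.expand(B, -1, -1, -1)], dim=2)
        v = torch.cat([v, self.v_bias.expand(B, -1, -1, -1)], dim=2)
        att = F.softmax(q @ k.transpose(-2, -1) / self.hd**0.5, dim=-1)
        return self.proj((att @ v).transpose(1, 2).reshape(B, T, C))
```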
We also investigate whether Vision Transformers (ViTs) exhibit similar massive activations. Unlike LLMs, ViTs mix tokens globally and are not autoregressive. Our findings show that massive activations are present in some ViT models but not all: for instance, CLIP and DINOv2 models show a few activations with significantly larger magnitudes, while MAE models do not. These activations in ViTs act as fixed biases crucial for the model's performance; interestingly, when we modify them, the performance varies, highlighting their importance.

Recently, the concept of register tokens was introduced to augment ViTs, leading to smoother attention maps and improved performance. Our analysis of these models reveals that massive activations are stored within specific register tokens, suggesting that these registers serve a similar purpose to the explicit biases we experimented with in LLMs. By replacing register features with their mean values, we demonstrate that these tokens act as learned biases, confirming their role in enhancing model performance.
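A hook-based sketch of this mean-replacement test might look as follows; the model handle `vit`, the block index, the register slots, and the precomputed means file are all hypothetical:

```python
# Hypothetical sketch: clamp a ViT's register tokens to their per-dimension
# means (precomputed over many images) and check that accuracy is preserved.
import torch

REGISTERS = slice(1, 5)                     # assume tokens 1-4 are registers (after CLS)
mean_reg = torch.load("register_means.pt")  # (4, embed_dim), computed offline

def freeze_registers(module, args, output):
    out = output[0] if isinstance(output, tuple) else output
    out[:, REGISTERS] = mean_reg.to(out.dtype)  # same values for every image
    return output

# `vit` is assumed to be a timm-style model exposing a `blocks` list:
handle = vit.blocks[-2].register_forward_hook(freeze_registers)
```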
To summarize, our findings highlight the significant role of massive activations in both LLMs and ViTs. In LLMs, these activations help focus attention on specific tokens, acting as implicit bias terms, and we show that introducing explicit biases can eliminate the need for them. In the context of ViTs, massive activations and register tokens both serve as mechanisms to introduce biases, improving model performance.

Section summary: In this section we explore how massive activations impact the self-attention mechanisms in language models (LLMs) and Vision Transformers (ViTs). We find that LLMs use massive activations to concentrate attention on specific tokens, introducing implicit bias terms in attention computations. Additionally, by augmenting LLMs with explicit attention biases, we can eliminate massive activations, showcasing an alternative approach to address attention biases in pre-training. In ViTs, massive activations are observed in certain models like CLIP and DINOv2, serving as fixed biases that influence attention patterns, similar to the function of register tokens proposed in ViTs.
Section: Related Work

In this section we delve into the fascinating characteristics of autoregressive Transformers, particularly focusing on the observations made with GPT-2 and other large language models (LLMs). We've noticed that in the layer just before the last one in GPT-2, there are certain feature dimensions where the activation levels can reach as high as 3,000. This discovery highlights that a small number of these dimensions play a significant role in several standard methods used to evaluate how similar representations are. We also observed that the magnitude of the feature representing the initial token in GPT-2 increases at a much faster rate compared to other tokens. Furthermore, our investigations have revealed the presence of exceptionally large weights in the LayerNorm component of both GPT-2 and LLaMA2-13B; when these weights are set to zero, the performance of the models drastically decreases. Interestingly, the feature dimension associated with this weight in LLaMA2-13B, which is 2,100, matches that of a very large activation.

When it comes to outlier features, various studies have explored their presence in LLMs, noting that these features often have high activation values across most of their sequence dimensions. Although at first glance massive activations might seem similar to outlier features, we've discussed their fundamental differences. More crucially, we demonstrate that massive activations cannot simply be explained by the presence of outlier features.

Our research also covers attention concentration patterns, where we found that attention in models like BERT often zeros in on specific tokens, such as the separator token [SEP]. Other studies have shown that LLMs tend to give most of their attention to the first word token, and have identified attention artifacts in Vision Transformers (ViTs), including sparse activation patterns that draw attention to certain tokens. Our work goes further by providing a detailed analysis of why these patterns occur, especially in relation to massive activations.

Regarding biases in the self-attention mechanism, it's clear that there are various types. For instance, simple additive bias terms can be introduced in the linear layers that compute the query, key, and value states. Additionally, position biases can be incorporated into self-attention to account for the positional information of each token. There are also different versions of biases that involve manually designed softmax operators. Our findings reveal that LLMs, even those using standard self-attention, inherently introduce implicit bias components into the attention calculation through massive activations.