Massive Activations in Large Language Models

Arxiv Papers
28 Feb 2024 · 21:32

Summary

TLDR: This work uncovers a surprising phenomenon in the internal representations of large language models (LLMs): a handful of activation values are orders of magnitude larger than the rest, which the authors call "massive activations". These massive activations act as fixed bias terms and have a substantial effect on model performance and on the attention mechanism. Moreover, replacing them with explicit attention bias terms removes the model's need to learn them. A similar phenomenon appears in Vision Transformers (ViTs), where massive activations play a role analogous to register tokens. The study offers a deeper understanding of the internal mechanisms of LLMs and ViTs.

Takeaways

  • 🤖 Understanding the internal mechanisms of large language models (LLMs) is important.
  • 🔍 A small number of activations inside LLMs are found to be extremely large; the authors name them "massive activations".
  • ✨ Despite being very few in number, massive activations have a major impact on LLM performance.
  • 📚 Massive activations appear across many layers of diverse LLMs and act as fixed, input-independent bias components.
  • 🌉 LLMs use massive activations to concentrate attention on specific tokens, introducing an implicit attention bias.
  • 🔧 Adding explicit attention biases to LLMs can remove the need for massive activations.
  • 👁️ Massive activations also exist in Vision Transformers, where they play a role similar to register tokens.
  • 🔬 The discovery of massive activations provides a new understanding of the internal mechanisms of LLMs.
  • 💡 Studying massive activations may inform the optimization and design of future models.
  • 🚀 A deeper understanding of these internal mechanisms could further improve LLM performance and broaden their applications.

Q & A

  • What did the authors discover when probing the internal mechanisms of large language models (LLMs)?

    -They found that the internal representations of LLMs contain "massive activations" that are roughly four orders of magnitude larger than the median activation. These play an important role in model performance.

  • In which layers of an LLM do massive activations appear?

    -Massive activations emerge abruptly in the early layers and diminish in the final layers; in many LLMs they stay roughly constant across the intermediate layers. In LLaMA-2-7B, for example, they appear from layer 2 and persist up to layer 30.

  • What function do massive activations serve?

    -Massive activations act as fixed bias terms in the internal computation of LLMs. Manipulating them causes a large drop in model performance, showing that they play an indispensable role in the computation.

  • How do massive activations affect the self-attention mechanism of LLMs?

    -In the layers after massive activations appear, attention tends to concentrate on the tokens associated with them. In other words, massive activations introduce an implicit bias into the attention distribution.

  • What happens when an explicit attention bias is added to an LLM?

    -When an explicit attention bias is added, the model no longer develops massive activations. This suggests that explicit biases can remove the need for massive activations.

  • What role do massive activations play in Vision Transformers (ViTs)?

    -Massive activations also exist in ViTs, where they likewise act as fixed bias terms. They are closely related to register tokens: both serve as biases that improve model performance.

  • How do massive activations differ from outlier features?

    -Massive activations are large scalar values tied to a few specific tokens, whereas outlier features are vectors that affect all tokens. Outlier features also occur at far more tokens than massive activations.

  • What notable observations have been made in GPT-2 and other LLMs?

    -In the layer just before the last one in GPT-2, feature values in certain dimensions have been observed to reach around 3,000. It has also been observed that the feature magnitude of the initial token grows much faster than that of other tokens.

  • What are attention concentration patterns, and what tendency do LLMs show?

    -Attention concentration patterns are cases where the self-attention mechanism focuses its attention on particular tokens. In LLMs, attention tends to concentrate on the initial token and on delimiter tokens.

  • How is an implicit bias introduced into the self-attention mechanism?

    -Even LLMs that use standard self-attention introduce an implicit bias component into the attention computation through massive activations, which causes attention to concentrate on specific tokens inside the model (a sketch of this decomposition follows this Q&A).
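
The decomposition mentioned in the last answer can be illustrated concretely. Below is a minimal sketch, not code from the paper: it splits the post-softmax attention output of each query token into the contribution coming from the tokens that carry massive activations and the contribution from all remaining tokens; the tensor shapes and the `massive_idx` positions are assumptions made for illustration.

```python
# Minimal sketch: split the attention output into the part contributed by
# massive-activation tokens and the part contributed by everything else.
import torch

def decompose_attention_output(attn_probs, values, massive_idx):
    """
    attn_probs: (num_heads, seq_len, seq_len) post-softmax attention weights
    values:     (num_heads, seq_len, head_dim) value vectors
    massive_idx: token positions associated with massive activations (assumed known)
    Returns (bias_like, remainder): the two additive components of the output.
    """
    mask = torch.zeros(attn_probs.shape[-1], dtype=torch.bool)
    mask[massive_idx] = True

    # Contribution of the massive-activation tokens: behaves like a near-constant bias.
    bias_like = torch.einsum("hqk,hkd->hqd", attn_probs[..., mask], values[:, mask])
    # Contribution of every other token.
    remainder = torch.einsum("hqk,hkd->hqd", attn_probs[..., ~mask], values[:, ~mask])
    return bias_like, remainder

# Toy usage with random tensors (no real model involved).
probs = torch.softmax(torch.randn(8, 5, 5), dim=-1)
vals = torch.randn(8, 5, 16)
bias_like, rest = decompose_attention_output(probs, vals, massive_idx=[0, 3])
print(bias_like.shape, rest.shape)  # torch.Size([8, 5, 16]) torch.Size([8, 5, 16])
```

According to the study, the first component changes little across inputs, which is why it behaves like a hidden additive bias.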

Outlines

00:00

🚀 Internal dynamics of large language models

This segment explores the internal mechanisms of large language models (LLMs) and introduces the surprising discovery of "massive activations". Massive activations take values orders of magnitude larger than other activations and turn out to play an important role inside the model. They are tied to specific tokens, appear across many layers of LLMs, and function as bias terms that strongly influence model performance.

05:02

👾 Which LLM layers contain massive activations

This segment investigates the layers in which massive activations appear most prominently. In several LLMs, these activations emerge abruptly in an early layer, remain roughly constant through the intermediate layers, and decrease in the last layers. Massive activations are also tied to specific feature dimensions and token positions in the sequence. They are distinct from outlier features and behave like scalar values.

10:04

💡 Massive activations act as bias terms in LLMs

This segment shows that massive activations function as bias terms that are essential to the internal computation of LLMs. Setting them to zero caused a large drop in model performance, whereas setting them to their mean values had almost no effect. Furthermore, from the layer where massive activations appear onward, attention concentrates on the tokens associated with them. This reveals how LLMs learn a fixed bias and concentrate attention through self-attention.

15:06

🧩 Implicit attention bias via massive activations

This segment examines how LLMs use massive activations to implicitly bias the self-attention mechanism. Attention concentrates on the tokens associated with massive activations, introducing an implicit bias term into the attention computation. It also shows that introducing explicit attention biases allows models to avoid massive activations. Massive activations are likewise observed in Vision Transformers, where they play a role similar to register tokens.

20:07

🔍 Related prior work

This segment reviews prior findings on autoregressive Transformers. Very large activations in specific dimensions had already been observed in GPT-2 and other LLMs. It also touches on related topics such as outlier features, attention concentration patterns, and bias terms in the self-attention mechanism. These findings are connected to the discovery of massive activations and deepen our understanding of the internal mechanisms of LLMs.

Keywords

💡Large Language Model (LLM)

A large neural network trained on vast amounts of data that can generate and understand natural language with remarkable fluency, and that is increasingly used in practice across fields such as business and education. The video explores the complex internal structure of LLMs and the surprising mechanisms that emerge inside them.

💡Massive Activations

Extremely large activation values in the internal representations of an LLM. They are orders of magnitude larger than ordinary activations and occur only rarely, yet they function as fixed biases that are essential to the model's computation. Their existence is the central finding of the video and a key to understanding the internal mechanisms of LLMs.

💡Self-Attention Mechanism

The core technique in the Transformer architecture for capturing relationships between tokens: each token computes how much attention to pay to every other token and updates its representation accordingly. The video shows that massive activations strongly influence this mechanism; when a token's activation becomes very large, the self-attention computation concentrates on that token.

💡Normalization

A technique for keeping activation scales well behaved during training and inference of neural networks. The video suggests that the emergence of massive activations is related to per-layer normalization. Normalization stabilizes learning by rescaling activations.

💡Pre-training

Training an LLM in advance on large amounts of data and compute. The video shows that during pre-training LLMs develop massive activations and use them as fixed biases. In fact, introducing explicit bias parameters suppresses the emergence of massive activations, suggesting that LLMs discover this internal workaround during pre-training.

💡Vision Transformer (ViT)

A Transformer model specialized for image recognition. The video shows that massive activations are also observed in some ViTs, where they function as fixed biases for the patch tokens. They are also related to the recently proposed idea of "register tokens".

💡Attention Concentration

The tendency of the self-attention mechanism to concentrate attention on particular tokens. Prior work has observed that LLMs often focus much of their attention on the first token. A central finding of the video is that massive activations are the mechanism that draws attention to specific tokens.

💡Permutation

Reordering, rearranging, or shuffling; here, changing the order of tokens. The video shows that massive activations occur at particular positions such as the first token, indicating that these activations do not depend on how the remaining tokens are arranged.

💡Outlier Features

Feature dimensions in an LLM's activation distribution whose values deviate strongly from the mean. Their existence had been reported in earlier work, but the video shows that massive activations are clearly distinct from outlier features: massive activations are large scalar values tied to specific tokens, whereas outlier features are large vectors that affect all tokens.

💡Fixed Bias

A quantity in an LLM's internal computation that takes a constant value independent of the input. The video shows that massive activations function as such fixed bias components: they strongly affect model performance, yet their values stay essentially constant across inputs, so they act as simple scalar biases.

Highlights

We found instances of certain activations that were astonishingly large, more than four orders of magnitude larger than the median and sometimes exceeding absolute values of 15,000, even in models like LLaMA-2-70B that incorporate normalization layers.

These massive activations are not just large; they are also incredibly rare, often appearing fewer than 10 times among tens of millions of activations.

We have aptly named them "massive activations", and our research shows that these are not isolated incidents but occur across a wide variety of LLMs, regardless of their size or family.

Massive activations are distinct from outlier features previously identified in LLMs. We discovered that they act as fixed but essential bias terms within the models, similar to the bias term in a linear layer.

For example, in LLaMA-2-7B, nullifying just four of these massive activations led to a dramatic drop in model performance, underscoring their critical role.

However, adjusting them to their mean values did not adversely affect the model, suggesting they function as simple constant biases.

We found a strong connection between massive activations and self-attention mechanisms, with massive activations drawing attention to their associated tokens.

This extends the concept of attention sinks and provides a deeper understanding of how attention concentration patterns develop in LLMs.

We hypothesize that LLMs attempt to learn implicit bias components in self-attention through massive activations during their pre-training phase.

We also found massive activations in Vision Transformers (ViTs), albeit less frequently; they act as fixed biases that appear at fixed feature dimensions but vary across patch tokens.

This similarity led us to draw connections between massive activations and the concept of register tokens in ViTs, suggesting an alternative interpretation of their function as fixed biases rather than aggregators of global image information.

We identified the layers in LLMs where massive activations occur, showing that these activations remain constant across most intermediate layers and emerge rapidly in specific layers such as layers 2 and 4.

We determined the feature and sequence dimensions of these massive activations in models like LLaMA-2-7B, LLaMA-2-13B, and Mixtral-8x7B, finding consistent patterns such as activations at starting tokens and specific word tokens.

Massive activations are distinct from outlier features: massive activations are scalar values at a few tokens, while outlier features are vectors across all tokens and layers.

By modifying the massive activations and observing the impact on model performance, we found that they act as crucial fixed biases in LLMs, significantly influencing their computational processes.

Attention in LLMs is concentrated on tokens associated with massive activations, influencing the distribution of attention logit values and shedding light on the internal mechanism of LLMs.

LLMs use massive activations to concentrate attention on specific tokens, introducing implicit bias terms into attention computations.

By augmenting LLMs with explicit attention biases, we can eliminate massive activations, showcasing an alternative approach to addressing attention biases in pre-training.

In ViTs, massive activations are observed in certain models like CLIP and DINOv2, serving as fixed biases that influence attention patterns, similar to the function of register tokens proposed for ViTs.

Transcripts

00:00

Introduction

In this section we delve into the fascinating world of large language models (LLMs) and their internal dynamics. Our journey begins with an acknowledgement of the impressive feats achieved by these models, which have been primarily assessed through their external behaviors such as task performance and response accuracy. However, we believe it is equally important to understand what goes on under the hood, especially as these models find their way into various real-world applications. Despite its significance, the exploration of LLMs' internal mechanisms has been somewhat limited.

Our investigation led us to a surprising discovery within the internal representations of LLMs. When we looked into the hidden states of these models, we found instances of certain activations that were astonishingly large, more than four orders of magnitude larger than the median and sometimes exceeding absolute values of 15,000, even in models like LLaMA-2-70B that incorporate normalization layers. These massive activations are not just large, they are also incredibly rare, often appearing fewer than 10 times among tens of millions of activations. Given their significant size difference compared to other activations, we have aptly named them "massive activations". Our research shows that these are not isolated incidents but occur across a wide variety of LLMs, regardless of their size or family.

We took a closer look at where these massive activations are situated within the LLMs and found that their emergence is quite sudden, appearing abruptly after a single layer of computation and then diminishing in the final layers. Interestingly, these activations are not tied to specific inputs but occur in a small number of feature dimensions, often associated with the starting word token and delimiter tokens. It is crucial to note that massive activations are distinct from outlier features previously identified in LLMs.

We discovered that they act as fixed but essential bias terms within the models, similar to the bias term in a linear layer equation. For example, in LLaMA-2-7B, nullifying just four of these massive activations led to a dramatic drop in model performance, underscoring their critical role. However, adjusting them to their mean values did not adversely affect the model, suggesting they function as simple constant biases. Our analysis further reveals that after the initial layers, LLMs repurpose the tokens associated with massive activations to store these crucial biases.

Intriguingly, we found a strong connection between massive activations and self-attention mechanisms, with massive activations drawing attention to their associated tokens. This extends the concept of attention sinks and provides a deeper understanding of how attention concentration patterns develop in LLMs. We hypothesize that LLMs attempt to learn implicit bias components in self-attention through massive activations during their pre-training phase. To test this, we experimented with augmenting self-attention with additional key and value embeddings explicitly designed as biases. Remarkably, this approach eliminated the need for LLMs to learn massive activations.

Our observations of massive activations are not limited to LLMs; we also found them in Vision Transformers (ViTs), albeit less frequently. As in LLMs, these activations in ViTs act as fixed biases and are found at fixed feature dimensions but vary across patch tokens. This similarity led us to draw connections between massive activations and the concept of register tokens in ViTs, suggesting an alternative interpretation of their function as fixed biases rather than aggregators of global image information.

In summary, our exploration into the internal workings of LLMs has unveiled the critical role of massive activations and their implications for model behavior and performance. This insight not only enriches our understanding of LLMs but also opens new avenues for optimizing and designing future models.

Section summary: In this section we investigate a surprising discovery in large language models (LLMs), where certain activations within the models exhibit significantly larger magnitudes, termed massive activations. These massive activations, despite being rare and few in number, play a crucial role as fixed bias terms in LLMs, influencing model performance and attention mechanisms. We find that these massive activations are not unique to specific LLMs but are observed across various models, showcasing their importance in understanding the internal mechanisms of these models.
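
As a concrete illustration of the measurement described above, here is a minimal sketch assuming a Hugging Face causal LM; the checkpoint name is an assumption, and any model that exposes hidden states will do. It compares the largest absolute activation in each layer's hidden states against the median and flags layers where the gap spans several orders of magnitude.

```python
# Minimal sketch: flag layers whose largest activation dwarfs the median.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint; swap in any causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Summer is warm. Winter is cold."
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer_idx, h in enumerate(out.hidden_states):  # h: (1, seq_len, hidden_dim)
    abs_h = h[0].abs().float()
    top = abs_h.max().item()
    median = abs_h.median().item()
    # Rough threshold in the spirit of "orders of magnitude larger than the median".
    if top > 1000 * median:
        print(f"layer {layer_idx}: max |act| = {top:.0f}, median = {median:.3f}")
```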

04:57

Which layers

In this section we explore which layers of certain language models show the most significant activations, meaning the points where the model's output is most intense. We looked at the output of each layer of three different models, LLaMA-2-7B, LLaMA-2-13B, and Phi-2, and averaged the results over 100 sequences. Our findings show that the most intense activations tend to occur at the same spot across many of the middle layers of these models. Specifically, these intense activations start appearing in the early layers, remain fairly constant through the middle layers, and then start to decrease in the last few layers. For example, in the LLaMA-2-7B model these activations first show up in layer 2 and stay consistent up to layer 30. Interestingly, in both the LLaMA-2-7B and 13B models these activations appear suddenly rather than building up gradually, suggesting they are caused by a different mechanism than we might expect.

Next we delve into where within the model's hidden states these intense activations occur, focusing on their feature and sequence dimensions. By examining a middle layer, we found that in the LLaMA-2-7B model two specific feature dimensions show these activations. As for the sequence dimensions, these activations are linked to the first word token and the token representing either a period or a new line in the sequence. This pattern holds true even in longer sequences, with up to four intense activations occurring when these specific tokens are present. In the LLaMA-2-13B model, the intense activations are consistently found in two feature dimensions and are always associated with the starting token of the sequence. Another model, Mixtral-8x7B, shows a similar pattern but includes additional tokens like "and" and "of" as points of intense activation.

Summarizing our findings across various models, we noticed that intense activations are typically found in very few feature dimensions. When it comes to sequence dimensions, we categorize the models based on where these activations occur: some models show them only at the starting token, others at the starting token and the first strong delimiter, and a third group shows them at the starting token, various delimiters, and certain low-semantic words.

Lastly, we differentiate these intense activations from outlier features, another phenomenon observed in language models. While both involve high activation magnitudes, intense activations are scalar values linked to specific tokens, whereas outlier features are vectors affecting all tokens. Moreover, intense activations occur at far fewer tokens than outlier features. In our analysis of the LLaMA-2-7B and 13B models, we identified several outlier features but found that none of them matched the feature dimensions of the intense activations, highlighting a clear distinction between the two phenomena.

Section summary: In this section we identify the layers in LLMs where massive activations occur, showing that these activations remain constant across most intermediate layers and emerge rapidly in specific layers such as layers 2 and 4. We also determine the feature and sequence dimensions of these massive activations in models like LLaMA-2-7B, LLaMA-2-13B, and Mixtral-8x7B, finding consistent patterns such as activations at starting tokens and specific word tokens. Additionally, we differentiate massive activations from outlier features, highlighting that massive activations are scalar values at a few tokens, distinct from outlier features, which are vectors across all tokens and layers.
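
To make the feature- and sequence-dimension analysis above concrete, here is a minimal sketch that reuses `model`, `tok`, `inputs`, and `out` from the earlier sketch (an assumption); it lists the (token position, feature dimension) coordinates of the largest activations in one intermediate layer. The layer index is an illustrative choice, not a value from the paper.

```python
# Minimal sketch: locate the (token, feature) coordinates of the largest activations.
import torch

layer_idx = 3                               # an assumed early/middle layer
h = out.hidden_states[layer_idx][0].abs()   # (seq_len, hidden_dim)

values, flat_idx = torch.topk(h.flatten().float(), k=10)
seq_idx = flat_idx // h.shape[1]            # token position in the sequence
feat_idx = flat_idx % h.shape[1]            # feature dimension

tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0])
for v, s, f in zip(values.tolist(), seq_idx.tolist(), feat_idx.tolist()):
    print(f"|act| = {v:8.1f} at token {s} ({tokens[s]!r}), feature dim {f}")

# Massive activations should show up at only a handful of (token, feature) pairs,
# typically the starting token and delimiter tokens, in one or two feature
# dimensions. Outlier features, by contrast, appear as a whole feature dimension
# with elevated values across most token positions.
```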

08:54

Massive activations act as biases in LLMs

In this section we explore the significant role that massive activations play within large language models (LLMs). Initially we were curious about whether these massive activations were crucial for the model's internal computations or merely superfluous. To get a clearer understanding, we decided to actively manipulate these massive activations and observe how such changes impact the LLM's behavior.

We began by examining the variability of these massive activations across different input sequences. In addition to focusing on massive activations, we also looked at three other types of activations based on their average sizes, specifically the top 1%, the top 10%, and the median within the hidden states. By analyzing 100 sequences in two versions of the LLaMA model, we noticed that the variability of massive activations was relatively small compared to their average values, especially when compared to the other types of activations.

To delve deeper, we modified the LLM's inference process by adjusting the massive activations at a specific layer: for any hidden state that showed massive activations, we manually set these activations to fixed values. After making these adjustments, we continued with the normal computation process in the subsequent layers. We conducted this experiment on both the LLaMA-2-7B and 13B models and assessed the impact on various benchmarks, including perplexity on datasets like WikiText, C4, and PG-19, and mean zero-shot accuracy on tasks such as BoolQ, PIQA, WinoGrande, ARC-Easy, and ARC-Challenge.

One of our interventions involved setting the massive activations to zero at the point they first appeared within the hidden states. Surprisingly, this led to a significant drop in the model's performance, with a notable increase in perplexity. For comparison, we also set an equal number of activations, those with average magnitudes close to the median, to zero and observed no performance degradation. This experiment highlighted the indispensable role of massive activations in the internal workings of LLMs. In another experiment, we adjusted the massive activations to their empirical mean values calculated over 100 sequences. This intervention resulted in minimal changes in both perplexity and zero-shot accuracy, suggesting that the values of massive activations are constant and not dependent on the input, thus acting similarly to bias terms. To summarize, our findings indicate that massive activations function as crucial fixed biases within LLMs, significantly influencing their computational processes.

Moving on to the effects on attention, we investigated how massive activations influence the self-attention mechanism in LLMs. We observed a clear shift in attention patterns before and after the emergence of massive activations. Specifically, in the layers following the appearance of massive activations, attention was predominantly focused on the tokens associated with these activations. This pattern was consistent across different LLMs, including LLaMA-2-7B, LLaMA-2-13B, and Phi-2, when processing the same input. Our analysis showed that the attention logits, which are calculated before the softmax operation and determine the distribution of attention, were mostly negative, except those associated with tokens having massive activations, where they turn slightly positive. This means that during the softmax calculation, the tokens with massive activations attract the majority of the attention probability. Interestingly, while previous research highlighted that LLMs tend to heavily focus on the starting token, our findings suggest that LLMs also allocate significant attention to other tokens linked with massive activations. This reveals a deeper reason behind the emergence of these attention concentration patterns, underscoring the importance of massive activations in shaping the behavior of LLMs.

Section summary: In this section we delve deeper into large language models (LLMs) to investigate the role of massive activations within them. By modifying these massive activations and observing the impact on model performance, we find that they act as crucial fixed biases in LLMs, significantly affecting internal computations. Additionally, we discover that attention in LLMs is concentrated on tokens associated with massive activations, influencing the distribution of attention logit values and shedding light on the internal mechanism of LLMs.
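
The intervention experiment described above can be sketched as follows. This is illustrative code, not the authors' implementation: it assumes a LLaMA-style module layout (`model.model.layers[i]`) and that the (token, feature) coordinates of the massive activations have already been identified; the coordinates shown are placeholders.

```python
# Minimal sketch: overwrite a few scalar activations at one layer and let the
# remaining layers run, mimicking the zero / mean interventions described above.
import torch

LAYER = 2                           # layer after which to intervene (assumed)
POSITIONS = [(0, 1415), (0, 2533)]  # (token position, feature dim); placeholders
FIXED_VALUE = 0.0                   # set to the empirical mean instead of 0.0
                                    # to reproduce the "mean" intervention

def intervene(module, args, output):
    hidden = output[0] if isinstance(output, tuple) else output  # (batch, seq, dim)
    for tok_pos, feat_dim in POSITIONS:
        hidden[:, tok_pos, feat_dim] = FIXED_VALUE
    return output

handle = model.model.layers[LAYER].register_forward_hook(intervene)
with torch.no_grad():
    logits = model(**inputs).logits  # perplexity / accuracy can then be compared
handle.remove()                      # against the unmodified model
```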

13:53

Massive activations impose implicit attention biases

In this section we explore how large language models (LLMs) use massive activations to implicitly influence the focus of self-attention mechanisms. Specifically, we examine the roles of the attention layer normalization and the projections for queries, keys, and values (QKV). In LLMs, input features are first normalized and then transformed into these QKV states. This approach, which was introduced with GPT-2, is now common practice in many modern LLMs. Our observations, particularly from analyzing the LLaMA-2-7B model, reveal that the features of the tokens that carry massive activations stand out significantly from the rest right after normalization; these tokens then show less variation in their subsequent QKV states. We suggest that the attention layer normalization might be crucial in this phenomenon.

Further, we dissect the output of the attention mechanism by focusing on the tokens that receive a lot of attention due to massive activations. We break down the attention output of each token into two components: one that comes from the highly focused tokens and another from the rest. This analysis shows that the updates from the focused tokens act almost like a constant bias, even though they are not explicitly defined as such. This pattern remains consistent across different inputs, indicating a systematic way in which LLMs allocate attention to certain tokens, thereby introducing an implicit bias into the attention computation.

Moving on, we experiment with introducing explicit attention biases to see if we can eliminate the need for massive activations. By adding extra learnable parameters to the attention mechanism, we find that models equipped with these explicit biases do not show the massive activations characteristic of standard models. This suggests that explicit attention biases can replace the need for LLMs to develop massive activations during pre-training.

We also investigate whether Vision Transformers (ViTs) exhibit similar massive activations. Unlike LLMs, ViTs mix tokens globally and are not autoregressive. Our findings show that massive activations are present in some ViT models but not all: for instance, CLIP and DINOv2 models show a few activations with significantly larger magnitudes, while MAE models do not. These activations in ViTs act as fixed biases crucial for the model's performance; interestingly, when we modify these activations, the performance varies, highlighting their importance.

Recently, the concept of register tokens was introduced to augment ViTs, leading to smoother attention maps and improved performance. Our analysis of these models reveals that massive activations are stored within specific register tokens, suggesting that these registers serve a similar purpose to the explicit biases we experimented with in LLMs. By replacing register features with their mean values, we demonstrate that these tokens act as learned biases, confirming their role in enhancing model performance.

To summarize, our findings highlight the significant role of massive activations in both LLMs and ViTs. In LLMs, these activations help focus attention on specific tokens, acting as implicit bias terms; we show that introducing explicit biases can eliminate the need for these massive activations. In the context of ViTs, massive activations and register tokens both serve as mechanisms to introduce biases, improving model performance.

Section summary: In this section we explore how massive activations impact the self-attention mechanism in language models (LLMs) and Vision Transformers (ViTs). We find that LLMs use massive activations to concentrate attention on specific tokens, introducing implicit bias terms into attention computations. Additionally, by augmenting LLMs with explicit attention biases, we can eliminate massive activations, showcasing an alternative approach to addressing attention biases in pre-training. In ViTs, massive activations are observed in certain models like CLIP and DINOv2, serving as fixed biases that influence attention patterns, similar to the function of register tokens proposed for ViTs.
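
The "explicit attention bias" idea described above can be sketched as a single-head attention layer with learnable key and value vectors that every query can attend to. This is an illustration under assumptions, not the authors' implementation; multi-head structure and the causal mask are omitted for brevity.

```python
# Minimal sketch: self-attention augmented with learnable key/value bias vectors,
# so the model does not have to manufacture a bias through massive activations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionWithExplicitBias(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.k_bias = nn.Parameter(torch.zeros(1, 1, dim))  # learnable k'
        self.v_bias = nn.Parameter(torch.zeros(1, 1, dim))  # learnable v'
        self.scale = dim ** -0.5

    def forward(self, x):                                    # x: (batch, seq, dim)
        b = x.shape[0]
        q, k, v = self.q(x), self.k(x), self.v(x)
        k = torch.cat([self.k_bias.expand(b, -1, -1), k], dim=1)  # prepend bias key
        v = torch.cat([self.v_bias.expand(b, -1, -1), v], dim=1)  # prepend bias value
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v  # the bias value acts like a learned additive offset

x = torch.randn(2, 16, 64)
print(AttentionWithExplicitBias(64)(x).shape)  # torch.Size([2, 16, 64])
```

The extra value vector plays the role of the additive offset that standard models otherwise appear to smuggle into the attention output through massive activations at real tokens.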

18:42

Related work

In this section we delve into the fascinating characteristics of autoregressive Transformers, particularly focusing on observations made with GPT-2 and other large language models (LLMs). We have noticed that in the layer just before the last one in GPT-2, there are certain feature dimensions where the activation levels can reach as high as 3,000. This discovery highlights that a small number of these dimensions play a significant role in several standard methods used to evaluate how similar representations are. We also observed that the magnitude of the feature representing the initial token in GPT-2 increases at a much faster rate than that of other tokens. Furthermore, our investigations have revealed the presence of exceptionally large weights in the LayerNorm component of both GPT-2 and LLaMA-2-13B; when these weights are set to zero, the performance of the models drastically decreases. Interestingly, the feature dimension associated with this weight in LLaMA-2-13B, which is 2,100, matches that of a very large activation.

When it comes to outlier features, various studies have explored their presence in LLMs, noting that these features often have high activation values across most of their sequence dimensions. Although at first glance massive activations might seem similar to outlier features, we have discussed their fundamental differences; more crucially, we demonstrate that massive activations cannot simply be explained by the presence of outlier features.

Our research also covers attention concentration patterns, where we found that attention in models like BERT often zeros in on specific tokens such as the separator token [SEP]. Other studies have shown that LLMs tend to give most of their attention to the first word token and have identified attention artifacts in Vision Transformers (ViTs), including sparse activation patterns that draw attention to certain tokens. Our work goes further by providing a detailed analysis of why these patterns occur, especially in relation to massive activations.

Regarding biases in the self-attention mechanism, there are various types. For instance, simple additive bias terms can be introduced in the linear layers that compute the query, key, and value states. Additionally, position biases can be incorporated into self-attention to account for the positional information of each token. There are also variants that involve manually designed softmax operators. Our findings reveal that LLMs, even those using standard self-attention, inherently introduce implicit bias components into the attention calculation through massive activations.
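
As a loose illustration of the bias variants listed above, the sketch below shows a learnable additive position bias added to the attention logits and one example of a manually designed softmax (a "+1" term in the denominator that lets a query attend to nothing). Both are generic examples rather than the specific formulations of any cited work.

```python
# Minimal sketch of two attention-bias variants: an additive position bias on the
# logits, and a softmax whose denominator includes an extra constant term.
import torch
import torch.nn.functional as F

def softmax_plus_one(logits, dim=-1):
    # exp(x_i) / (1 + sum_j exp(x_j)), computed in a numerically stable way.
    m = logits.max(dim=dim, keepdim=True).values
    exp = torch.exp(logits - m)
    return exp / (torch.exp(-m) + exp.sum(dim=dim, keepdim=True))

seq_len, dim = 8, 32
q = torch.randn(seq_len, dim)
k = torch.randn(seq_len, dim)
pos_bias = torch.nn.Parameter(torch.zeros(seq_len, seq_len))  # learnable position bias

logits = q @ k.T / dim ** 0.5 + pos_bias
print(F.softmax(logits, dim=-1).sum(-1))         # each row sums to 1.0
print(softmax_plus_one(logits, dim=-1).sum(-1))  # each row sums to less than 1.0
```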