Paper deep dive: Evolutionary Optimization of Model Merging Recipes

DataScienceCastnet
21 Mar 2024 · 40:00

Summary

TLDR This video covers work by researchers at Sakana Lab who use evolutionary algorithms to automate model merging. They combine multiple models specialized for different tasks into new foundation models, in particular improving the ability to answer math problems posed in Japanese. The evolutionary optimization considers both the parameter space and the data flow space to find effective merges. As a result, accuracy on Japanese math problems improves substantially, and by also merging in a vision model, the resulting model can answer questions about images as well.

Takeaways

  • 🌟 Model merging combines multiple models specialized for different tasks into a single, more capable foundation model.
  • 🔍 Sakana Lab in Japan researches model merging using methods inspired by swarm intelligence and biology.
  • 🧬 They apply evolutionary algorithms to automate model merging in both parameter space and data flow space.
  • 🔗 The goal of model merging is to integrate knowledge from different domains and produce models with new abilities and skills.
  • 🤖 Using evolutionary algorithms avoids the black-art manual tuning and finds good merges automatically.
  • 📈 The results show higher accuracy on Japanese math problems and a successful merge of an image-understanding model with a Japanese language model.
  • 🌐 Model merging is still a young field, and much more research is expected.
  • 🚀 Sakana Lab's work offers an innovative approach to foundation model development.
  • 📚 Evolutionary algorithms help explore a large parameter space and find the best model combinations.
  • 🎯 It is important to select candidate models against an evaluation metric and let the evolutionary algorithm search for the best parameters.
  • 🌈 Multi-objective genetic algorithms make it possible to optimize several different objectives at once.

Q & A

  • What kind of research does Sakana Lab do?

    -Sakana Lab is an AI research lab working on a range of interesting topics such as swarm intelligence, biologically inspired methods, evolutionary algorithms, and artificial life.

  • What is model merging?

    -Model merging creates a stronger foundation model by combining existing models. Today it relies on human intuition and domain knowledge, something of a black art, and this work tries to automate it with an evolutionary approach.

  • How are the evolutionary algorithms applied?

    -They are applied to model merging in both parameter space and data flow space. This makes it possible to combine models from different domains and to build a culturally aware visual language model.

  • What are TIES-Merging and DARE?

    -TIES-Merging trims redundant parameter updates and resolves sign conflicts between the models being merged. DARE randomly drops a fraction of the fine-tuned parameter updates and rescales the remaining ones. Both reduce interference and lead to better merges.

  • What is Franken-merging?

    -Franken-merging combines layers from different models of the same shape, stacking them to produce a new, deeper model. It is based on the intuition that each layer of a Transformer makes only small changes to the data passing through it.

  • What is the CMA evolution strategy?

    -CMA-ES is an evolutionary algorithm for multivariate optimization. It samples candidates from a distribution over the parameter space, selects the most promising ones, and updates the distribution toward them, gradually approaching an optimum.

  • What is the challenge of merging in data flow space?

    -The search space becomes extremely large and computationally intractable. To deal with this, the paper fixes the layer ordering, stacks the layers repeatedly, and learns only which layers to include.

  • How were the merged models evaluated?

    -They were evaluated on a test set separate from the data used during the search. The evolutionary search discovered good weights and scaling factors, and the combined models were confirmed to perform as expected.

  • What results did Sakana Lab obtain?

    -The work produced a culturally aware Japanese visual language model that can correctly answer Japanese math problems and questions about images, integrating knowledge from different domains into a more comprehensive model.

  • What is the significance of this research?

    -It shows that evolutionary algorithms can automate model merging and integrate knowledge from different domains to create more capable AI models. The technique also applies to general language models and visual language models, broadening the range of AI applications.

Outlines

00:00

🌟 Introduction to evolutionary optimization and model merging

This section introduces evolutionary optimization and model merging. Sakana Lab, a Japanese AI research lab, develops new AI models using methods inspired by swarm intelligence and biology. Their goal is to build stronger foundation models by combining multiple existing models, in an automated way that does not depend on human intuition or specialised knowledge.

05:02

🧠 Model merging techniques and the use of evolutionary algorithms

This section covers the various model merging techniques and how evolutionary algorithms are used. Model merging creates new models by combining existing ones; compared with the traditional approach, which relies on human intuition and knowledge, the automated approach is shown to work better. Evolutionary algorithms automate the optimization over weights and over the data flow space, so new models can be created faster and more efficiently.

10:05

🔍 Background and theory of model merging

This section digs into the background and rationale behind model merging: how it works, and why combining knowledge from different tasks and different models can give better results. It also discusses the difference between merging in data flow space and in parameter space, and the thinking behind each, providing the foundation needed to understand model merging.

15:06

🧬 Details of the evolutionary algorithm and data-flow optimization

This section explains the evolutionary algorithm in detail and how optimization works in the data flow space. The evolutionary algorithm explores a large parameter space to find good model combinations, while the data-flow optimization decides how the models' layers are stacked. Concrete examples illustrate how these methods work in practice.

20:06

📈 Results and evaluation of model merging

This section describes the results and how they were evaluated. The Sakana Lab researchers merged models using the evolutionary algorithm and evaluated the results on separate datasets, showing that the new models reach the expected high accuracy. The method also carries over to tasks in other domains, which speaks to the flexibility and broad applicability of model merging.

25:09

🌐 Applications and outlook

The final section discusses applications and the future outlook. The researchers hope the technique will perform well on a variety of real tasks rather than just placing models on leaderboards. Combining evolutionary algorithms with model merging enables the vision of assembling many small models into a single powerful foundation model, further extending what AI can do.

Keywords

💡Evolutionary optimization

Evolutionary optimization refers to algorithms that apply the principles of biological evolution to problem solving and to finding good solutions. In this video, evolutionary optimization is applied to model merging to create stronger foundation models.

💡Model merging

Model merging combines several machine-learning models into one, drawing on each model's strengths to gain new capabilities. The video shows how model merging is combined with evolutionary optimization, making it possible to integrate knowledge from different domains.

💡Sakana Lab

Sakana Lab is an AI research lab based in Japan, led by David Ha. Its research focuses on themes inspired by swarm intelligence and biology. This video covers the lab's model-merging research project.

💡Swarm intelligence

Swarm intelligence is the ability of individuals acting as a group to find more efficient and better solutions, inspired by natural phenomena such as flocks of birds and schools of fish. The video explains how swarm-intelligence ideas are applied to AI and to the model-merging optimization process.

💡Biologically inspired methods

Biologically inspired methods bring ideas and concepts from biology into AI. The video shows how the Sakana Lab researchers draw on biological evolution and collective behaviour in developing their model-merging techniques.

💡Artificial life

Artificial life is the study of imitating living phenomena and biological processes using computer simulation or machines. The video notes how artificial-life concepts feed into the model-merging work and the development of new AI techniques.

💡Cross-domain merging

Cross-domain merging combines models with expertise in different fields or topics to gain new abilities or understanding. The video gives the example of combining a Japanese language model with a math model to produce a model that can solve math problems posed in Japanese.

💡Evolution strategies

Evolution strategies are optimization algorithms that imitate the principles of biological evolution. In the video they are applied to model merging, searching for the best combination of models.

💡CMA

CMA-ES stands for Covariance Matrix Adaptation Evolution Strategy, a form of evolutionary optimization. It repeatedly samples candidates from a distribution, selects the best ones, and updates the distribution so that it moves toward better solutions. In the video, CMA-ES is used in the merging process to search for good model combinations.

💡Task vector

A task vector is the difference in weights between a model fine-tuned for a specific task and its base model, representing the direction of that task in weight space. In the video, task vectors are used in the merging process to decide how the models should be combined.

💡DARE

DARE (Drop And REscale) is a technique that randomly drops a fraction of the fine-tuned weight updates and rescales the remaining ones, which reduces interference when merging. In the video, DARE is used in the merging process to obtain better results.

💡Franken-merging

Franken-merging builds a new model by combining layers from different models: layers are taken from multiple models of the same shape and stacked in sequence to form a new, deeper model. In the video, Franken-merging is used to explore merges in the data flow space.

Highlights

The paper presents a novel application of evolutionary algorithms for model merging in AI.

Sakana Lab's approach aims to automate the creation of powerful foundation models through model merging.

The study focuses on cross-domain merging, combining models from different domains like a Japanese model with a math reasoning model.

The paper introduces a method to optimize beyond just the weights of individual models, facilitating cross-domain merging.

The authors discuss the concept of model merging as a form of black art or alchemy, highlighting the need for a more systematic approach.

The paper explores various techniques for model merging, including linear interpolation, task vector combination, and Franken merging.

The authors propose an automated approach to model merging, eliminating the need for human intuition and domain knowledge.

The study introduces a method for combining models in a way that reduces interference and improves performance.

The paper discusses the use of evolutionary algorithms, specifically the CMA-ES (Covariance Matrix Adaptation Evolution Strategy) for optimization.

The authors demonstrate that their method can create a culturally aware Japanese visual language model by merging models with different capabilities.

The study shows that the merged models can perform better than individual models on Japanese math questions.

The paper presents a technique that can combine a vision model with a Japanese model to answer visual questions in Japanese.

The authors propose a unified framework that can perform both parameter space and data flow space merging.

The study suggests the potential for a swarm of specialized models that can be combined to form a larger, more capable foundation model.

The paper emphasizes the importance of avoiding overfitting and test set contamination when evaluating merged models.

The authors highlight the generalizability of their model merging approach, showing its effectiveness across different tasks and domains.

Transcripts

00:03
Hello and welcome back to DataScienceCastnet. In today's video I thought we'd take a look at "Evolutionary Optimization of Model Merging Recipes" from the Sakana lab, a relatively new AI research lab in Japan led by David Ha and friends, a lot of really interesting researchers, and they seem determined to go in a different direction to many of the big foundation model labs. They're all about swarm intelligence, biologically inspired methods, evolutionary algorithms, artificial life, lots of exciting topics. This is their first project; I think it's only been a couple of months since they raised their first round of seed funding, so it's impressive to have a paper out, and this one dives into the idea of model merging. So in today's video we'll go through the paper, but we'll also use it as an excuse to look at what model merging is, what some of the existing techniques are and how this paper is different, and we'll dive into some of the actual evolutionary algorithms used, just to get a feel for the space as a whole. Now, I myself am a bit of a skeptic when it comes to model merging, and we'll talk about why that is too, but for now let's look at the paper and use it as we go to launch into these other topics.

01:19
Okay, starting at the abstract: "we present a novel application of evolutionary algorithms to automate the creation of powerful foundation models." They say model merging has emerged as a promising approach for LLM development, that is, taking existing models and combining them somehow, but at the moment it relies on human intuition and domain knowledge; it's very arcane. I think in the introduction they call it out as being considered by many a form of black art or alchemy. So they want to take an evolutionary approach that gets past this reliance on human intuition and instead have something more automatic, more generic and useful, without the black alchemy.

02:05
They're going to talk about their approach, and they say they work both in parameter space and data flow space, so we'll make sure to look at what those two options are. They say "we optimize beyond just the weights of the individual models", and this approach facilitates cross-domain merging. That's the big theme of this paper: we're not just taking two math models and smashing them together to get a slightly better math model. They're going to combine a Japanese model with a model that has math reasoning capabilities, so two separate domains, and then extend this even further to create a culturally aware Japanese visual language model, combining one model that understands images with one model that's trained on a lot of Japanese data, to get a model that understands both of those domains. So cross-domain merging is very much the focus here: can we take multiple models with different talents, combine them, and get something that's greater than the sum of the parts, or that at least combines those talents?

03:00
Cool. They say this gives some new state-of-the-art models, but it's also a new paradigm, and they're very excited about the idea of having many different models with different capabilities and skills, and then being able to merge and combine them with this evolutionary approach.

03:18
Okay, so what is model merging? The citations here are for the recently released MergeKit. This is something that has fairly recently become popular in the LLM community, combining, say, two variants of an existing model, but it's been around for a while in the diffusion and image generation fields. With Stable Diffusion, and they do talk about this in the paper, you've seen a lot of people combining, say, a model trained on some specific style with a fine-tune or a low-rank adapter (a LoRA) for some other character or concept, and they smoosh the two together to get something that understands both. So you've had these UIs, these interfaces, that let people combine models in different ways with different weightings: maybe I want 0.9 of this base model but 0.1 of this model that's better at, I don't know, anime cat ears, or whatever the really specific subject is that the person wants. And a lot of the most popular Stable Diffusion based models for a while have been these big merges, where someone takes one trained really well on photorealistic images, another trained really well on fantasy images, maybe another trained really well on human anatomy, and mashes them all together. So this has been done for some time.

04:45
Going back even further, there's the model soups work, a paper back in the image classification days, which said that as long as you're starting from the same initialization, or maybe something that's been trained a little bit, you can do multiple different training runs and just linearly average all of the weights, and you get a better model. This was drawing on the intuition around ensembling and things like that, but there was a lot of debate at the time, I remember this: what is going on, how does this even work? A lot of people argued that each individual run mostly fits the true distribution of the data but has its own weird spikes and overfitting, and by averaging them all together we're flattening the loss landscape, which is some sort of theoretical improvement. So there's a lot of work around whether we can use this to ensemble, and whether it's efficient for training.

kind of like separate to the more recent

play05:40

oh you know I want to like have explicit

play05:43

outcomes that I'm looking for I take one

play05:44

model that's good at pencil drawing and

play05:46

one model that's good at the celebrity

play05:48

and I specifically combine them to get

play05:49

good pencil drawings with that celebrity

play05:51

that's a more recent

play05:53

Trend um okay so that's the that's the

play05:56

setting the scene right we try to Mush

play05:58

these models together somehow

play06:00

um and so we'll talk about some of the

play06:02

approaches but then we should also look

play06:03

at what this paper is contributing which

play06:05

is to say um we're going to do this in

play06:07

an automated way rather than just uh

play06:10

hoping for the best or having to

play06:11

understand like maybe what the models

play06:13

are each good at and hoping that the

play06:15

combination is good um cross domain

play06:17

merging so not just very similar things

play06:19

but transferring very disperate sets of

play06:22

skills um they're going to result in set

play06:25

of the out performance they say and

play06:26

we'll check that that's indeed the case

play06:29

um generaliz ability and efficiency so

play06:30

we're not spending too much time when we

play06:32

could actually just be you know maybe

play06:33

training a better model know this is

play06:35

going to be faster somehow um and yeah

play06:38

they're going to at the end of it have

play06:39

this cool culturally aware VM that's um

play06:42

better than any of the existing

play06:44

ones um okay so I guess we should talk

06:47
Okay, so I guess we should talk about how the model merging actually happens, and look into some of the background there. They mention that linear or spherical linear interpolation, literally just taking a weighted sum of the weights, has been a popular approach, but for language models there are a few more recent works, so we can take a very brief look at each of these.

07:10
This task arithmetic paper from last year was an early one. The idea is pretty simple, and the intuition is: we have a base model, then we've done a little bit of fine-tuning for some specific task. If we look at the difference between the fine-tuned model and the base model, we get a delta, and this is a direction in weight space that corresponds to getting better at that particular task. So they say: we've got this task vector, it's a direction, a difference, and if I take my base model and add this task vector to the weights, I hopefully get something that's better at this task. And they say we can actually have multiple of these task vectors and combine them: I could say I want 0.7 times my math task vector, so I get a little bit better at maths, and 0.9 times my science question answering vector, and I combine those two and get something that's hopefully good at science and math. They explore different combinations there. Okay, so that was an early one; again, this is very close to plain linear interpolation, linear combinations of weights.

play08:24

attempts at improving that and so one is

play08:27

this um ties mer in and so here they

play08:31

make the observation that okay it's all

play08:33

well and good to think of these Deltas

play08:36

um these task vectors or whatever you

play08:38

want to call them but when you have

play08:40

multiple different ones that interfere

play08:43

somehow that's where you get a

play08:44

performance drop and so they say yeah

play08:48

existing methods often ignore this

play08:50

interference we're going to try and get

play08:52

around this um either by eliminating

play08:55

redundant parameter values or

play08:57

disagreements on the sign so their

play08:59

method said trim elect sign and merge

play09:01

this is going to do a few different

play09:02

things one if you have um some

play09:05

parameters that have only Changed by a

play09:06

very small amount um they're just going

play09:08

to not make those changes so the

play09:10

assumption is I have my base model I

play09:12

fine tune it a bit some parameters are

play09:14

going to change quite dramatically and

play09:15

these are going to be the ones that are

play09:16

relevant to whatever task I'm training

play09:18

on but then a lot of them might just

play09:20

move around a little bit just from

play09:21

random noise so those we probably don't

play09:23

care and we should probably just reset

play09:24

them to the base models value um then

play09:28

resolving sign conflicts okay now I'm

play09:29

trying to combine three different models

play09:31

and two of them drastically increase

play09:33

this parameter and one drastically

play09:35

decreases that parameter you know how do

play09:37

I deal with that um so that those what

play09:39

they call sign conflicts where the

play09:41

direction is different and then merging

play09:43

only parameters that are in alignment

play09:45

with that final agreed upon sign um yeah

play09:48

so just trying to reduce these clashes

play09:50

where you have different updates pulling

play09:51

in different directions how do we handle

play09:53

that um this ties merging is one

play09:55

approach to handling that and they find

play09:57

that this does better than some of the

play09:59

pre previous

play10:00

methods um then another work dare is um
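
Here is a rough sketch of the trim / elect-sign / merge steps described above, for a single weight tensor, using PyTorch. It is a simplified reading of TIES rather than the reference implementation; the 20% keep fraction is an arbitrary illustrative choice:

    import torch

    def ties_merge(task_vectors, keep_frac=0.2):
        # 1. Trim: keep only the largest-magnitude fraction of each delta,
        #    zeroing the small, probably-noisy updates.
        trimmed = []
        for tv in task_vectors:
            k = max(1, int(keep_frac * tv.numel()))
            threshold = tv.abs().flatten().kthvalue(tv.numel() - k + 1).values
            trimmed.append(torch.where(tv.abs() >= threshold, tv, torch.zeros_like(tv)))

        stacked = torch.stack(trimmed)

        # 2. Elect sign: per parameter, pick the direction with the larger total mass.
        sign = torch.sign(stacked.sum(dim=0))
        sign[sign == 0] = 1.0

        # 3. Merge: average only the entries whose sign agrees with the elected sign.
        agree = (torch.sign(stacked) == sign) & (stacked != 0)
        counts = agree.sum(dim=0).clamp(min=1)
        return (stacked * agree).sum(dim=0) / counts  # add this onto the base weights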

10:04
Then another work is DARE, and the reason I'm spending time on these is that these two together are what the paper we're looking at uses. The DARE paper is called something like "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch". It's a bit of a weird title, and to be honest a bit of a weird paper, but the observation is kind of interesting; the whole paper is a lot of words around this one key observation and technique. Basically, the technique is to randomly drop these delta parameters at some ratio and then rescale the remaining ones. Remember, we have this direction from the base weights: we've fine-tuned a model, maybe several, we look at the difference, and that delta tells us how to edit the weights. DARE says that if you only apply some fraction of those updates and zero out the rest with random dropout, you probably still get a lot of the benefit of that fine-tuning, and in fact they show you can drop quite a high percentage of the parameter updates before you start losing performance. This is somewhat counterintuitive, and if you go look at their tables you'll see that the improvement of their approach over simpler approaches is not huge; the numbers are always somewhat close together. But it is interesting to think about what this could be telling us, and to me part of what it hints at is that there really are these clashes: if I have two fine-tunes that each add some skill and I try to combine them, there might be interference and reduced performance because of it. If I drop out and keep only 10, 20, 30% of each set of updates, the overlap is lower and the chance of destructive interference is lower, so maybe that gives you a better result, not because it's actually better than a more intelligent combination, but just because you have fewer of these weird clashes and conflicts.

12:07
This almost ties back to the video I did on ZipLoRA (some of you may have seen it), where they're trying to combine LoRAs of diffusion models specifically. They also had the issue that some updates from two different LoRAs would interfere with each other; if both had the same update you shouldn't just naively combine them, you should have some way of detecting, scaling, or adjusting so they don't conflict. So I think the language model merging crowd could maybe benefit from some of the diffusion LoRA merging techniques, and vice versa, but this is all active research; I've spoken to, I think, some of the people on this team, and everyone's busy working on this and figuring out better and better ways to do it. Anyway, that's how exactly these different models get merged, and in this paper they use a combination: the DARE dropping of some of those updates, and then TIES merging to decide how the remaining ones are actually combined, rather than just linearly combining them: zero out any that are really small, rescale appropriately, check the sign and only merge where the sign agrees, that kind of thing.
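
The DARE step itself is tiny. A sketch, assuming delta is a tensor of fine-tuning updates (fine-tuned weights minus base weights) and drop_rate is the fraction of entries to zero out; the 0.9 default is just an example of the high drop rates the paper reports tolerating:

    import torch

    def dare(delta, drop_rate=0.9):
        # Randomly drop most of the delta entries...
        mask = (torch.rand_like(delta) > drop_rate).float()
        # ...and rescale the survivors so the expected update stays the same.
        return delta * mask / (1.0 - drop_rate)

In the pipeline described here, these sparsified deltas would then be combined with the TIES-style sign election above rather than simply summed.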

13:23
Okay, then there's one additional type of merging called Franken-merging, and this is different. Everything we've talked about until now has been: I have two models, or two layers, that are the same shape, and I'm somehow combining the weights from those two to give a new layer that's also that same shape. Franken-merging has been around for a little while in the language model community, and it's built on the intuition that each layer in these Transformer models mostly passes the data through untouched, and at best makes small updates to the hidden state. So if you take layers 1 through 12 of a model, the data would normally go into layers 13 through 20, say, but you could skip a few of those layers and jump straight to layer 17, and the data coming into that layer would look slightly different from what it's used to, but not that different. The difference from the start to the end of each layer in the middle of these Transformers is generally quite small; they're only occasionally making updates. The intuition is that only when something specific to that layer is triggered, some particular fact or pattern it has learned, does it make an update; otherwise it's almost doing a no-op, just passing the data on untouched or only a tiny bit adjusted. Franken-merging was built on that observation: what if we stacked multiple extra layers in there? So we take a model that was 30 layers and expand it to 50 layers: the first 10 layers untouched, then two copies of the 11th layer, two copies of the 12th layer, maybe the 13th and 14th layers come from a different fine-tune of that same base model, and all of these layers are shoved sequentially together to give you a deeper model. This is how you see 120 billion parameter models based on a 70 billion parameter model, or 10 billion parameter models based on 7 billion parameter models: people just duplicate layers or combine different variants of a given layer from different models, not by averaging the weights, just by stacking them sequentially. That's what they mean by Franken-merging, and we'll see that this ties into their distinction between the data flow space and the parameter space: parameter space covers all the other kinds of merging where you combine the weights; data flow space is stacking more layers and changing the order of those layers.
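
In code, a Franken-merge (a data-flow-space merge) is less about arithmetic on weights and more about a recipe for which layer of which model the hidden state visits next. A minimal sketch with made-up layer ranges, purely for illustration (real Transformer layers take extra arguments such as attention masks, which are omitted here):

    # Each entry is (source_model, layer_index); the hidden state is passed
    # through these layers in sequence. No weights are averaged; layers are
    # simply stacked in a new order, which is why the result can be deeper
    # than either parent model.
    recipe = (
        [("model_a", i) for i in range(0, 12)]      # layers 0-11 of model A
        + [("model_b", i) for i in range(8, 20)]    # layers 8-19 of model B
        + [("model_a", i) for i in range(12, 32)]   # back to model A
    )

    def forward(hidden, models, recipe):
        # models: dict mapping a model name to its list of layer modules
        for name, idx in recipe:
            hidden = models[name][idx](hidden)
        return hidden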

15:54
Cool, all right, that's a lot of background; we can finally get to the method. Their goal is to create a unified framework that can do both of these types of merging and give a resulting model that hopefully surpasses any individual in the collection: we want to combine multiple models to get something even better. They're going to apply evolutionary algorithms, which we'll talk about shortly, and they split the merging process into the two spaces: merging by combining parameters, and merging by changing the data flow.

16:28
The diagram here is a nice overview. We have two original models, both trained from the same base but with different fine-tunes on different tasks. The first merged model is the same shape as those two, and each layer is some combination, but the weighting differs: here it's mostly the first model, mostly blue; the second layer is more of a mix; the third is mostly red. The shape hasn't changed, it's just that the weights have been combined with one of these fancy merging techniques. The second merged model is also a combination of the two, but instead of averaging the weights, they've stacked some layers from one and then a layer from the second, so now we have more layers in total, but each individual layer is unchanged; only the data flow has changed, not the weights of any given layer. And there's no reason we can't combine these, so here they take the model that was a merge of the weights, plus an extra layer from one of the other models, so we've also changed the data flow. That's going to be their approach: doing bits of both, combining them, and seeing which performs best.

17:44
At this point they say they're enhancing TIES merging with DARE, so they're combining those two techniques we looked at, they're doing layerwise merging, and they're going to optimize this with an evolutionary algorithm, guided by some task-specific metric. So we should talk about what evolutionary computation is and what is going on in this paper, because this is different from the fine-tuning, gradient-based, differentiable updates and training you might be used to; it's a different approach, and it's one of the ways this lab is trying to be different from everyone else doing the same gradient-based things. So I'm going to switch to a different screen.

18:29
We're going to try to get an intuition specifically for the CMA-ES algorithm, which stands for Covariance Matrix Adaptation Evolution Strategy, something like that. The core idea is that we have multiple parameters we're trying to optimize. For example, we might have some scale here, which could be the relative weight of model A versus model B in our merge, and some different parameter, which might be the percentage of the weights we drop in the DARE part of the merging, and we can have many more of these continuous variables that we're searching over, different scalings for each layer and that kind of thing. This is the space we want to search, and every point in it corresponds to some output model: this one could be mostly model A versus model B with a low drop density, and I could also try this other merge over here. What this strategy does is initialize a whole population of candidates, and the way it does that is by having some distribution over this search space. So I'll draw the distribution here with a mean, and you can imagine the distribution spread around that mean. Our population is a set of samples that are more likely to be close to that mean but scattered around it, and each of these is a candidate that we evaluate. Now, some of these candidates might perform really well, say these three here get pretty good scores, and a few over here might get extremely terrible scores. Once we've evaluated the candidates, the next step is to select the good ones: these candidates here all did pretty well, they're my survivors. Then I update the distribution I'm using to search so that it's closer to the distribution of those survivors. Now I have a new mean; it's not exactly the mean of the survivors, there's some update step size so I don't go all the way, but I do have a new distribution, and that new distribution is closer to the part of parameter space that produced those hopeful candidates. Then I sample candidates from the new distribution, some do better than others, I pick the top-scoring ones, update the mean, and apply this again and again, for as many steps as we like, until we hit some stopping criterion or stop improving. The point of this algorithm is that it lets us search this space of continuous parameters, and importantly it lets us do so without anything having to be differentiable. We can't compute a gradient the way we would for the parameters of a neural network with a differentiable loss; instead we're just using random samples, and we're computing something almost like a pseudo-gradient, a direction in the space that might be useful, while continuing to sample lots of points randomly to get plenty of exploration. The evaluation of a candidate doesn't have to be differentiable at all: it could be "I fed it a bunch of multiple-choice questions and measured the accuracy". All that matters is that we can pick out the highest-performing candidates. So it's a pretty interesting algorithm, and this is more broadly what evolutionary algorithms are really good at: searching a really large space, which doesn't have to be continuous and can sometimes be discrete, where none of the outcomes have to be differentiable and no gradients flow; it's about seeing which candidates work and which don't in some population we generate, and then updating the parameters we're searching over to find the best combination.

the parameter space merging those

play22:47

parameters are referring to like the

play22:49

waiting and so on of how exactly we're

play22:52

combining these different models um yeah

play22:55

and we can look at now what is the

play22:56

equivalent for the data flow space

play23:00

um recent analysis implies that

play23:02

knowledge is stored distributedly in

play23:04

language models I thought this was a

play23:05

very interesting observation and a very

play23:07

interesting paper that it links to um so

play23:11

yeah I'll leave this mostly for as an

play23:13

exercise for the reader um but this is

play23:15

feeding into that idea I spoke about

play23:17

earlier of like why would stacking

play23:18

layers one above the other out of the

play23:20

order they originally trained why would

play23:22

that even work and the intuition is that

play23:24

yeah they're activated on like specific

play23:26

facts or specific patterns and only then

play23:29

are they making updates to the

play23:31

distribution of likely tokens um and a

play23:34

lot of the time the residual that's

play23:35

passed it's not going through you know

play23:38

there's some path that goes through the

play23:39

layer and there's some path that goes

play23:40

directly on and they're combined and the

play23:42

path that's being fed through the the

play23:44

feed forward or the attention head um

play23:47

that part is like a modification a Delta

play23:50

and those are sometimes small um and

play23:53

sometimes large and these um the the key

play23:56

intuition is that they sort of Stack um

play23:58

and if we could stack more layers that

play23:59

maybe had more general knowledge that

play24:01

might not be a terrible thing anyway um

play24:04

slight diversion but this is kind of

play24:06

giving some justification for why this

play24:08

data flow um tweaking might even make

play24:13

sense

24:14
We just talked about evolutionary algorithms being able to search the parameter space and try different combinations. That's great, but it's not a magic bullet, and one of the issues is that you can end up with a space that's just too vast to explore. That's the case if we consider this data flow merging technique, where we want to build a model with up to T layers: say I have two models with 32 layers each and I want to create a new Franken-merge with 40 layers. If I have N different models to choose from, lots of layers in each, and I can stack them in any order, the search space is vast; I could choose any layer from any model for any slot in the final model, just way too many options to explore. Even with a really good evolutionary search, that would be way too computationally intensive, or even impossible.

25:15
So the paper asks how we can reduce this search space, and they come up with a somewhat interesting but somewhat hacky approach. They fix the ordering: we take all the layers in sequential order, so we're never going to do something like layer seven, then layer six, then five, then four, then three; we assume they should probably go in roughly the order they were trained in. But we repeat them: layers 1 through 30 of model one, layers 1 through 30 of model two, layers 1 through 30 of model three, then layers 1 through 30 of model one again, and the same for two and three, for some number of repeats, maybe three. So we have all of the layers in this somewhat sequential order with repeats, and the only thing left to learn is whether to include any given layer or not. The indicator here works like this: if it's above one we include the layer, if it's below one we don't. So instead of all possible orderings, we just have, for this long list of sequential but repeated candidate layers, an include-or-not value for every index in the list. That reduces it to 2 to the power of T options, versus (N + 1) to the power of T, a much smaller space.

26:46
There's one extra nuance. If you just optimize this include-or-not array, they find it's not ideal; there are problems with jumping straight from, say, layer seven to layer twelve. They want to do more theoretical analysis of this, but for now they find that practically it helps to do some scaling as well: if I'm choosing layers from different parts of the model and putting them in order, I should probably scale the input to each layer by some weight, and these weights are also something we optimize. So instead of just the yes-or-no indicator array, we also have a W array that we optimize during the evolutionary search, and there are ways to make the space even smaller.

27:36
Okay, so that's the framing. The first space was a lot easier to visualize: we're just changing the weights with which we merge, and those are parameters we can search over with the evolutionary search. For the second there's some trickiness around making the search space manageable, so they have this particular ordering, the inclusion indicator, and the weighting, but these are still just parameters we can vary according to some distribution; we evaluate candidates by actually doing the merge and seeing how well it performs, then update the parameters we're searching over and try a new population.

of a new population um yeah okay so they

play28:21

took they say these are orthogonal

play28:22

approaches in other words they're both

play28:23

useful but we can combine them together

play28:25

um and so they're going to do first one

play28:28

and then the other um and then they also

play28:30

talk about um being able to apply this

play28:33

with multi-objective genetic algorithms

play28:36

this as far as I know they don't

play28:37

actually do much of in the paper um but

play28:40

if you're curious this here is an

play28:41

algorithm that lets you take multiple

play28:44

different objectives right so maybe I

play28:45

want to do good on math questions and

play28:47

science questions and Japanese uh

play28:49

cultural questions or something like

play28:50

that um if we are exploring the space of

play28:55

possible model mergers maybe some are

play28:56

better at one of those and some better

play28:58

than another we're really interested in

play29:00

like I want to know the models that are

play29:02

decent at all of them maybe some are

play29:04

more good at one and some are more good

play29:06

at another but I want that kind of that

play29:08

Pito Frontier right um and that's

play29:11

exactly what this kind of algorithm is

play29:12

good at is saying like Okay well here

play29:14

are some combinations that are worse

play29:16

than other combinations on all of those

play29:18

things and here are some that are kind

play29:20

of like as good as each other and maybe

play29:22

better in one metric or another so these

play29:24

are like the candidates that we'd rely

play29:26

care about and then we can choose those

play29:27

trade-offs of which skills do I

play29:29

particularly value um but yeah we can

play29:32

have this like multiobjective thing

play29:33

coming in anyway small side note cool

play29:37

that's a lot of background um thank you

play29:39

for bearing with me I think now we can

play29:40

get into the results so we've talked

play29:42

about how they're doing this they're

play29:43

doing this evolutionary search over

play29:44

these merging parameters um we should

play29:46

now answer the important question of

play29:48

does this actually work and so to try

play29:50

this they're going to set things up with

play29:52

a Japanese model um by the way all of

play29:55

these are based on the Mistral 7B model

play29:58

um so they have a Japanese llm and then

play30:00

they have two math llms neither of which

play30:03

is trained on Japanese they're all

play30:05

trained from the same base model um and

play30:07

then they're going to test these on

play30:09

Japanese math questions so they have a

play30:11

translation of a grade school math

play30:14

question data set um and they're going

play30:16

to use that to say can we get a model

play30:18

that's good at math in Japanese and

play30:21

they're going to evaluate this on those

play30:22

math

play30:23

problems um cool okay so they're doing

play30:25

that algorithm we spoke about they're

play30:27

having some initial parameter values so

play30:29

it's like a mean and a a sigma or

play30:32

variance or a standard deviation um some

play30:35

population size then they'll pick the

play30:36

best they'll update the initial

play30:39

parameters to have a new mean and a new

play30:40

Sigma then they'll sample more a new

play30:43

population from that new distribution

play30:45

and so on and so forth um they're

play30:48

evaluating their candidates on some

play30:50

training samples that are different from

play30:52

the test set so they're saying okay I've

play30:54

got some different math questions in

play30:55

Japanese um because I don't want to just

play30:57

train the test set I do a th trials they

play31:01

take the best one um as the final

play31:04

model and then similarly for the data

play31:07

flow stuff they um they line up all

play31:10

their candidate layers they have some on

play31:12

and some off and they do their

play31:13

optimization over that and they end up

play31:15

with some combination of layers from

play31:17

different input models um okay and the

31:20
Okay, and the key results. The general Japanese model was not so good at maths, the general maths models were not so good at Japanese, so neither does particularly well, but their merges all do fantastically. The merge in parameter space only, combining the three models (that's what "1 + 2 + 3" means: the three input models), gives an output model of the same size, because this is parameter space only, but the accuracy is a lot higher, because we've hopefully managed to take the Japanese skills plus the math skills and can now answer these Japanese math questions. Likewise for data flow: taking some layers of the math model and some layers of the Japanese model does work; we get something a bit better than any of the inputs. But what's even better is to take some layers from the merge built in weight space and some from the general Japanese model and smush them together by reordering, and that's where they get the highest performance. So you can see the two types of merging here, parameter space only and data flow space only, and then the combination of the two, doing one and then the other, gives the highest gains. The result is a slightly larger model, but it performs better, and in fact it beats a lot of the existing Japanese models very handily.

32:46
This is another dataset that they evaluate on, and it's the same story, which is a good sign that it's not just this one dataset; it extends to general Japanese abilities as well. So that's fantastic: we get a better model out that has both skills we wanted, just as you would have hoped, and they were able to do this without having to manually guess at the scaling figures; instead they could use their search approach.

approaches um okay so that was nice I

play33:17

like this figure it kind of shows if you

play33:18

look at which ones the math models get

play33:20

right um those tend to flow over into

play33:23

which ones the combined models get right

play33:26

so it says you know um this is kind of

play33:29

what we'd hope that we're not just

play33:31

magically getting some new abilities

play33:32

instead it's like the kinds of questions

play33:34

that some of the input models would get

play33:36

right are the kinds of questions that

play33:37

the merged models also get right and the

play33:40

kinds of questions that none of the

play33:41

input models can answer also none of the

play33:43

merged models really can

play33:45

answer um

play33:47

cool uh then okay um I guess we can look

play33:51

at yeah you can see that all three input

play33:53

models have some weight and some density

play33:56

um so they're all contributing um

play33:58

likewise in the data flow um they

play34:01

initialize things so that you get a lot

play34:02

of the early layers in order this seems

play34:05

to be very helpful um but then over the

play34:08

course of this uh search you end up with

play34:11

some new ordering um you can see it's

play34:14

still in this like sequential withd

play34:16

repeats setup um but different scalings

play34:20

the size of these dots is the um the

play34:22

weight Matrix this the scaling Factor um

play34:25

yeah so we end up with a stack of layers

play34:27

some from one one model some from the

play34:28

other model that's the color um and this

play34:31

is an ordering that seems to help and

play34:33

give the best

play34:35

results okay um jumping up even further

34:38
Okay, jumping up even further in difficulty: can we take a model that's been trained with a vision encoder extracting image features, projecting them into the embedding space of a language model, where the language model learns to interpret them? These are like non-word tokens, soft prompts; it learns to make sense of them and answer questions about them. That's a very different domain from general language modelling. They use the LLaVA model, which is exactly that: it learns to take these image embeddings, plus some text embeddings, and ask and answer questions about the image, caption it, and so on. Their question is: can we combine this image understanding that's been added to that model with the Japanese understanding of our Japanese model? This is a very tricky task; the Japanese model has never been trained on any image inputs. That's what they try to do. They have some question-answering datasets and they create a new one.

35:47
They apply their technique, same as before, doing parameter space and data flow space merging, and, shockingly and impressively, the result does better than either the base vision model or an existing Japanese vision language model that was trained specifically for this; they get better results than both, on both the existing dataset and the new dataset they create. You have to think about this for a little bit to appreciate it: this is a model merged from one model that doesn't really focus on Japanese but does focus on adding image understanding to the base model (Mistral wasn't trained on images, but this LLaVA fine-tune was), and another model that was fine-tuned only on Japanese text and Japanese culture, and we're able to combine them into something that can answer visual questions in Japanese with the appropriate cultural context. Very cool results, and it's nice to see that the way they do this is just to apply their technique: no fine-tuning, no tweaking, no trying different scales manually; we now have a technique that can robustly take two input models and figure out an optimal weighting, or at least a really good weighting, using this evolutionary search.

37:10
Yeah, so that's the core of this paper. In the discussion and conclusions you can see this is something they're very excited about: they have this idea of maybe a whole swarm of different models out there in the world, learning different things from different people, improving on different subtasks, and then their evolutionary techniques, or other techniques, being able to combine all of those individual small models into some larger foundation model that has all of these capabilities. So they have this really cool vision.

37:42
I do think this is a really nice paper and a really nice approach, especially compared with some of the existing work. If you go back to the MergeKit citation (I don't have the browser tab open, but we can talk about it briefly), what was happening is that people would look at the Hugging Face Open LLM Leaderboard, pick a couple of models from it, merge them with this easy one-click tool, and submit the result for evaluation, and it might get a slightly higher score on the leaderboard. This was rinsed and repeated to the point where you get some model which is a merge of two other models, each of which is a merge of two other models, which were merges of some base models and some initial models. So you have this whole lineage, this family tree, behind a model that's 0.1% better than some of the others, and it sits at the top of the leaderboard as the best 7B model ever. But there are several issues. One: you don't know whether any of those initial models had test set contamination; I know some of them definitely do, and everything that's a merge of a merge of a merge of one of those contaminated models inherits it, so you're not sure whether the performance is because the model is actually good at the task or because it happened to be trained on some of the test sets. And then you also have some overfitting, because these models are evaluated on the leaderboard, we pick combinations based on which ones do well, evaluate on the same leaderboard again, pick which ones do well again, and you end up with something that does really well on that leaderboard but doesn't necessarily translate to doing well on other tasks.

39:19
Whereas in the paper we've looked at here, they're very careful to say: we have our evaluation set that we use for the evolutionary search, which is separate from the test set, and then we also check whether this transfers to other similar domains, whether the model still has good knowledge across other Japanese tasks as well; basically, is this something somewhat general, versus just overfitting to a very small test set and calling that good? So I really enjoyed this paper. Congratulations to the Sakana team; I'm looking forward to seeing what else comes out of this lab. I hope you've enjoyed this deep dive into model merging, evolutionary algorithms, and a really fantastic paper. Thank you so much for watching.
