Paper deep dive: Evolutionary Optimization of Model Merging Recipes
Summary
TLDR: This video introduces how researchers at Sakana Lab used evolutionary algorithms to automate model merging. They combine multiple models specialized for different tasks to create a new foundation model, aiming in particular to improve performance on Japanese math problems. The evolutionary optimization process considers both parameter space and data flow space to achieve effective model merging. As a result, accuracy on Japanese math problems improved substantially, and combining with a vision model also improved the ability to answer questions about images.
Takeaways
- 🌟 Model merging combines multiple models specialized for different tasks to create a more powerful foundation model.
- 🔍 Sakana Lab, a Japanese AI lab, researches model merging using swarm intelligence and biologically inspired methods.
- 🧬 They apply evolutionary algorithms to automate model merging in both parameter space and data flow space.
- 🔗 The goal of model merging is to integrate knowledge from different domains and create models with new capabilities and skills.
- 🤖 Using evolutionary algorithms avoids the "black art" of manual tuning and finds good solutions in a more automated way.
- 📈 Results include improved accuracy on Japanese math problems and a successful combination of an image-understanding model with a Japanese language model.
- 🌐 Model merging is still a young field, and much more research is expected.
- 🚀 Sakana Lab's research offers an innovative approach to foundation model development.
- 📚 Evolutionary algorithms help explore large parameter spaces to find the best model combinations.
- 🎯 It is important to select models based on evaluation metrics and use evolutionary algorithms to search for optimal parameters.
- 🌈 Multi-objective genetic algorithms make it possible to optimize several different objectives simultaneously.
Q & A
What kind of research does Sakana Lab do?
-Sakana Lab is an AI research lab working on a range of interesting topics: swarm intelligence, biologically inspired methods, evolutionary algorithms, and artificial life.
What is model merging?
-Model merging creates a more powerful foundation model by combining existing models. It has been something of a black art relying on human intuition and domain knowledge, and this work tries to automate it with an evolutionary approach.
How are evolutionary algorithms applied?
-They are applied to model merging in both parameter space and data flow space. This makes it possible to combine models from different domains and to build a culturally aware vision-language model.
What are TIES merging and DARE?
-TIES merging resolves redundant parameter values and sign conflicts between updates. DARE randomly drops out fine-tuned parameter updates and rescales the remaining ones. Both reduce interference and give better results.
What is Franken-merging?
-Franken-merging combines layers from different models of the same shape by stacking them to form a new, deeper model. It is based on the intuition that each layer of a Transformer mostly passes its input through with only small changes.
What is the CMA strategy in evolutionary algorithms?
-CMA is an evolutionary algorithm for multivariate optimization. It samples candidates from a distribution over the parameter space, selects the best ones, and updates the distribution to move toward the optimum.
What is the challenge of merging in data flow space?
-The search space becomes extremely large and computationally intractable. To address this, the paper fixes the layer ordering, stacks repeated layers, and learns only which layers to include.
How were the merged models evaluated?
-Evaluation used test datasets separate from the training data. Evolutionary search was used to find the best weights and scaling factors, confirming that the merged models performed as expected.
What results did Sakana Lab's research achieve?
-The research produced a culturally aware Japanese vision-language model that can correctly answer Japanese math problems and questions about images, integrating knowledge from different domains into a more comprehensive AI model.
What is the significance of this research?
-It shows that evolutionary algorithms can automate model merging and integrate knowledge across domains to build more capable AI models. The technique also applies to general language and vision-language models, broadening the range of AI applications.
Outlines
🌟 Introduction to evolutionary optimization and model merging
This section introduces the concepts of evolutionary optimization and model merging. Sakana Lab, a Japanese AI research lab, develops new AI models using swarm intelligence and biologically inspired methods. Their goal is to create more powerful foundation models by combining multiple models, in an automated way that does not rely on human intuition or specialized knowledge.
🧠 Model merging techniques and the use of evolutionary algorithms
This section covers various model merging techniques and how evolutionary algorithms are used. Model merging creates new models by combining existing ones; compared with traditional approaches that rely on human intuition and knowledge, an automated approach is shown to be superior. Evolutionary algorithms automate optimization over weights and data flow space, making the creation of new models faster and more efficient.
🔍 Background and theory of model merging
This section digs into the background and theoretical rationale for model merging: how it works, why combining knowledge from different tasks and models yields better results, and the difference between merging in data flow space and parameter space. It provides the foundational knowledge needed to understand model merging.
🧬 Details of evolutionary algorithms and data flow optimization
This section explains the details of evolutionary algorithms and optimization in data flow space. Evolutionary algorithms are used to explore large parameter spaces and find the best model combinations; data flow optimization determines how model layers are combined. Concrete examples show how these methods work in practice.
📈 Results and evaluation of model merging
This section describes the results and their evaluation. The Sakana Lab researchers merged models with evolutionary algorithms and evaluated them on separate datasets, showing that the new models achieve high accuracy as expected. The method also transfers to tasks in other domains, demonstrating its flexibility and broad applicability.
🌐 Applications and future outlook
The final section discusses applications and the future of model merging. The researchers hope the technique will deliver strong performance on real tasks, not just leaderboard results. Combining evolutionary algorithms with model merging enables a vision of assembling many small models into one powerful foundation model, further expanding AI capabilities.
Keywords
💡Evolutionary optimization
💡Model merging
💡Sakana Lab
💡Swarm intelligence
💡Biologically inspired methods
💡Artificial life
💡Cross-domain merging
💡Evolution strategies
💡CMA
💡Task vectors
💡DARE
💡Franken-merging
Highlights
The paper presents a novel application of evolutionary algorithms for model merging in AI.
Sakana Lab's approach aims to automate the creation of powerful foundation models through model merging.
The study focuses on cross-domain merging, combining models from different domains like a Japanese model with a math reasoning model.
The paper introduces a method to optimize beyond just the weights of individual models, facilitating cross-domain merging.
The authors discuss the concept of model merging as a form of black art or alchemy, highlighting the need for a more systematic approach.
The paper explores various techniques for model merging, including linear interpolation, task vector combination, and Franken merging.
The authors propose an automated approach to model merging, eliminating the need for human intuition and domain knowledge.
The study introduces a method for combining models in a way that reduces interference and improves performance.
The paper discusses the use of evolutionary algorithms, specifically the CMA-ES (Covariance Matrix Adaptation Evolution Strategy) for optimization.
The authors demonstrate that their method can create a culturally aware Japanese visual language model by merging models with different capabilities.
The study shows that the merged models can perform better than individual models on Japanese math questions.
The paper presents a technique that can combine a vision model with a Japanese model to answer visual questions in Japanese.
The authors propose a unified framework that can perform both parameter space and data flow space merging.
The study suggests the potential for a swarm of specialized models that can be combined to form a larger, more capable foundation model.
The paper emphasizes the importance of avoiding overfitting and test set contamination when evaluating merged models.
The authors highlight the generalizability of their model merging approach, showing its effectiveness across different tasks and domains.
Transcripts
hello and welcome back to Data Science
Castnet in today's video I thought we'd
take a look at evolutionary optimization
of model merging recipes from the Sakana
lab a relatively new AI research lab in
Japan led by David Ha and friends um a
lot of really interesting researchers
and they seem determined to go in a
different direction to many of the big
foundation model Labs um and so yeah
they're all about swarm intelligence and
uh biologically inspired things
evolutionary algorithms artificial life
lots of exciting topics and so this is
the first project I think it's only been
a couple of months since they raised um
their first round of seed funding so
impressive to have a paper out um and
this one is just diving into this idea
of model merging so in today's video
what we'll do is we'll go through this
paper but we'll also use it as an excuse
to look at what is model merging what
are some of the existing techniques that
people do how is this paper different
we'll dive into some of the actual
evolutionary algorithms used um yeah
just use it to get a feel for the space
as a whole now I myself am a bit of a
skeptic when it comes to model merging
and we'll talk about why that is too um
but for now let's look at the paper and
we'll use this um just as we're going
through to launch into these other
topics okay so starting at the abstract
we present a novel application of
evolutionary algorithms to automate the
creation of powerful Foundation models
so they say model merging has emerged as
a promising approach for llm development
right this is taking existing models and
combining them somehow um but this at
the moment relies on like human
intuition and domain knowledge it's very
Arcane um I think in the introduction
they call it out uh it's considered by
many to be a form of black art or
Alchemy right so this is this somewhat
Arcane New Field um and so they want to
come take an evolutionary approach that
gets over this requirement for human
intuition and instead
have something that's more automatic and
more um generic and useful without
having to have this black
Alchemy um so they're going to talk
about this approach that they have they
say they're going to do things both in
parameter space and data flow space so
we'll make sure to look at what those
two um options are um they say we
optimizing Beyond just the weights of
the individual models and this approach
facilitates cross domain merging and so
that's the big theme of this paper we're
not just taking two math models and
smashing them together to get a slightly
better math model they're going to
combine a Japanese model with a math
reasoning capabilities so two separate
domains and then they're also going to
extend this even further to create a
culturally aware Japanese visual
language model by combining one model
that understands images and one model
that's trained on a lot of Japanese data
and combining them um to get a model
that understands both of those domains
so very much uh cross domain merging is
the is the focus here can we use
multiple models with different talents
to combine together and get something
that's uh greater than the sum of the
parts or that at least combines those
talents
um cool so they say this gives some new
state of the art models but also it's a new
paradigm and they're very excited about
this idea of having many many different
models that have different uh
capabilities and skills and then being
able to merge and combine them with this
evolutionary
approach okay so what is model merging
um these citations here are for the
recently released mergekit um this is
been something that's fairly recently
become popular in the llm community um
combining
say two variants of an existing model
together um but it's something that has
been around for a while in the uh
diffusion and image generation fields right so
with stable diffusion and they they do
talk about this in the paper um you've
seen a lot of people combining oh this
maybe this one's uh some model trained
on some specific style and then they
have like a fine tune or a low rank
adapter like a LoRA for some other
character or concept and they smoosh the
two together to get something that
understands both and so you've had these
um these uis these interfaces that let
people combine these models in different
ways and you can have different
weightings right maybe I want 0.9 of
this base model but 0.1 of this model
that's better at I don't know anime cat
ears or whatever the really specific
subject that the person wants um and so
a lot of the most popular
um stable diffusion based models for a
while have been these big mergers where
someone takes one that's trained really
well on Photo realistic images and
another that's trained really well on
really good fantasy images maybe another
that's trained really well on I don't
know human anatomy and they they mash
them all together um so this has been
something that's been done for some time
and then going back even further this
model soup workor um this was a paper
back in the image classification days
that said as long as you're starting
from the same initialization or maybe
something that's been trained a little
bit um then you do multiple different
training runs and you just average the
weights just linearly average all of the
weights and you get a better model um so
this was like you know drawing from the
intuition around ensembling and things
like that but there was a lot of like
debate at the time I remember this what
is going on how does this even work a
lot of people saying oh you know you've
got to think about each individual one
mostly sort of fits the true
distribution of the data but has these
weird spikes and overfitting by
averaging them all together we're like
flattening the loss landscape and this
is some sort of theoretical Improvement
um yeah so there's a lot of work around
like can we use this to Ensemble is this
efficient for training um but this is
kind of like separate to the more recent
oh you know I want to like have explicit
outcomes that I'm looking for I take one
model that's good at pencil drawing and
one model that's good at the celebrity
and I specifically combine them to get
good pencil drawings with that celebrity
that's a more recent
Trend um okay so that's the that's the
setting the scene right we try to Mush
these models together somehow
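The model-soups style merge described above is just an elementwise (optionally weighted) average of checkpoints that share an architecture. A minimal sketch, with toy state dicts standing in for real checkpoints (names and shapes here are purely illustrative):

```python
import numpy as np

def soup(models, weights=None):
    """Model-soups style merge: elementwise (optionally weighted)
    average of checkpoints that share an architecture.
    `models` is a list of state dicts: {param_name: np.ndarray}."""
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    return {name: sum(w * m[name] for w, m in zip(weights, models))
            for name in models[0]}

# Two toy "checkpoints" with identical shapes
a = {"layer.w": np.array([1.0, 2.0]), "layer.b": np.array([0.0])}
b = {"layer.w": np.array([3.0, 4.0]), "layer.b": np.array([2.0])}

merged = soup([a, b])                      # plain average
skewed = soup([a, b], weights=[0.9, 0.1])  # mostly a, a little of b
# merged["layer.w"] == [2.0, 3.0]; skewed["layer.w"] == [1.2, 2.2]
```

The 0.9/0.1 weighting mirrors the stable-diffusion-style mixing described just above; real merging tools do the same thing per tensor over full checkpoints.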
um and so we'll talk about some of the
approaches but then we should also look
at what this paper is contributing which
is to say um we're going to do this in
an automated way rather than just uh
hoping for the best or having to
understand like maybe what the models
are each good at and hoping that the
combination is good um cross domain
merging so not just very similar things
but transferring very disparate sets of
skills um they're going to result in state
of the art performance they say and
we'll check that that's indeed the case
um generalizability and efficiency so
we're not spending too much time when we
could actually just be you know maybe
training a better model know this is
going to be faster somehow um and yeah
they're going to at the end of it have
this cool culturally aware VLM that's um
better than any of the existing
ones um okay so I guess we should talk
about how the model merging happens and
look into some of the backgrounds there
um so they mention that um linear or
spherical linear interpolation right
literally just taking some weight at
some of the weights
um has been a popular approach um but
then for language models there's a few
more recent works so we can take a very
brief look at each of these
um this task arithmetic from last
year was an early one um idea here is
pretty simple and the intuition is we
have a base model then we've done a
little bit of fine tuning for some
specific task if we look at the
difference between the fine-tuned model
and the base model we get this like this
Delta and this is going to be a
direction in weight space that shows us
getting better at this particular task
right so they say oh we've got this task
Vector this is a direction this is a
difference so if I take my base model
and I add this task Vector to the
weights I hopefully get something that's
better at this task and they say oh we
can actually um have multiple of these
task vectors and we can combine them
right so I could say I want .7 times my
math task Vector so I get a little bit
better at maths and 0.9 times my
science question answering vector and I
combine those two together and I get
something that's hopefully good at
science and math um and they explore
doing different um combinations there um
okay so that was an early one again this
is very close to just like the linear
interpolation slash linear combinations of
Weights um so that was an earlier work
then there's others that have um
attempts at improving that and so one is
this um TIES merging and so here they
make the observation that okay it's all
well and good to think of these Deltas
um these task vectors or whatever you
want to call them but when you have
multiple different ones that interfere
somehow that's where you get a
performance drop and so they say yeah
existing methods often ignore this
interference we're going to try and get
around this um either by eliminating
redundant parameter values or
disagreements on the sign so their
method is trim elect sign and merge
this is going to do a few different
things one if you have um some
parameters that have only Changed by a
very small amount um they're just going
to not make those changes so the
assumption is I have my base model I
fine tune it a bit some parameters are
going to change quite dramatically and
these are going to be the ones that are
relevant to whatever task I'm training
on but then a lot of them might just
move around a little bit just from
random noise so those we probably don't
care and we should probably just reset
them to the base models value um then
resolving sign conflicts okay now I'm
trying to combine three different models
and two of them drastically increase
this parameter and one drastically
decreases that parameter you know how do
I deal with that um so that those what
they call sign conflicts where the
direction is different and then merging
only parameters that are in alignment
with that final agreed upon sign um yeah
so just trying to reduce these clashes
where you have different updates pulling
in different directions how do we handle
that um this ties merging is one
approach to handling that and they find
that this does better than some of the
previous
methods um then another work DARE is um
and the reason I'm I'm spending time on
these is these two together are what
this paper we're looking at uses um so
dare I think is called something like
language models are Super Mario um
absorbing abilities from homologous
models as a free lunch it's a bit of a
weird title it's a bit of a weird paper
to be honest um but what they observe
the observation is kind of interesting
the whole paper is a lot of words around
this one key observation and this
technique and basically the technique is
to randomly drop these Delta parameters
with a ratio and then to rescale the
remaining ones so remember I said we
have this direction from the base
weights that are update we we've
fine-tuned this model and we have maybe
multiple of these models we look at the
difference and that Delta is like oh
cool this tells us how to edit these
weights um this is saying oh if you only
apply some fraction of those updates and
you zero out the rest with some like
random Dropout you probably still get a
lot of the benefit of that fine tuning
and in fact they show that you can drop
um quite a high percentage of the
parameter updates before you start
losing performance so this is somewhat
counterintuitive um and later if you go
look at their tables and things you'll
see that the improvement from their
approach versus more you know simple
approaches is not huge right the numbers
are always somewhat close together um
but it is interesting to think like okay
what could this be telling us and to me
part of what this hints is
that there are these clashes so if I
have two fine tunes that maybe both add
some skill and I'm trying to combine
them together there might be
interference there might be like um
reduced performance because of that
interference and so if I'm dropping out
and only keeping 10 20 30% of each of
those sets of updates the overlap is
going to be lower and the chance of
those like destructive interferences is
going to be lower so maybe that gives
you like a better result not because
it's actually better than if you could
more intelligently combine them but just
because you have less of these weird
clashes and conflicts um and so yeah
this almost talks back to um some of you
may have seen the video I did on zip
Lowa where they're trying to combine um
specifically lowers of diffusion models
but they also had this issue of like oh
some updates would be um interfering
with each other from two different
lowers if they both had the same update
you shouldn't just like naively combine
them you should have some way of
detecting um or scaling or adjusting so
that they didn't have those conflicts so
I think the language model merging crowd
um could maybe benefit from some of the
diffusion lower emerging techniques and
vice versa um but this is all active
research I know um yeah I've spoken to I
think was some some of the people on
this team um yeah and everyone's busy
working on this and and figuring out
better and better ways to do this um
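To make the two techniques concrete, here is a hedged sketch of DARE-style drop-and-rescale followed by a simplified TIES-style trim/elect-sign/merge step, applied to one parameter tensor. This is only an illustration of the ideas as described above, not a reference implementation; real tools handle per-tensor densities and scaling differently:

```python
import numpy as np

rng = np.random.default_rng(0)

def dare(delta, drop_p):
    """DARE: randomly zero out a fraction of a fine-tune's delta
    (update direction) and rescale the survivors to compensate."""
    mask = rng.random(delta.shape) >= drop_p
    return delta * mask / (1.0 - drop_p)

def ties_merge(base, deltas, trim_frac=1.0, scale=1.0):
    """Simplified TIES-style merge of several deltas for one tensor:
    trim (keep only the largest-magnitude fraction of each delta),
    elect a sign per coordinate, then average only the updates that
    agree with the elected sign."""
    trimmed = []
    for d in deltas:
        thresh = np.quantile(np.abs(d), 1.0 - trim_frac)
        trimmed.append(np.where(np.abs(d) >= thresh, d, 0.0))
    stacked = np.stack(trimmed)
    elected = np.sign(stacked.sum(axis=0))          # majority-mass sign
    agree = (np.sign(stacked) == elected) & (stacked != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    merged_delta = np.where(agree, stacked, 0.0).sum(axis=0) / counts
    return base + scale * merged_delta

# DARE on a toy delta: survivors are rescaled by 1/(1-drop_p)
kept = dare(np.array([1.0, -2.0, 3.0, -4.0]), drop_p=0.5)

# TIES on two conflicting toy deltas: the second coordinate disagrees
# in sign across the fine-tunes, so it is dropped from the merge
out = ties_merge(np.zeros(2), [np.array([1.0, -1.0]),
                               np.array([1.0,  1.0])])
# out == [1.0, 0.0]
```

In the paper's setup, the evolutionary search would tune things like `drop_p` and `scale` per layer rather than fixing them by hand.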
anyway so that's how exactly we're
emerging these different models and in
this paper they're going to to use a
combination of this dropping out some of
those updates and then using the ties
merging to say for the remaining ones
how do we actually combine them rather
than just linearly combining them we're
going to do this oh check you know zero
out any that are really small um rescale
appropriately check the sign and only
adjust if the sign agrees that kind of
thing um okay then there's one
additional type of merging called
Franken merging and this is different so
everything we've talked up to until now
has been I have two models that are the
same shape or two layers that are the
same shape and I'm somehow combining the
weights from those two to give a new
layer that's also that same shape
Franken
merging has been around for a little
while in the language model community
and it's built on this intuition that um
each layer in these Transformer models
mostly passes through the data untouched
and at best it makes small updates to
that hidden State and so if you take
layers 1 through 12 of a model then the
data would normally go into 13 through
20 say um but then you could skip a few
of those layers and jump straight to
layer 17 and the data coming into that
layer would look slightly different to
what it's used to but not that different
right the the difference from the start
of each layer until the end of each
layer in the middle of these
Transformers is generally quite small
they're only occasionally making updates
and the intuition here is that it's only
when something specific to that language
model head and that layer that it's
learned some particular fact or some
particular pattern then it's making an
update but otherwise it's kind of almost
doing like a a no op right it's just
passing on the data untouched or only
tiny a tiny bit adjusted um so Franken
merging was built on that
observation it's like oh well what if we
then like stacked multiple extra layers
in there right so we take a model that
was 30 layers we expand it to be 50
layers so it has the first 10 layers
untouched then has has two copies of the
11th layer two copies of the 12th layer
maybe um the 13th and 14th layers are
from a different fine tune of that same
base model and all of these different
layers are shoved sequentially together
to give you a deeper model um and this
is how you see like 128 or 120 billion
parameter models based on a 70 billion
parameter model or 10 billion parameter
models based on 7 billion parameter
models people just duplicating layers or
combining different variants of a given
layer from different models but not by
averaging the weights just by stacking
them
sequentially so that's what they mean by
Franken merging and we'll see that this
ties into their um when they talk about
the data flow Space versus the parameter
space parameter space is going to be all
the other kinds of merging where you're
combining the weights data flow space is
going to be uh stacking more layers and
changing the order of those
layers cool all right that's a lot of
background we can finally get to the
method and so their goal is to create a
unified framework that can do both of these
types of mergers and to give us a
resulting model that hopefully surpasses
any individual in the collection right
so we want to combine multiple models
together to get something that's even
better um so they're going to apply
evolutionary algorithms and we can talk
about that shortly um and they're going
to split this merging process into these
two different spaces um the merging by
combining parameters and the merging by
changing the data flow so this diagram
here is a nice overview here we
have two models these are our original
models both trained from the same base
but with different fine tunes on some
different
task um and so you can see here this
model is the same shape as these two um
and each layer is some combination but
the weighting is different so here it's
mostly this first model it's mostly blue
the second one is sort of a mix the
third one's mostly red um but you can
see the shape hasn't changed it's just
that the weights have been combined in
one of these fancy
um merging
techniques the second model here is also
a combination of these two but instead
of averaging the weights all they've
done is stacked some of the layers from
one and then a layer from the second
right so now we have more layers in
total um but each of the layers
individually hasn't changed so this has
just changed in the data flow it hasn't
changed the weights of Any Given layer
um but there's no reason we can't
combine these and so here they combine
this model which was a merge of the
weights plus an extra layer from one of
the other models so now we've also
changed the data flow and so yeah that's
going to be their approach is going to
be doing bits of both and combining them
together and seeing which performs
best um now yeah at this point we can
look at um so they're going to say we're
enhancing TIES merging with DARE so
they're combining those two techniques
we looked at um they're doing layerwise
merging um and they're going to optimize
this with an evolutionary algorithm and
guided by some task specific metric so
we should talk about now what is
evolutionary computation what what is
going on in this paper um because this
is different to the kind of fine-tuning
training gradient-based um
differentiable uh updates and training
that you might be used to this is going
to be some different approach and this
is kind of one of the ways in which this
lab is trying to be different to
everyone else doing the same gradient
based approaches so I'm going to switch
to a different screen
um and we're going to try and get an
intuition specifically for this CMA
algorithm so this is
um what does this stand for covariance
matrix adaptation evolution strategy
something like that um but the core idea
here is that we have multiple parameters
that we're trying to optimize so for
example um oops you can consider like
okay we have some scale here and this
could be the um the relative weight of
model A versus model B in our merge and
we have some different parameter and
this might be um maybe this is like the
um percentage of the weights that we
drop in the Dare part of the merging
right and we can have many more of these
continuous variables that we're
searching over so there's different
scalings for each layer and that kind of
thing so this is the space in which we
want to search and every Point here
corresponds to some output model right
so this one could be mostly Model A
versus model model B um and this is low
density versus high density in terms of
like how much am I dropping out um I
could also try this merge here um so
what this strategy does is says okay
we're going to initialize a whole
population of candidates and the way we
going to do this is we're going to have
some distribution in this search space
right so I'll draw the distribution here
with a little mean and you can imagine
the distribution being like spread
around that mean um and so our
population is going to be samples that
are more likely to be close to that mean
but scattered around um and each of
these is going to be some candidate that
we evaluate right now you can imagine
that some of these candidates might
perform really well like these three
here might get um pretty good scores and
a few over here might get extremely
terrible scores so once we've picked a
few candidates and we've tested them out
um what we're going to do next in this
algorithm is to say let's select the
good ones right so these candidates here
these all did pretty
well this is going to be like my
survivors then I'm going to update the
distribution that I'm using to search so
remember we had this distribution
centered around the mean I'm going to
update this such that it's closer to the
distribution of those
survivors um and so now I've got a new
mean right it's not the mean those
survivors is some update step size so
I'm not going to all the way but I do
have a new distribution and that new
distribution is closer to the part of
this parameter space that produced those
hopeful candidates so now I'll sample
some candidates from that uh
distribution again some of them will do
better than others I'll pick the top
scoring candidates I'll update the mean
and we apply this again and again um as
many steps as we like until we find some
stopping criteria or we we stop
improving um but the idea of this
algorithm is that it lets us search this
space these continuous parameters um but
importantly it lets us do it without any
of this having to be differentiable
right and so we're not able to find a
gradient like we would with the
parameters of a neural network based on
some like uh differentiable loss instead
we're just using these random samples
and then we're kind of like Computing
almost a pseudo gradient right or some
like Direction in the space that might
be useful but we're continuing to sample
lots of points randomly to get lots of
exploration and each of these ones the
evaluation of this candidate here this
doesn't have to be differentiable this
could be like oh I then fed it a bunch
of multiple choice questions and I
looked at the accuracy all that matters
is that we can pick out what are the
highest performing candidates um yeah so
it's a pretty interesting algorithm um
and this is more broadly what
evolutionary algorithms are really good
at it's like searching a really large
space um doesn't have to be continuous
can sometimes be like discrete um none
of the uh the outcomes have to be
differentiable there's no gradients
flowing it's more like seeing which ones
work and which ones don't in some like
population that we generate and then
somehow um updating
our parameters that we're searching over
to try and find the best combination of
parameters and so in this case um for
the parameter space merging those
parameters are referring to like the
weighting and so on of how exactly we're
combining these different models um yeah
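The sample/select/update loop just described can be sketched as a toy evolution strategy. Note this only adapts the mean of the search distribution; real CMA-ES also adapts a full covariance matrix and step size, and the quadratic fitness below is just a stand-in for "build the merge with these parameters, then score it on a benchmark":

```python
import numpy as np

rng = np.random.default_rng(42)

def es_search(fitness, dim, steps=100, pop=32, elite=8,
              sigma=0.3, lr=0.5, decay=0.97):
    """Toy sample/select/update evolution strategy: draw a population
    around the current mean, keep the top `elite` candidates, and move
    the mean toward them. No gradients are needed - the fitness can be
    any black-box score such as multiple-choice accuracy."""
    mean = np.zeros(dim)
    for _ in range(steps):
        cands = mean + sigma * rng.standard_normal((pop, dim))
        scores = np.array([fitness(c) for c in cands])
        survivors = cands[np.argsort(scores)[-elite:]]  # higher = better
        mean = mean + lr * (survivors.mean(axis=0) - mean)
        sigma *= decay            # simple step-size schedule
    return mean

# Stand-in fitness: in the paper this would be "merge the models with
# these mixing weights / drop densities, then score on held-out
# Japanese math questions". Here we just recover a pretend optimum.
target = np.array([0.7, 0.1])    # e.g. a mix ratio and a drop rate
best = es_search(lambda x: -np.sum((x - target) ** 2), dim=2)
```

After 100 steps the mean lands close to `target`, even though the fitness was only ever queried as a black box.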
and we can look at now what is the
equivalent for the data flow space
um recent analysis implies that
knowledge is stored distributedly in
language models I thought this was a
very interesting observation and a very
interesting paper that it links to um so
yeah I'll leave this mostly for as an
exercise for the reader um but this is
feeding into that idea I spoke about
earlier of like why would stacking
layers one above the other out of the
order they originally trained why would
that even work and the intuition is that
yeah they're activated on like specific
facts or specific patterns and only then
are they making updates to the
distribution of likely tokens um and a
lot of the time the residual that's
passed it's not going through you know
there's some path that goes through the
layer and there's some path that goes
directly on and they're combined and the
path that's being fed through the the
feed forward or the attention head um
that part is like a modification a Delta
and those are sometimes small um and
sometimes large and these um the the key
intuition is that they sort of Stack um
and if we could stack more layers that
maybe had more general knowledge that
might not be a terrible thing anyway um
slight diversion but this is kind of
giving some justification for why this
data flow um tweaking might even make
sense
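A toy numerical illustration of that residual-stream intuition (nothing here is from the paper; the point is just that when each block's delta to the residual is small, skipping a couple of layers perturbs the output only mildly):

```python
import numpy as np

rng = np.random.default_rng(0)

def block(h, W):
    """Toy 'transformer block': residual stream plus a bounded update."""
    return h + np.tanh(h @ W)

# A "model" here is just 8 layers of small random weights, so each
# layer's delta to the residual stream is tiny - mimicking the claim
# that middle layers mostly pass data through with small edits.
layers = [0.005 * rng.standard_normal((4, 4)) for _ in range(8)]
h0 = rng.standard_normal(4)

full = h0
for W in layers:
    full = block(full, W)

skipped = h0
for W in layers[:3] + layers[5:]:   # Franken-style: drop the 4th and 5th layers
    skipped = block(skipped, W)

drift = np.linalg.norm(full - skipped)  # stays small when deltas are small
```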
so we just talked about evolutionary
algorithms being able to search this
parameter space and try different
combinations that's great um but it's
not a Magic Bullet and one of the issues
is that you can sometimes end up with a
space that's just too vast to explore
and that's the case if we consider this
um data flow merging technique
where we want to get up to T layers
right so I have two models that have 32
layers each and I want to create a new
Franken merge that's got 40
layers if I've got n different models
that I'm choosing
from um and I could have lots of
different layers in each of those models
and I could stack them in any order the
search space is vast right I could
choose any layer from any model for any
layer in the final model just way too
many different options to
explore and
so this even if you had a really good
evolutionary search um would just be
like kind of way too computationally
intensive or even
impossible so to try and do that this
paper is saying how can we reduce this
search space down and they come up with a
a somewhat interesting but somewhat
hacky approach which is to say okay we
will have a fixed ordering where we'll
take
um all the layers in sequential order so
we're never going to do something where
we have like layer seven then layer six
then layer five then layer four then
layer three we kind of going to assume
that they should probably go in roughly
the order that they were added um but
we're going to repeat them so I'll have
layers 1 through 30 of of model one
layers 1 through3 of model 2 layers 1
through3 of model 3 then I'll do layers
1 through 30 of model 1 again
and same for two and three and then some
number of repeats so maybe you have
three repeats um and so we have all of
the layers in this kind of like somewhat
sequential order with
Then the only thing we learn is whether or not to include any given layer. There's an indicator per slot: if it's positive we include the layer, and if not we don't. So instead of considering all possible orderings, we just have, for this long list of sequential but repeated candidate layers, one value per index saying whether that layer should be included. We've reduced the search to 2^T options instead of (N+1)^T, a much smaller space.

There's one extra nuance. They find that if you just optimize this include-or-not array, it's not ideal: there are problems with jumping straight from, say, layer seven to layer twelve. They want to do more theoretical analysis of this, but for now they say that, practically, it helps to do some scaling as well. If I'm choosing layers from different parts of different models and putting them in order, I should probably scale the input to each layer by some weight, and those weights are also something we optimize. So instead of just the yes-or-no indicator array, we also have a W array of scaling factors that we optimize during the evolutionary search. And there are ways to make the space even smaller.
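As a rough sketch of what the evolved "recipe" looks like (my own toy decoding with made-up sizes; the paper's actual representation may differ in detail), the genome is an indicator array plus a scaling array laid over a fixed, repeated sequence of candidate layers:

```python
import numpy as np

n_models, n_layers, repeats = 2, 4, 3  # tiny sizes for illustration
T = n_models * n_layers * repeats      # total candidate layer slots

# Fixed candidate ordering: model 0 layers 0..L-1, model 1 layers 0..L-1,
# then the whole sequence repeated.
candidates = [(m, l) for _ in range(repeats)
              for m in range(n_models)
              for l in range(n_layers)]

rng = np.random.default_rng(0)
indicator = rng.normal(size=T)           # evolved; slot kept iff indicator > 0
scaling = rng.uniform(0.5, 1.5, size=T)  # evolved per-slot input scaling

recipe = [(m, l, w) for (m, l), ind, w in zip(candidates, indicator, scaling)
          if ind > 0]
# Each entry: (source model, layer index, scale applied to that layer's input).
for m, l, w in recipe[:5]:
    print(f"model {m} layer {l}  scale {w:.2f}")
```

The search then only varies `indicator` and `scaling`: with T slots, the inclusion choice alone gives 2^T configurations rather than every possible ordering of every layer.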
OK, so that's the framing. The first approach was a lot easier to visualize: we're just changing the weights with which we're merging, and those are parameters we can search over using this evolutionary search.
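A minimal sketch of that parameter-space merge (my own illustration with toy "checkpoints"; real methods like TIES-Merging and DARE add sign resolution and random dropping on top of this) is just a weighted average of corresponding tensors from models fine-tuned off the same base:

```python
import numpy as np

def merge_state_dicts(state_dicts, weights):
    """Weighted average of matching tensors across checkpoints."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize the mixing weights
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
    return merged

# Toy "models": dicts of numpy arrays standing in for checkpoints.
a = {"layer.w": np.ones((2, 2)), "layer.b": np.zeros(2)}
b = {"layer.w": 3 * np.ones((2, 2)), "layer.b": np.ones(2)}
merged = merge_state_dicts([a, b], weights=[0.25, 0.75])
print(merged["layer.w"][0, 0])  # → 2.5
```

The `weights` here are exactly the kind of knobs the evolutionary search tunes, instead of a human guessing them.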
The second approach involves some trickiness in making the search space manageable, so they have this particular ordering, the inclusion indicator, and then the weighting. But these are still just parameters that we can vary according to some distribution; we can then evaluate candidates by actually performing the merge and seeing how well it does, update the parameters we're searching over, and try a new population. They note that these are orthogonal approaches; in other words, both are useful, and we can combine them by doing first one and then the other. They also talk about applying this with multi-objective genetic algorithms.
As far as I know they don't actually do much of this in the paper, but if you're curious, this is a family of algorithms that lets you optimize multiple different objectives. Maybe I want to do well on math questions, science questions, and Japanese cultural questions. If we're exploring the space of possible model merges, some will be better at one of those and some at another, and what I'm really interested in is the models that are decent at all of them. Some may be stronger on one metric and some on another, but I want that Pareto frontier. That's exactly what this kind of algorithm is good at: it identifies the combinations that are worse than others on all of the objectives, and the ones that are roughly as good as each other but better on one metric or another. Those non-dominated candidates are the ones we really care about, and then we can choose the trade-off of which skills we particularly value. So we can have this multi-objective aspect coming in as well. Anyway, small side note.
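To make the Pareto-frontier idea concrete, here's a small sketch (my own example with made-up scores, not the paper's algorithm; multi-objective methods like NSGA-II maintain exactly this kind of non-dominated set):

```python
def pareto_front(scores):
    """Return indices of candidates not dominated by any other candidate.

    A candidate is dominated if some other candidate is at least as good
    on every objective and strictly better on at least one.
    """
    front = []
    for i, s in enumerate(scores):
        dominated = any(
            all(o >= v for o, v in zip(other, s))
            and any(o > v for o, v in zip(other, s))
            for j, other in enumerate(scores) if j != i
        )
        if not dominated:
            front.append(i)
    return front

# Hypothetical (math accuracy, Japanese accuracy) for four candidate merges.
scores = [(0.60, 0.40), (0.50, 0.55), (0.40, 0.30), (0.55, 0.50)]
print(pareto_front(scores))  # → [0, 1, 3]
```

Candidate 2 is beaten on every objective by candidate 0, so it drops out; the remaining three represent genuine trade-offs between the skills.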
Cool, that's a lot of background; thank you for bearing with me. I think now we can get into the results. We've talked about how they do this evolutionary search over the merging parameters, so we should now answer the important question: does this actually work? To test it, they set things up with a Japanese model. By the way, all of these are based on the Mistral 7B model. So they have a Japanese LLM and two math LLMs, neither of which is trained on Japanese, all trained from the same base model. Then they test on Japanese math questions: they have a translation of a grade-school math question dataset, and they use it to ask whether we can get a model that's good at math in Japanese, evaluating on those math problems.
So they run the algorithm we spoke about. They have some initial parameter values, a mean and a sigma (a variance, or standard deviation), and some population size. They sample a population, pick the best candidates, update the parameters to a new mean and a new sigma, sample a new population from that new distribution, and so on. They evaluate the candidates on training samples that are different from the test set; that is, they've got a separate set of Japanese math questions, because they don't want to train on the test set. They run a thousand trials and take the best one as the final model. Similarly, for the data-flow setup, they line up all the candidate layers, some on and some off, run the optimization over that, and end up with some combination of layers from the different input models.
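That sample-select-update loop can be sketched like this (a deliberately simplified Gaussian evolution strategy, not the actual CMA-ES the paper uses, with a toy fitness standing in for "accuracy of the merged model on held-out Japanese math questions"):

```python
import numpy as np

def toy_fitness(params):
    # Stand-in for evaluating a merge recipe; higher is better.
    target = np.array([0.6, 0.3, 0.1])      # pretend-optimal merge weights
    return -np.sum((params - target) ** 2)

rng = np.random.default_rng(0)
mean, sigma = np.zeros(3), 0.5
pop_size, elites, generations = 32, 8, 60

for _ in range(generations):
    # Sample a population around the current mean.
    population = mean + sigma * rng.normal(size=(pop_size, mean.size))
    # Keep the best candidates and move the distribution toward them.
    ranked = sorted(population, key=toy_fitness, reverse=True)
    best = np.array(ranked[:elites])
    mean = best.mean(axis=0)
    sigma = max(0.8 * sigma, 0.02)  # simple decay instead of a CMA update

print(np.round(mean, 2))
```

CMA-ES additionally adapts a full covariance matrix rather than a single decaying sigma, which matters in harder, correlated search spaces; the sample-evaluate-update skeleton is the same.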
And the key results: the general Japanese model was not so good at math, and the general math models were not so good at Japanese, so none of them does particularly well alone, but their merges all do fantastically. Merging in parameter space only, combining the three input models (that's what "1 + 2 + 3" means), you get an output model of the same size, because this is parameter space only, but the accuracy is a lot higher: we've managed to take the Japanese skills plus the math skills, and the merged model can now answer Japanese math questions. Likewise for data flow: taking some layers of the math model and some layers of the Japanese model does work; we get something a bit better than any of the inputs. But what's even better is to take some layers from the merge already combined in weight space, plus some from the general Japanese model, and stack them together by reordering; that's where they get the highest performance. So there are the two types of merging, parameter space only and data-flow space only, and combining the two by doing one and then the other gives the highest gains. The combined model is slightly larger, but it performs better, and in fact it beats a lot of existing Japanese models very handily. They also evaluate on another dataset with the same finding, which is a good sign that the result isn't specific to one dataset; it extends to general Japanese abilities as well. So that's fantastic: we get a better model out that has both skills we wanted, just as you would have hoped, and they were able to do this without manually guessing at the scaling figures; they could use their search approaches instead.
like this figure it kind of shows if you
look at which ones the math models get
right um those tend to flow over into
which ones the combined models get right
so it says you know um this is kind of
what we'd hope that we're not just
magically getting some new abilities
instead it's like the kinds of questions
that some of the input models would get
right are the kinds of questions that
the merged models also get right and the
kinds of questions that none of the
input models can answer also none of the
merged models really can
answer um
cool uh then okay um I guess we can look
at yeah you can see that all three input
models have some weight and some density
um so they're all contributing um
likewise in the data flow um they
initialize things so that you get a lot
of the early layers in order this seems
to be very helpful um but then over the
course of this uh search you end up with
some new ordering um you can see it's
still in this like sequential withd
repeats setup um but different scalings
the size of these dots is the um the
weight Matrix this the scaling Factor um
yeah so we end up with a stack of layers
some from one one model some from the
other model that's the color um and this
is an ordering that seems to help and
give the best
results okay um jumping up even further
OK, jumping up even further in difficulty: can we take a model where a vision encoder extracts image features, projects them into the embedding space of a language model, and the language model learns to interpret them? These are non-word tokens, like soft prompts, and the model learns to somehow make sense of them and answer questions about them. That's a very different domain from general language modeling. They use the LLaVA model, which is exactly that: it learns to take image embeddings, plus some text embeddings, to ask and answer questions about an image, caption it, and so on. Their question is: can we combine the image understanding that's been added to this model with the Japanese understanding of our Japanese model? This is a very tricky task; the Japanese model has never been trained on any image inputs. So that's what they're going to try. They have some visual question-answering datasets, and they create a new one.
So they apply their technique, same as before, doing parameter-space and data-flow-space merging, and, shockingly and impressively, the result does better than either the base vision model or an existing Japanese vision-language model that was trained specifically for this. They get better results than both, on both the existing dataset and the new dataset they create. You have to think about this for a moment to appreciate it: this is a model merged from one model that doesn't really focus on Japanese but does focus on adding image understanding to the base model (Mistral wasn't trained on images, but this LLaVA fine-tune was), and another model that was just fine-tuned on Japanese text and Japanese culture, and we're able to combine them into something that can answer visual questions in Japanese with the appropriate context. Very cool results, and it's nice to see that they simply apply their technique: there's no fine-tuning, no manually trying different scales; they have a technique that can robustly take two input models and figure out an optimal weighting, or at least a really good one, using this evolutionary search. So that's the core of this paper.

In the discussion and conclusions you can see this is something they're very excited about. They have this idea of a whole swarm of different models out in the world, learning different things from different people, improving on different subtasks, with their evolutionary techniques (or other techniques) able to combine all of those individual small models into some larger foundation model that has all of these capabilities. So you can see they have a really cool vision.
I do think this is a really nice paper and a really nice approach, especially compared with some of the existing work. If you go back to the MergeKit citation (I don't have the browser tab open, but we can talk about it briefly), what was happening is that people would look at the Hugging Face Open LLM Leaderboard, pick a couple of models from it, merge them with this easy one-click tool, and submit the result for evaluation. It might get a slightly higher score on the leaderboard, and this was rinsed and repeated to the point where you get a model that's a merge of two other models, each of which is a merge of two other models, which were in turn merges of some base and initial models. So you have this whole lineage, this family tree, behind a model that's 0.1% better than the others, so it sits at the top of the leaderboard as the best 7B model ever. But there are several issues. One, you don't know whether any of those initial models had test-set contamination (I know some of them definitely do), and now everything that's a merge of a merge of a merge of one of those contaminated models inherits that: you're not sure whether the performance is because the model is actually good at the task or because it happened to be trained on some of the test sets. And then you also get overfitting, because these merges are evaluated on the leaderboard, we pick combinations based on which ones do well, evaluate on the same leaderboard again, and pick again. You end up with something that does really well on that leaderboard, but that doesn't necessarily translate to doing well on other tasks.

Whereas the paper we've looked at here is very careful to say: we have our evaluation set that we use for the evolutionary search, and that's separate from the test set; and we also check whether this applies to other similar domains, whether the model still has good knowledge across other Japanese tasks. Basically, is this something that's somewhat general, versus just overfitting to one very small test set and calling that good? So I really enjoyed this paper. Congratulations to the Sakana team; I'm looking forward to seeing what else comes out of this lab. I hope you've enjoyed this deep dive into model merging, evolutionary algorithms, and a really fantastic paper.

Thank you so much for watching.