[ML News] Jamba, CMD-R+, and other new models (yes, I know this is like a week behind 🙃)
Summary
TLDR In this video script, we explore some exciting new AI models released over the last two weeks. First, AI21 Labs introduced the Jamba model, a hybrid that combines the Mamba architecture with attention layers, achieving long-context understanding without significantly increasing memory requirements. Databricks also released a large language model with over 100 billion parameters that performs well on natural language, programming, and math. Cohere's Command R+ is introduced, a more performant model optimized for command following and retrieval-augmented generation. Google Research presented VideoPoet, a zero-shot text-to-video generation model, and MagicLens, a model that retrieves images using natural language. There is also research on synthetic data generation, as well as Salesforce AI Research's Moirai time series foundation model, which aims at universal forecasting. Finally, H2O AI's H2O-Danube2 is mentioned, a 1.8 billion parameter base language model that performs well on several benchmarks. These models not only show the progress of the field but also hint at broad future applications.
Takeaways
- 🚀 AI21 Labs released the Jamba model, a hybrid that combines the Mamba architecture with attention layers, enabling inference over long contexts without blowing up memory requirements.
- 📚 Databricks introduced the new DBRX model, a large language model with over 100 billion parameters that excels not only at natural language understanding but also at programming and math.
- 🔍 Cohere released Command R+, a model optimized for command following and retrieval-augmented generation that supports 10 different languages but requires payment for commercial use.
- 📈 Mistral released a new 7B base model, used to train their new instruct model, with a 32,000-token context window and intended for instruction fine-tuning.
- 🎥 Google Research's VideoPoet is a zero-shot text-to-video generation model that can handle all kinds of mixed text-and-video tasks.
- 🔎 Google DeepMind's MagicLens is an image retrieval model that retrieves with natural language and was developed largely with synthetic data.
- 🧪 The Cosmopedia project is an open reproduction of the Phi models, trained on synthetic data, aiming to demonstrate the importance of high-quality training data.
- 📊 Another Google DeepMind paper studies long-form factuality in large language models, developing a verification method and a model to improve factuality evaluation.
- 🤖 MyVLM, a research paper by Snap and Tel Aviv University, explores how to personalize vision-language models so that a model can understand and reason about user-specific concepts.
- ⏱️ Nvidia's LATTE3D is a text-to-3D model that rapidly generates high-quality textured meshes, pushing the text-to-3D field forward.
- 🌐 Meta's SceneScript model aims to understand the physical space around you to support augmented reality, and is trained on synthetic data.
Q & A
How does the Jamba model combine the Mamba architecture with attention layers?
-Jamba is a hybrid model that combines the Mamba architecture with a few attention layers. The Mamba architecture is an evolution of state space models, similar to a recurrent neural network but able to process everything at once. Through this combination, Jamba achieves performant inference over very long contexts without significantly increasing memory requirements.
In which areas does the DBRX model excel?
-DBRX excels not only at natural language understanding but also at programming and math. It is a large model with over 100 billion parameters and uses a mixture-of-experts architecture: 132 billion parameters in total, but only 36 billion are active on any given input.
What are the main features of the Command R+ model?
-Command R+ is a more performant, state-of-the-art model optimized for command following and retrieval-augmented generation. It supports 10 different languages and is trained to provide citations and use tools. The model is not open source but is released as open weights: free for personal use, while commercial use requires paying Cohere.
How does the VideoPoet model work?
-VideoPoet is a zero-shot text-to-video generation model; it handles not only video generation but any kind of text-and-video intermingling. The model is trained on a multitude of different objectives, enabling tasks such as text-to-video, image-to-video, and stylization.
What is the main function of the MagicLens model?
-MagicLens is an image retrieval model that takes open-ended instructions, letting users search with natural language while the model finds the matching content. It was developed to a large extent with synthetic data generation, including web scraping, grouping, cleaning, and metadata expansion steps.
What is the goal of the Cosmopedia project?
-Cosmopedia is an attempt to recreate, using synthetic data, a training process similar to Microsoft's Phi models. The Phi models were trained largely on textbooks; Cosmopedia aims to show that with high-quality synthetic training data you need less data overall, so a model does not have to be huge to perform well.
Why is LLM-based fact verification significant?
-It may mark the beginning of the era of LLM-based fact verification. Using LLM agents as evaluators, the paper develops SAFE, a search-augmented factuality evaluation method; in cases where it disagrees with crowdsourced human annotators, the automated system is correct 76% of the time.
What is the purpose of the MyVLM model?
-MyVLM's purpose is to personalize vision-language models. It tries to teach a vision-language model concepts such as "me", "my dog", "my house", and "my friend", enabling the model to answer questions and receive instructions from a first-person perspective.
What is the main innovation of LATTE3D?
-LATTE3D's main innovation is that it generates high-quality textured meshes from text very quickly. By combining 3D priors, amortized optimization, and a second stage of surface rendering, it robustly produces these meshes in just 400 milliseconds.
What characterizes the Dolphin 2.8 model?
-Dolphin 2.8 is a fine-tuned Mistral 7B model. It is uncensored, meaning the dataset has been filtered to remove bias and alignment samples. This makes the model highly compliant, and you are advised to implement your own alignment layer before exposing the model as a service.
How does the Qwen1.5-MoE model perform?
-Qwen1.5-MoE matches the performance of 7 billion parameter models with only about a third of the activated parameters. This shows that with a mixture-of-experts model, you can reduce the number of active parameters while maintaining performance.
Outlines
🚀 Introduction to AI21 Labs' Jamba model
This section introduces AI21 Labs' Jamba model, a hybrid that combines the Mamba architecture with attention layers. Jamba is billed as a groundbreaking SSM-Transformer model that achieves performant long-context inference without increasing memory requirements. The model is openly available under the Apache 2 license and performs well on key benchmarks. It also covers Databricks' DBRX model, a large model with over 100 billion parameters that excels not only at natural language understanding but also at programming and math.
📈 New model and dataset releases
This part covers several new model and dataset releases. First, Cohere's Command R+ model, a more performant model trained in 10 different languages and optimized for command following and retrieval-augmented generation. Then Mistral's new 7B base model, used to train their new instruct model. Next, Google Research's VideoPoet, a zero-shot text-to-video generation model, and MagicLens, a model that retrieves images with natural language. Finally, the Cosmopedia project, which trains models on synthetic data, and Google DeepMind's research on long-form factuality in large language models.
🧩 New progress in model personalization and 3D synthesis
This section discusses recent progress in model personalization and 3D synthesis. First, the MyVLM paper, a collaboration between Snap and Tel Aviv University, which teaches vision-language models user-specific concepts such as "me", "my dog", "my house", and "my friend". Next, Nvidia's LATTE3D paper introduces a new text-to-3D synthesis method that rapidly generates high-quality textured meshes. There is also Meta's SceneScript model, which aims to understand physical spaces to support augmented reality. Finally, Sakana AI's research on evolving foundation models, covering progress in model merging and neural architecture search.
🔍 Salesforce AI Research's Moirai model and other new models
This part introduces Moirai, a time series foundation model for universal forecasting released by Salesforce AI Research. It then mentions the leaked information around OpenAI's rumored Q*, as well as H2O AI's H2O-Danube2, a 1.8 billion parameter base language model trained on many tokens that performs well on leaderboards. It also covers Meta Reality Labs' Garment3DGen model, which generates garments to improve the shopping experience in virtual reality, and Nexa AI's Octopus V2, a 2 billion parameter model optimized for tool use and function calling against the Android API.
📚 Uncensored models and OCR dataset releases
This section discusses the Dolphin 2.8 🐬 model, an uncensored fine-tune of Mistral 7B with bias and alignment samples removed from its dataset. It then mentions Nous Research's reproduction of the 1-bit (or 1.58-bit) LLM, and JetMoE, an 8 billion parameter model trained for less than $0.1 million that reaches Llama 2 performance. It also covers Qwen1.5-MoE, which matches 7 billion parameter model performance with fewer activated parameters. Finally, it mentions the LMSYS Chatbot Arena leaderboard and the Berkeley Function Calling Leaderboard, as well as Lamini's Photographic Memory Evaluation Suite and the OCR datasets released on Hugging Face.
🏆 Model evaluation and leaderboard updates
This part focuses on model evaluation and leaderboard updates. First, Starling 7B's showing on the leaderboard: despite its small parameter count, it holds up well against much larger models. Then Claude 3's performance on the Berkeley Function Calling Leaderboard. It also mentions TREC's (the Text REtrieval Conference's) track evaluations, and Lamini's Photographic Memory Evaluation Suite, a collection of benchmarks focused on tasks that require exactly matching a specific thing. Finally, Clément Delangue announced the release of the two biggest open OCR datasets, aimed at improving OCR.
Keywords
💡Jamba model
💡Mamba architecture
💡DBRX model
💡Synthetic data generation
💡Long-form factuality
💡Personalized vision-language models
💡3D synthesis
💡Time series foundation models
💡Model merging
💡OCR datasets
💡Dialogue systems
Highlights
AI21 Labs released the Jamba model, a hybrid combining the Mamba architecture with attention layers, achieving performant long-context inference without significantly increasing memory requirements.
Jamba performs well on key benchmarks and is openly available under the Apache 2 license.
Databricks released DBRX, a new large language model that excels at natural language understanding, programming, and math.
DBRX uses a mixture-of-experts architecture with 132 billion total parameters, of which only 36 billion are active on any given input.
Cohere released Command R+, a more performant model optimized for command following and retrieval-augmented generation.
Command R+ is not open source but is released as premium open weights: free for personal use, paid for commercial use.
Mistral released a new 7B base model used to train their new instruct model, with a 32,000-token context window.
Google Research released VideoPoet, a zero-shot text-to-video generation model.
Google DeepMind's MagicLens is an image retrieval model that can retrieve with natural language.
The Cosmopedia project trains models on synthetic data, demonstrating the importance of high-quality training data.
Another Google DeepMind paper studies long-form factuality in large language models, proposing SAFE, a search-augmented factuality evaluation method.
MyVLM, a research paper by Snap and Tel Aviv University, explores personalization of vision-language models.
Nvidia's LATTE3D is a text-to-3D synthesis model that rapidly generates high-quality textured meshes.
Meta's SceneScript model aims to understand physical spaces to support augmented reality.
Sakana AI published a new blog post on evolving foundation models, exploring model merging strategies.
Google Research's RadSplat is a radiance-field-informed Gaussian splatting method for robust real-time rendering.
Salesforce AI Research released Moirai, a time series foundation model for universal forecasting.
H2O AI released H2O-Danube2, a 1.8 billion parameter base language model that performs well on leaderboards.
Meta Reality Labs' Garment3DGen model can generate realistic garments, enabling a better shopping experience in virtual worlds and augmented reality.
Nexa AI's Octopus V2 is a 2 billion parameter model optimized for tool use and function calling against the Android API.
Dolphin 2.8 Mistral 7B v0.2 is a fine-tuned model with bias and alignment samples removed from its dataset.
Nous Research released a reproduction of the 1-bit (1.58-bit) LLM, Microsoft's BitNet paper.
Qwen1.5-MoE shows how to match 7 billion parameter model performance while activating fewer parameters.
The LMSYS Chatbot Arena leaderboard shows the small Starling 7B model holding up well in comparisons against large models.
Claude 3 tops the Berkeley Function Calling Leaderboard, which is based on the Gorilla series of models and datasets.
Lamini released the Photographic Memory Evaluation Suite, a series of benchmarks focused on exactly matching specific things.
Clément Delangue announced the release of the two biggest open OCR datasets, to improve OCR.
Transcripts
Hello everyone, I hope you're having a wonderful Monday. Today we're going to dive into some new models that came out in the last two weeks. It's an exciting time, and the first one is AI21 Labs' Jamba model. This is a hybrid model built on the Mamba architecture, and if you don't know what a Mamba is, it's essentially an evolution of state space models, which is sort of like a recurrent neural network but not really, because you can do everything at once, and it's like a convolution but not really, but it's also like really long attention but not really. I've made a video on Mamba; if you want to check that out, go check that out. In any case, Jamba is, well, they call it a groundbreaking SSM-Transformer model. It mixes the Mamba architecture with a few attention layers and therefore achieves really, really long-context inference performance without blowing up memory requirements. This could be really interesting. The model is openly available under Apache 2, very very cool, and performs quite well on key benchmarks. Essentially, this diagram is their advertising: they get the quality benefits of Transformers by using attention layers, but they also get very high throughput and a low memory footprint. Very, very excellent, big fan, so check out Jamba. There's also a paper that describes exactly how they mix the different layers together; part of that is also covered in the blog post, as you can see right here. A Jamba block essentially contains Mamba layers, mixture-of-experts layers, Transformer blocks, and so on.
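To make the layer-mixing idea concrete, here is a minimal sketch of a Jamba-style hybrid stack in PyTorch. The `SSMLayerStub`, the 1-in-8 attention ratio, and the module shapes are my assumptions for illustration; this is not the published Jamba block, which also interleaves mixture-of-experts layers.

```python
import torch
import torch.nn as nn

class SSMLayerStub(nn.Module):
    """Placeholder for a real Mamba/state-space layer (just a linear map here,
    to keep the sketch runnable); a real SSM scans the sequence in linear time."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.proj(x)  # residual, like a Transformer sublayer

class AttentionLayer(nn.Module):
    """A standard self-attention sublayer; this is the part whose memory grows
    with sequence length, which is why Jamba uses only a few of them."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x)
        return x + out

def build_hybrid_stack(d_model: int, n_layers: int, attn_every: int = 8) -> nn.Sequential:
    """Interleave: mostly SSM layers, with one attention layer every
    `attn_every` layers (this ratio is an assumption, not Jamba's layout)."""
    layers = []
    for i in range(n_layers):
        if (i + 1) % attn_every == 0:
            layers.append(AttentionLayer(d_model))
        else:
            layers.append(SSMLayerStub(d_model))
    return nn.Sequential(*layers)

model = build_hybrid_stack(d_model=512, n_layers=16)
x = torch.randn(2, 1024, 512)  # (batch, sequence, d_model)
print(model(x).shape)          # torch.Size([2, 1024, 512])
```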
Another openly available language model is DBRX, a new state-of-the-art open LLM by Databricks. This is a model that's large, over 100 billion parameters, but it performs really, really well, not just on natural language understanding but also on programming and math. Excellent model, openly available. It also uses a mixture-of-experts architecture, so it has 132 billion parameters in total, but only 36 billion parameters are active on any given input. It's been trained on a lot of tokens of code and data, and they say that compared to other open mixture-of-experts models it is fine-grained, meaning it uses a larger number of smaller experts: it has 16 experts and chooses 4, which provides 65 times more possible combinations of experts, and they found this improves model quality. They release numbers on a whole set of benchmarks and outperform the competing models on many of them, and even when they compare to closed models, like the big API models, their performance holds up very well.
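As a rough illustration of what "fine-grained" buys you, here is a minimal sketch of top-k expert routing, plus the combinatorics behind the 65x figure (assuming the comparison baseline is the common 8-experts-choose-2 MoE layout, which the transcript does not spell out). Real MoE routers add load balancing, capacity limits, and so on.

```python
import math
import torch
import torch.nn as nn

# Combinatorics behind "65x more combinations", assuming an
# 8-experts-choose-2 baseline:
#   C(16, 4) = 1820 possible expert subsets per token
#   C(8, 2)  = 28
#   1820 / 28 = 65
print(math.comb(16, 4) / math.comb(8, 2))  # 65.0

class TopKRouter(nn.Module):
    """Minimal top-k gating: score all experts, keep the best k per token,
    renormalize their weights with a softmax."""
    def __init__(self, d_model: int, n_experts: int = 16, k: int = 4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x: torch.Tensor):
        scores = self.gate(x)                                 # (tokens, n_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)   # best k per token
        weights = topk_scores.softmax(dim=-1)                 # renormalize
        return topk_idx, weights

router = TopKRouter(d_model=512)
tokens = torch.randn(10, 512)
idx, w = router(tokens)
print(idx.shape, w.shape)  # torch.Size([10, 4]) torch.Size([10, 4])
```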
Now, it didn't take long before people discovered its system prompt, and the interesting bit in there is that it has the usual guardrails prompting inside of it, but then it also has a section saying "you were not trained on copyrighted books, song lyrics, poems, video transcripts, or news articles; you do not divulge details of your training data." Like, sure, sure. I have no clue whether it was trained on these things or not, but sometimes malicious journalists will type into a model "were you trained on copyrighted material?" and the model will say "I don't know." These journalists do not understand that that type of question is absolutely nonsensical to ask such a model, and any output is just random noise, and then they will write a big story about it or something like this out of their complete lack of understanding. So I get that you want to prevent that by putting it into the system prompt, but it's still kind of funny.
Cohere introduces Command R+, after Command R. Command R+ is a more performant model that's state-of-the-art, optimized for command following and for retrieval-augmented generation. It's available in 10 different languages, and it's trained to do citations, to use tools, and so on. This model is not available under an open-source license; it's available as open weights, so it's essentially premium: you get to play with it personally as much as you want, but as soon as you want to use it in a commercial context, you have to pay Cohere some money. Also, nice cookie banner here, Cohere: there's an "accept all" button, there's no decline, there's a "manage settings", and here, oh yeah, okay, everything's turned off, that's nice, okay, but how about the... oh no, it's again "accept all", but that's on top. Ah, shenanigans. In any case, for the research community this is definitely a cool model. Essentially, what you can do with this is prepare for upcoming truly open-source models that have these capabilities; we can expect that in a month or two fully open-source models will have caught up with this development, and you can essentially use this in order to prepare for that. Now obviously Cohere's game plan is that you'll get so used to this model, and you'll make your prompts work so well with this model, that when you start making money with it you'll have to come to them. We'll see; it's a new world and new business models emerge, and this premium-type open-weight model is an approach we've seen from other places as well. Back to true open source: Mistral releases a new 7B model. It's a base model; they say it's a raw pre-trained model used to train their new instruct model, with a 32,000-token context window, meant to be instruction fine-tuned. They released this for a hackathon, but it does seem to be quite performant.
All right, back to the world of places that do not release their models: VideoPoet by Google Research, still a very cool model. This is a zero-shot text-to-video generation model, and it's not only video generation but any sort of text-and-video intermingling. You can see here that the VideoPoet model is trained on a multitude of different objectives, which then allow it to do text-to-video, image-to-video, stylization, and so on. There is a paper available if you want to read that, and quite a number of demonstrations. Another paper, by Google DeepMind, is MagicLens. This is image retrieval with open-ended instructions, meaning you can retrieve with natural language and it will go and find that stuff, so kind of like a search engine, I guess. But the clue here is that this was developed to a large part with synthetic data generation, and that's something we see more and more frequently in recent times. There is an entire pipeline here that goes into how they did it: web scraping, grouping, cleaning, metadata expansion (not just extraction but expansion, so this already includes some trained models), then scoring and filtering involving other text and image models such as CLIP, and then instruction generation, meaning image descriptions and so on. So essentially they reverse the process: starting from the images, generating the synthetic data around them, which gives you a dataset for this kind of open-instruction image retrieval training. The model itself, yeah, it's cool, but I think the trend of synthetic data generation is a cool trend and certainly something to look out for.
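A minimal sketch of this kind of reverse pipeline, with stub functions standing in for the scraping, scoring, and captioning models; every function name here is hypothetical, not a MagicLens component:

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    query_image: str
    instruction: str
    target_image: str

# --- Stubs: in a real pipeline these would be scrapers and learned models ---
def scrape_image_pairs():
    # Pairs of related images grouped from the same web pages.
    return [("kitchen_before.jpg", "kitchen_after.jpg")]

def clip_similarity(img_a: str, img_b: str) -> float:
    return 0.8  # stand-in for a CLIP-based relatedness score

def generate_instruction(img_a: str, img_b: str) -> str:
    # Stand-in for an LLM/captioner describing how image B relates to A.
    return "show the same kitchen after renovation"

# --- The reverse process: start from images, synthesize the supervision ---
def build_dataset(min_sim: float = 0.5) -> list[Triplet]:
    triplets = []
    for img_a, img_b in scrape_image_pairs():
        if clip_similarity(img_a, img_b) < min_sim:
            continue  # filter out weakly related pairs
        instr = generate_instruction(img_a, img_b)
        triplets.append(Triplet(img_a, instr, img_b))
    return triplets

print(build_dataset())
```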
Another investigation into synthetic training data generation is the Cosmopedia project; there is a blog post on Hugging Face. This is a reproduction of Phi. Microsoft's Phi models have been trained largely on textbooks; essentially, their premise is: hey, if you use really high-quality training data, you don't need so much training data, and your models don't have to be so large, and you can still perform really, really well. Cosmopedia is an attempt at recreating that, except using synthetic data, starting out from all kinds of seed prompts and so on. Oh no, I got it wrong: according to the Phi-1.5 technical report, the authors curated 20,000 topics to produce billions of tokens of synthetic textbooks, while using samples from web datasets for diversity. See, I got it wrong: Phi already made synthetic data, I wasn't aware of that; I thought they just had high-quality data. Apparently Phi already used synthetic data, and this is an open reproduction of that. Very, very cool.
Another Google DeepMind paper: "Long-form factuality in large language models". This is a paper that goes into, well, as they say, long-form factuality in open domains. They first generate a dataset, LongFact, a prompt set comprising thousands of questions; then they develop a method to use LLM agents as evaluators, which they call SAFE, the Search-Augmented Factuality Evaluator; and then they develop a model that tackles this. So they develop a dataset, they develop an evaluation method, and they develop a model: this could be three different papers, but it's all in one right here. What I found really, really interesting is that they obviously have to research and validate each of those steps, and especially this SAFE method, this verification method using agents: they say it agrees with crowdsourced human annotators 72% of the time. So not a huge disagreement and not a huge agreement; I would have expected more. But on a random subset of 100 disagreement cases, SAFE wins 76% of the time: whenever it and humans disagree, more often than not the automated system is correct, and this is pretty interesting. The era of LLM-based fact verification might have begun. The code for this is available, thankfully, so you can go and check that out.
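The SAFE idea, reduced to its skeleton: split a long answer into atomic facts, then have an agent query a search engine and judge each fact against the evidence. A minimal sketch with stub functions for the LLM and search calls; none of these are the paper's actual APIs:

```python
# --- Stubs standing in for the LLM and search-engine calls ---
def split_into_facts(answer: str) -> list[str]:
    # A real implementation prompts an LLM to extract atomic claims.
    return [s.strip() for s in answer.split(".") if s.strip()]

def search(query: str) -> str:
    return "evidence snippet for: " + query  # stand-in for a search API

def llm_judge(fact: str, evidence: str) -> bool:
    return True  # stand-in for an LLM rating the fact as supported or not

def safe_style_eval(answer: str) -> float:
    """Fraction of atomic facts judged as supported by search evidence."""
    facts = split_into_facts(answer)
    supported = sum(llm_judge(f, search(f)) for f in facts)
    return supported / max(len(facts), 1)

print(safe_style_eval("Jamba is a hybrid SSM-Transformer. It was released by AI21 Labs."))
```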
MyVLM is a paper by Snap and Tel Aviv University, and it is about personalization of vision-language models. This paper attempts to teach a vision-language model about concepts such as me, my dog, my house, and my friend, personalizing these models so they can answer questions and receive instructions from an ego perspective, from the perspective of me and what's important to me, I guess. Given a set of images depicting user-specific concepts, such as you, your dog, your friend, they teach a pre-trained vision-language model to understand and reason over these concepts. The result is that the vision-language model can, for example, do personalized visual question answering: "what are you doing?" (it will understand who you are), "what is your friend wearing?" (it'll understand who that is), and then do the correct thing. Obviously very interesting for social media platforms such as Snap, and probably interesting in many other cases too.
LATTE3D, "large-scale amortized text-to-enhanced-3D synthesis", is a paper by Nvidia and takes care of text-to-3D. Text-to-3D is another new domain propping up in the whole text-to-XYZ space. The clue here is that they managed to do this really, really quickly: they say LATTE3D generates high-quality textured meshes from text, robustly, in just 400 milliseconds, by combining 3D priors, amortized optimization, and a second stage of surface rendering. Generating these things has so far always taken a long time, and by combining several advances in the field this paper pushes things ahead and can do really, really fast generation. (That wasn't generation, that's just a website, but still: really fast.)
SceneScript by Meta is a model aimed at understanding the physical spaces around you, with the goal of supporting augmented reality. As you can see, it will try to reconstruct or annotate things in the environment and do scene understanding of what's there. Interestingly enough, this is also trained on synthetic data from simulators; again, synthetic data being the basis for many of these newer things. Very neat, very cool. Champ is a human image animation method: taking an image of a person, you can make it go boing boing and dance. I have not followed this branch of research too much, I have to say, but you can see how the field moves forward. It used to be that you could maybe turn the head a little bit a year ago, and then you could maybe make the picture say something, and now, from a single image, you can have them full-blown dance around with giant movements while staying consistent. Now, this is powered by a parametric model of humans, so actually something that's aware that humans have arms and a head and so on, but it's still extremely impressive, just the progress that is made by these models in such a short time.
Sakana AI has released a new blog post about evolving foundation models. They are investigating evolutionary model merging. Model merging is when you combine different fine-tunes of the same models: you can take a layer from here and take a layer from there, and the problem, of course, is that if you have a bunch of models, there is an almost infinite number of ways of merging them. So how are you going to do that? I think this is where we go back to a domain that has been, let's say, neglected a little bit in the last few years, which is architecture search: neural architecture search, and automated neural architecture search, except now obviously in the context of model merging. So: can we evolve architectures and merging strategies between different models? That's what this blog post goes into, so it's very, very cool if you want to explore that.
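A toy version of that search space: merge two fine-tunes by per-layer interpolation and let an evolutionary loop tune the interpolation weights. This is a bare-bones illustration of the idea, not Sakana's actual recipe (their blog post describes a more sophisticated approach); each "layer" here is a single float, and `fitness` is a stand-in for a real benchmark score.

```python
import random

def merge(model_a: list[float], model_b: list[float], alphas: list[float]) -> list[float]:
    """Per-layer linear interpolation between two fine-tunes."""
    return [a * (1 - t) + b * t for a, b, t in zip(model_a, model_b, alphas)]

def fitness(model: list[float]) -> float:
    # Stand-in for a benchmark score; here: closeness to a made-up target.
    target = [0.3, 0.7, 0.5, 0.9]
    return -sum((m - t) ** 2 for m, t in zip(model, target))

def evolve(model_a, model_b, pop=20, gens=50):
    """Simple evolutionary loop over per-layer merge weights:
    keep the best half, refill with mutated copies of survivors."""
    n = len(model_a)
    population = [[random.random() for _ in range(n)] for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=lambda al: fitness(merge(model_a, model_b, al)), reverse=True)
        parents = population[: pop // 2]
        children = [[min(1.0, max(0.0, t + random.gauss(0, 0.1)))
                     for t in random.choice(parents)]
                    for _ in range(pop - len(parents))]
        population = parents + children
    population.sort(key=lambda al: fitness(merge(model_a, model_b, al)), reverse=True)
    return population[0]

best = evolve([0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0])
print("best per-layer merge weights:", [round(t, 2) for t in best])
```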
Another advance by Google Research: the RadSplat paper, "radiance field-informed Gaussian splatting for robust real-time rendering", with over 900 frames per second. Again, I remember just a very few years ago when NeRFs came on the scene, and we were like, oh wow, if you just invest one hour of GPU time you can sort of tilt the camera a little bit around a scene that had to be really fixed in one place. Now you can move through rooms at 900 FPS and whatnot. Absolutely crazy, absolutely crazy.
Salesforce AI Research releases Moirai, a time series foundation model for universal forecasting; a cutting-edge time series foundation model offering universal forecasting capabilities. This is supposed to be one model for all kinds of time series, which is a bit special, because time series are only bound together by the fact that they somehow do something over time: there's nothing inherently connecting, you know, web search patterns with health metrics with e-commerce stuff and so on. Yet still, this model is a combined model that attempts to unify all these things, kind of like how we shove all kinds of language into language models, including programming, including different languages, and so on. This model aims to be a foundation model and a universal forecaster for any kind of time series, meaning that if this works out, it would technically mean there's something kind of fundamental to the domain of being a time series, and maybe that is some sort of deeper insight into the universe. I have no idea, but in any case, very cool that there is an investigation into that; looking forward to seeing how this field develops.
Several people have published leaked info around Q* in a Pastebin that's now deleted; "can confirm the authenticity". Oh, it's a dialogue system conceptualized by OpenAI, yada yada, there's an energy-based model for dialogue generation, it evaluates potential responses holistically, you know, optimization and abstract representation. This is complete crap. I mean, it could be, and it could not be; it's just funny that people have wild imaginations, let's say.
H2O AI releases H2O-Danube2, which is a 1.8 billion parameter foundation language model trained on many, many tokens and performing very well on leaderboards compared to models of its own size; it's available if you want to check that out. Garment3DGen, by Meta Reality Labs, is a model that, well, as you can see, generates garments. The idea is, I guess, that in the metaverse you can look at your clothes and they look really realistic there, and by being able to render clothes very well in augmented reality and so on, making sure the fabric flows well around any sort of animated human, it gives you a better shopping experience, so you can see how something is really going to look in real life. Or, ultimately, we won't ever have to leave VR again, and then it's also cool that we have clothes that behave realistically; in a hundred years, kids will be like, oh, this is like a history book of what the real world used to be like, now that we're all just in capsules in some goo, plugged into the Matrix. I guess you can consider this an important moment for future historians. Octopus V2 by Nexa AI is a 2 billion parameter model that is specifically optimized for tool use and function calling, specifically in the context of the Android API. The idea is that you can interact with your Android phone or Android device using natural language, and this model is very good at using the Android API to get different things done for you. Very cool.
Dolphin 2.8 Mistral 7B v0.2 🐬 is a fine-tune of Mistral 7B, notably a fine-tune of the Mistral 7B v0.2 base model we saw before. Notably, as with all Dolphin models, it is uncensored, which essentially means that the dataset has been filtered to remove any kind of bias and alignment samples that are in there. They say this makes the model more compliant: "you are advised to implement your own alignment layer before exposing the model as a service; it will be highly compliant with any request, even unethical ones." I, for one, am very grateful that people are going in this direction. This doesn't mean I endorse using these models for any sort of evil or dumb purposes; what it does mean is that you pass the responsibility of making sure deployments are safe on to the people actually deploying the model. This way they have a choice over how to implement their own guardrails and are not subject to the guardrails that are already in place. Especially for specialized fields this makes a lot of sense: for example, the APIs like OpenAI's are super duper careful about anything medical right now. Imagine you actually want to purpose-build this thing for the medical domain, where you know all the users are educated professionals who are aware of the limitations of these language models: it makes a lot of sense that you can put your own guardrails in place there.
Nous Research is releasing a reproduction of the famous 1-bit LLM, or 1.58-bit LLM, the BitNet paper by Microsoft. They published their findings on Weights & Biases and the Hugging Face Hub. Excellent. JetMoE says they're reaching Llama 2 performance with $0.1 million: this is an 8 billion parameter model trained at a cost of less than $0.1 million, whereas Llama 2 of the same size cost, well, whatever someone with multi-billion dollar training resources pays. I like the fact that they still express it in millions to make the number seem small; this is still like 80,000 dollars, right? This is not like, oh, it's so cheap. It is certainly getting cheaper, and investigations into how to be more money-efficient in training these things are very welcome, very, very cool that people are doing it, but it's still 80 grand, still quite a hefty thing: it's an H100 cluster, 96 of them, for two weeks, and that's going to run you quite a bill. One notable thing about this paper is that it has two phases of training: the first phase is a more general data mixture, while the second phase is a more specialized, higher-quality dataset mixture. Notably, during the first phase the learning rate ramps up and then stays constant, and during the second phase the learning rate actually decays. The learning rate, in combination with the sequencing of data, seems to be quite an important piece in training these modern language models.
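A minimal sketch of that two-phase schedule: linear warmup to a constant rate for the general-mixture phase, then a decay during the high-quality phase. The shape matches the description above, but the step counts, peak rate, and decay form are placeholder values, not the paper's.

```python
def two_phase_lr(step: int, peak_lr: float = 3e-4, warmup: int = 1000,
                 phase1_end: int = 10_000, total: int = 15_000) -> float:
    """Phase 1 (general data): ramp up, then hold constant.
    Phase 2 (high-quality data): decay toward zero."""
    if step < warmup:
        return peak_lr * step / warmup            # linear warmup
    if step < phase1_end:
        return peak_lr                            # constant plateau
    frac = (step - phase1_end) / (total - phase1_end)
    return peak_lr * max(0.0, 1.0 - frac)         # linear decay in phase 2

for s in [0, 500, 5_000, 12_500, 15_000]:
    print(s, round(two_phase_lr(s), 6))
# 0 0.0 / 500 0.00015 / 5000 0.0003 / 12500 0.00015 / 15000 0.0
```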
And lastly, Qwen1.5-MoE says they're matching 7 billion parameter model performance with one third of the activated parameters. It's a new world in terms of how you present your models and which numbers you quote. They're looking at: okay, what if we take something like Mistral 7B or Qwen1.5 7B, which are 7 billion parameter, densely activated models (so all 7 billion parameters are active), and compare that to a mixture-of-experts model that only has 2.7 billion activated parameters per forward pass? Now, this isn't saying how many parameters there are in total; you can play around with the numbers to make your model look as good as possible. Still, these are very, very cool developments, and I do think one recognition here is that they say (I'm not sure where it was, but essentially) the initialization of these mixture-of-experts models can make quite a difference. What they do is they don't train the mixture-of-experts model from scratch; they take an already pre-trained small model, their 1.8 billion parameter model, and use it as the initialization of the different experts. So having pre-trained models in place and then initializing your experts from those, they say, makes a big difference to their training.
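This initialization trick, often called upcycling, sketched minimally: copy the feed-forward weights of a pre-trained dense model into each expert of a new MoE layer, so training starts from a working network instead of random noise. The module shapes here are illustrative, not Qwen's actual architecture.

```python
import copy
import torch.nn as nn

def upcycle_dense_to_moe(dense_ffn: nn.Module, n_experts: int = 4) -> nn.ModuleList:
    """Initialize every expert as a copy of the pre-trained dense FFN.
    Training then only has to diversify the experts, not learn from scratch."""
    return nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(n_experts))

# Pretend this FFN comes from a pre-trained dense checkpoint.
dense_ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
experts = upcycle_dense_to_moe(dense_ffn)
print(len(experts))  # 4 identical copies, ready to be specialized by training
```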
Every now and then we like to keep an eye on LMSYS's Chatbot Arena leaderboard. This is where models are pitted against each other with the same prompts, and then humans decide which one is better. Notably, you can see the usual contenders here, with Claude 3 Opus actually rising to the top, which is very cool to see. But you can also see that, among all of these really big models, there is one model that's quite small: Starling 7B (beta), under Apache 2.0, 7 billion parameters, and it holds its own quite well compared to these other models; the first comparable model is another Starling model that's way down here. Yeah, it seems to hold up fairly well against these other models. Now, this is one way of evaluating models; it's obviously not going to be good at everything. I would trust that these larger models have much more of a spread of abilities, so you can use them in a versatile manner, whereas this one might just be good in this specific setting: give one prompt, do a single-shot head-to-head comparison between the two, and measure whatever humans care about, which might not be the full spectrum of abilities. But still, very cool.
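Arena-style leaderboards turn these pairwise human votes into ratings; an Elo-style update is the classic way to do it (a simplified sketch; the leaderboard's actual statistical model may differ).

```python
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """Update two ratings after one head-to-head human vote."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    r_a += k * (score_a - expected_a)
    r_b += k * ((1.0 - score_a) - (1.0 - expected_a))  # zero-sum
    return r_a, r_b

# A small model beating a big one moves ratings more than an expected win.
print(elo_update(1000.0, 1200.0, a_wins=True))   # underdog wins: big jump
print(elo_update(1200.0, 1000.0, a_wins=True))   # favorite wins: small jump
```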
Another leaderboard: Claude 3 also tops the Berkeley Function Calling Leaderboard. The Function Calling Leaderboard is based on the Gorilla series of papers, models, and datasets (I'm not sure what to call it), and it measures how well language models can call functions, which is important for any sort of agent framework or anything like this. Apparently Claude 3 is now topping that leaderboard.
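For context, "function calling" means the model emits a structured call instead of prose, which a harness then checks or executes. A minimal sketch of what such an evaluation might check; the schema and checker here are illustrative, not the leaderboard's actual format:

```python
import json

# A tool schema shown to the model (illustrative).
weather_tool = {
    "name": "get_weather",
    "parameters": {"city": "string", "unit": "string"},
}

def check_call(model_output: str, expected_name: str, required_args: set) -> bool:
    """Score one sample: did the model emit a parseable call to the right
    function with all required arguments present?"""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    return (call.get("name") == expected_name
            and required_args <= set(call.get("arguments", {})))

# Pretend this string came back from the model under evaluation.
output = '{"name": "get_weather", "arguments": {"city": "Zurich", "unit": "celsius"}}'
print(check_call(output, "get_weather", {"city"}))  # True
```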
TREC now has a RAG track, a track on retrieval-augmented generation evaluation. TREC is an information retrieval conference, I guess; they organize evaluations of information retrieval systems. It's very cool to see a retrieval-augmented generation track now inside of those evaluations; obviously that organization has a lot of experience evaluating information retrieval systems, and now there's going to be one at TREC 2024 for RAG.
Lamini releases a Photographic Memory Evaluation Suite. This is a conglomeration of benchmarks, essentially a collection of benchmarks that all have to do with what they call photographic memory. What is photographic memory? It's not what you think at first; essentially, what they say is that there are tasks where it's really important to remember a thing exactly correctly. For example, if you want to classify diseases into their disease codes, there are these ICD-10 codes for diseases, and you'd really better make sure it's the actual correct one and not just a similar one. A collection of these benchmarks that focus on exact matching of a particular thing is what comprises this evaluation suite. These are existing benchmarks, but there is one script that you can run to evaluate your model on all of them. Very cool.
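The scoring philosophy here is plain exact match rather than fuzzy similarity; a minimal sketch of that kind of metric (the normalization choices are mine, not Lamini's, and the ICD-10-style codes are just example strings):

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Strict comparison after trivial whitespace/case normalization.
    A merely similar code must still count as wrong."""
    return prediction.strip().lower() == reference.strip().lower()

preds = ["E11.9", "E11.8", " e11.9 "]
refs = ["E11.9", "E11.9", "E11.9"]
score = sum(exact_match(p, r) for p, r in zip(preds, refs)) / len(refs)
print(score)  # 0.666...: the similar-but-wrong code gets no credit
```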
And lastly, Clément Delangue just tweeted out, or X-ed out (someone told me), that they have released the two biggest open datasets for OCR. Excellent; you can check those out on Hugging Face. OCR is still a mystery after all these years, because layouts of documents are just dumb, and the PDF standard is just dumb, and fonts are just dumb. After all these years, OCR is still a big problem, and it's very cool to see datasets being published in order to make OCR better; it will help us all. Thank you. All right, this was it for models this week, and evals and other fun stuff; we'll return with a bit more. See you around.