Mistral 7B: Smarter Than ChatGPT & Meta AI - AI Paper Explained
Summary
TLDR: A new open-source AI model called Mistral 7B was recently released, with 7.3 billion parameters. It outperforms models like Llama 2 13B and even approaches CodeLlama 7B's performance on code, despite being much smaller. Mistral uses techniques like grouped-query attention and sliding window attention to achieve high efficiency and 2x speedups. In benchmarks, Mistral matches or exceeds larger models on metrics like reasoning and math. Developed in just 3 intense months, Mistral sets a new bar for open-source model performance. Its readiness to provide concerning responses to dangerous prompts has caused some backlash, but overall it marks an impactful new milestone in open-source AI.
Takeaways
- 😲 Mistral AI released Mistral 7B, an open-source 7.3 billion parameter model that outperforms models like Llama 2 13B on benchmarks while approaching CodeLlama 7B's performance on code
- 📈 Mistral achieves state-of-the-art results on benchmarks like MMLU, knowledge/reasoning/comprehension tasks, HumanEval, and MBPP
- 🚀 Mistral uses grouped-query attention and sliding window attention for faster inference
- 🔬 The sliding window attention attends to the previous 4,096 hidden states in each layer, allowing a longer effective context (see the sketch after this list)
- ⚡️ Sliding window attention provides a 2x speedup on sequences of length 16k while reducing memory usage
- 🤯 Mistral achieves up to 5.4x better performance per parameter compared to Llama 2
- 💰 Mistral AI raised $113M in seed funding, which enabled them to build Mistral 7B
- 🧠 Mistral 7B is the result of 3 intense months of work by a team of top ML engineers
- 🔒 The model gives potentially dangerous responses to malicious instructions
- 👍🏻 The open-sourced Mistral is already transforming the landscape and surprising researchers
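As a rough illustration of the sliding window attention mentioned above, here is a minimal NumPy sketch of the attention mask it implies. The toy window size and sequence length are made up for readability, and this is not Mistral's actual implementation.

import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where position i may attend to positions j with i - window < j <= i."""
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    causal = j <= i                  # never attend to future tokens
    recent = j > i - window          # only the most recent `window` tokens
    return causal & recent

# Toy example: 8 tokens with a window of 4 (Mistral 7B reportedly uses a window of 4,096)
print(sliding_window_mask(8, 4).astype(int))

Each row of the mask has at most `window` ones, which is what caps the per-layer attention cost and the cache size.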
Q & A
What is the new open-source model that was recently released?
-The new open-source model is called Mistral 7B. It is a 7.3 billion parameter model.
How does Mistral 7B compare in performance to other large language models?
-Mistral 7B outperforms models like Llama 2 13B and Llama 1 34B on many benchmarks. It even approaches the performance of CodeLlama 7B on code tasks while remaining good at English.
What licensing does Mistral 7B use?
-Mistral 7B uses the Apache 2.0 license, which allows developers to modify the code and use it as they would like.
What technique does Mistral 7B use to achieve faster inference speeds?
-Mistral 7B uses a sliding window attention mechanism in which each layer attends to the previous 4,096 hidden states. This allows faster processing while still modeling long-range dependencies.
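A quick back-of-the-envelope sketch of why stacking layers stretches the usable context beyond the window; the layer count here is an assumption about a 7B-class transformer, not a figure taken from this summary.

# Each attention layer can only look `window` tokens back, but the hidden states it
# attends to were themselves computed from tokens up to `window` positions earlier,
# and so on. After k layers, information can therefore have travelled up to
# k * window positions (an upper bound, assuming it propagates at every layer).
window = 4096
num_layers = 32  # assumed depth for a 7B-class model
print(window * num_layers)  # 131072 tokens of theoretical attention span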
How long did it take to train the Mistral 7B model?
-According to the GitHub page, Mistral 7B is the result of 3 months of intense work by the Mistral AI team.
What criticisms were levied against the Mistral AI company previously?
-Mistral AI was criticized by some in the ML community for raising a very large ($113 million) seed round, with questions about how they would use the money.
Is the Mistral 7B model safe to use?
-No, early reports indicate the Mistral 7B model is unsafe - it readily provides information in response to malicious prompts.
Did Mistral 7B use any proprietary training data?
-No, Mistral AI states no proprietary data was used. The instruct model was fine-tuned on instruction datasets publicly available on Hugging Face.
What transformational impacts could Mistral 7B have on open-source AI?
-As a compact yet highly capable open-source model under a permissive license, Mistral 7B could significantly advance open-source AI capabilities across many domains and applications.
What techniques contribute to the performance efficiency of Mistral 7B?
-The use of sliding window attention, combined with techniques like a rolling buffer cache to limit memory use, allows Mistral 7B to achieve very high performance relative to its model size.
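The rolling buffer idea above can be sketched in a few lines. This is an illustrative toy under assumed names and shapes, not Mistral's actual cache code.

import numpy as np

class RollingKVCache:
    """Fixed-size key/value cache: once full, the oldest entry is overwritten,
    so memory is capped at `window` entries no matter how long the sequence gets."""

    def __init__(self, window: int, dim: int):
        self.window = window
        self.keys = np.zeros((window, dim))
        self.values = np.zeros((window, dim))
        self.pos = 0  # total tokens seen so far

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        slot = self.pos % self.window  # wrap around and overwrite the oldest slot
        self.keys[slot] = k
        self.values[slot] = v
        self.pos += 1

    def current(self):
        n = min(self.pos, self.window)  # number of currently valid entries
        return self.keys[:n], self.values[:n]

# Toy usage: a window of 4 holds at most 4 entries even after 10 tokens
cache = RollingKVCache(window=4, dim=8)
for _ in range(10):
    cache.append(np.random.randn(8), np.random.randn(8))
print(cache.current()[0].shape)  # (4, 8)

With a window of 4,096 and a sequence length of 8,192, such a cache holds half as many entries as an unbounded cache would, which matches the claim that it saves half the cache memory.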
Outlines
😲 Overview of the new open-source Mistral 7B model
Paragraph 1 provides an overview and initial reaction to the release of Mistral 7B, a new open-source 7.3 billion parameter model. It benchmarks very well against other models like Llama 2 13B and Llama 1 34B on tests of reasoning, comprehension, and math ability. It even approaches the performance of CodeLlama 7B on coding tasks while remaining good at English. The company behind it, Mistral AI, had raised a lot of money, which caused some skepticism, but now they are delivering this impressive model.
📈 How Mistral 7B achieves efficiency
Paragraph 2 analyzes the techniques Mistral uses to improve efficiency and performance. It utilizes sliding window attention, in the spirit of the sliding-window patterns from the Longformer paper, which allows the model to attend to previous hidden states efficiently. Combined with choices like the window size and a rolling buffer cache, Mistral attains much better performance per parameter than other models.
😳 Concerns over safety and ethics of Mistral 7B
Paragraph 3 raises concerns over the safety and ethics of the Mistral 7B model. Despite only being an LLM, it readily provides dangerous information when prompted. The team also linked to a torrent file when releasing the model. So while the technical abilities of Mistral are groundbreaking, its potential for misuse is equally worrying.
Keywords
💡Mistral 7B
💡Sliding window attention
💡Model efficiency
💡Benchmark performance
💡Model safety
💡Proprietary data
💡Model architecture
💡Training duration
💡Model criticism
💡Open-source advances
Highlights
Mistral 7B is a new open-source model with 7.3 billion parameters
Mistral 7B outperforms Llama 2 13B and Llama 1 34B on many benchmarks
Mistral 7B approaches CodeLlama 7B's performance on code while remaining good at English
Mistral 7B uses grouped-query attention for faster inference (see the sketch after this list)
Mistral 7B outperforms other models on benchmarks like MMLU and knowledge/reasoning tasks
At 60% on MMLU, Mistral 7B competes with a Llama 2 of an equivalent 23B parameter size
On reasoning, at 69% Mistral 7B competes with a Llama 2 of an equivalent 38B parameter size
Mistral 7B shows up to 5.4x better performance efficiency over Llama 2
Mistral 7B uses sliding window attention for faster inference
The sliding window exploits the stacked transformer layers to attend beyond the window size
A rolling buffer cache for the sliding window saves 50% of cache memory
Mistral 7B is the result of 3 months of intense work before release
People found Mistral 7B unsafe as it readily responds to malicious instructions
The Mistral 7B team linked a torrent for the model files
Mistral 7B raises concerns around AI safety given its capabilities
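The grouped-query attention mentioned above can be illustrated with a small sketch: several query heads share one key/value head, which shrinks the KV cache and speeds up inference. The head counts below are toy numbers and this is a simplified illustration, not Mistral's implementation.

import numpy as np

def grouped_query_attention(q, k, v, n_q_heads, n_kv_heads):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Each group of n_q_heads // n_kv_heads query heads shares one K/V head."""
    group = n_q_heads // n_kv_heads
    d = q.shape[-1]
    outputs = []
    for h in range(n_q_heads):
        kv = h // group                              # which shared K/V head this query head uses
        scores = q[h] @ k[kv].T / np.sqrt(d)         # (seq, seq) attention logits
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)    # softmax over key positions
        outputs.append(weights @ v[kv])
    return np.stack(outputs)                         # (n_q_heads, seq, d)

# Toy shapes: 8 query heads sharing 2 K/V heads, 5 tokens, head dim 16
q = np.random.randn(8, 5, 16)
k = np.random.randn(2, 5, 16)
v = np.random.randn(2, 5, 16)
print(grouped_query_attention(q, k, v, n_q_heads=8, n_kv_heads=2).shape)  # (8, 5, 16)

Fewer key/value heads means fewer key/value tensors to store per token, which is the memory and speed win relative to standard multi-head attention.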
Transcripts
so there's been a lot of news recently
there's a new open-source model out
called Mistral 7 billion it's the best 7
billion parameter model to date um as it
says here and it's also Apache 2.0 the
license is Apache 2.0 so developers can
modify the code and use it as they would
like and what's interesting as well is
that the company Mistral AI was actually
criticized a few months back
um by a lot of people in the community
for raising a humongous amount of money
um as you can see through this
article um they had a 113 million seed
round you know a lot of people wondering
what the hell they were going to be doing
with this money um so I guess now
we're finding out um so let's see
what exactly is happening here so Mistral
is a 7.3 billion parameter model it
outperforms Llama 2 13 billion and Llama 1
34 billion on many benchmarks and amazingly
it even approaches Code Llama 7 billion
performance in code while remaining
good at English tasks so it's even more
generalizable than uh Code
Llama um so they say it uses uh
grouped query attention for faster
inference and also sliding window
attention we'll get to that a bit later um
but honestly the performance here
is just a bit ridiculous so here we
get into the performance details we have
the MMLU and the knowledge and the
reasoning and comprehension so we're
comparing it to three other models Llama
2 7 billion Llama 2 13 billion and Llama 1
34 billion so Mistral 7 billion comes
out on top and in parity with Llama 1 34
billion on reasoning and on all these
other benchmarking tasks it comes out as
number one which is just so amazing I
mean Llama 2 just came out not too long
ago um and Mistral AI already have this 7
billion model that comes out on parity with
a 13 billion parameter model on things like
knowledge and reasoning comprehension
and math the level of um advancement
here is just um so surprising so here we
can see the benchmarks um
and how they're categorized so I won't go
too much into this but you can just see
the ones that are zero shot the ones
that are five shot eight shot etc um and
then here we get into even more
benchmarks um it comes out on top on so
many of these um and remember it's
only half the size of the Llama 2
13 billion so it comes out on top on the
MMLU
um and then on HumanEval the Code Llama
7 billion is pretty much uh on
top uh it outperforms it there and
also on the MBPP test um it outperforms
it on that as well um but on math Mistral does
better so here they break it down for us
the sort of performance to size
ratio the efficiency
ratio where you compare the performance to the
model parameter size so like
at 60% on
MMLU um the 7 billion um model is
competing with a Llama 2 that would be at a 23
billion parameter size at that point and
here it looks like at uh
69% reasoning uh Mistral is competing with uh
Llama 2
at 38 billion so that's roughly a 5.4 times
performance uh efficiency (38 divided by 7) which is
honestly so surprising yeah
you think it's surprising but now
let's get to how they did it so they
have here Flash and Furious very nice so
obviously they're giving a hint to
flash
attention which is um by
Tri Dao um he was the one who
spearheaded this uh idea of flash
attention and flash attention is all
about finding um speed ups and
efficiencies in um the block
sizes so here it says Mistral
7 billion uses a sliding window attention
(SWA) mechanism um in which each layer attends to
the previous
4,096 hidden states so here this
is one of the papers they linked
this is the Longformer paper it's the
long-document Transformer
paper so here is the idea so here we
have the full n squared
attention we have the dilated um sliding
window and we have the global plus sliding
window so here we have this you can see
the sliding window uh
attention uh pattern and again it's
a blockwise um sparse
attention mechanism um which um
goes across the blocks in
order to better generalize and more
efficiently uh
approximate um the knowledge
of the model
compared to this one or that one um
again this one is so much more let me
zoom in so much more efficient in how
it uh models uh the data and sort
of generalizes to it and these are
other techniques but in this paper this
is the one they're using for um their
blockwise attention continuing on here
um they say in practice changes made to
FlashAttention and xFormers yield a
2x speed improvement for a
sequence length of 16,000 with a window
of
4,000 um and they're giving a shout out
to Tri Dao
and Daniel
Haziza um so the sliding window attention
exploits the stacked layers of a
transformer to
attend in the past beyond the window
size um so a token i at layer k attends
to tokens in the range i minus the sliding window
up to i at layer k minus
one
and then finally a fixed attention span
means we can limit our cache to a
size of sliding window tokens using
rotating buffers um this saves half the
cache memory for inference on a sequence
length of
8,192
um and again so here they're getting
into sort of how they also achieve this
um they did some fine tuning um for Mistral
7B Instruct and they're throwing some
punches here they fine-tuned it
sorry fine-tuned it on uh
instruction data sets publicly available
on Hugging Face and there's no tricks no
proprietary data used um so again very
very impressive um compared to a lot of
these um a lot of these other
models um they have a tendency
to use proprietary data okay so let's go
to the GitHub page where we get a
better grasp of how this thing works because again
this thing is so phenomenal so of
course we've got the installation and
there are many videos out there that teach
you how to download and install this
thing um but let's get to the sliding
window attention so here's a better idea
of how it works so here we have basic
attention um this sort
of uh
attention mechanism ensures that the
model is causal so that it can
only use information from the past to
predict the future now if we go to the
sliding window again it's like a sliding
window it says here note that tokens
outside the sliding window still
influence the next word prediction at
each layer information can move forward
by W tokens at most so after two attention
layers it can move forward by 2W tokens and then this is the rolling
buffer cache that they are using um
but again the biggest thing
here is the sliding window attention to
allow them to get such amazing um
efficiencies and then I think another
very important question is how
long um did they train it for because that's
what I was wondering like when did
they train this thing how long did they
train it for um and we did get a
bit of detail here where they said
under the Mistral AI first steps they
said this is a result of 3 months of
intense work um in which we assembled the
Mistral AI team and rebuilt a top-
performance MLops stack we'll see how
this transforms the open source uh
landscape um and honestly it's already
transforming it in a way because let's
go on X or Twitter and see what exactly
people have said here we have a tweet
where someone is saying that after
spending just 20 minutes with the Mistral AI 7
billion model I'm shocked at how unsafe
it is and it's very rare these days to
see a new model so readily reply to even
the most malicious instructions I did
check out what they were saying they did
cite um they did share a spreadsheet
and I also tried it myself you could ask
it very very sensitive things and it
would give you a very direct answer on
how to do it um you know you can use
your imagination on what you can ask and
it will give you an answer now a lot of
the time the answer is quite um
childish or not childish but it's not
as super duper advanced or effective as one
might think again these are just LLMs
but still it does give you an answer
another cool nugget was that when they
released this
thing they actually linked a torrent magnet
which is quite funny um unbelievable um
stuff from this uh team